Chapter 05 Boolean Indexing - 1. Calculating Boolean Statistics - Pandas Cookbook (Annotated Source Code)

1. Calculating Boolean Statistics

#  Import pandas, then read the movie dataset with movie_title as the row index
 In[1]: import pandas as pd
 In[2]: pd.options.display.max_columns = 50
 In[3]: movie = pd.read_csv('data/movie.csv', index_col='movie_title')
        movie.head()
Out[3]:

#  Test whether each movie runs longer than two hours
 In[4]: movie_2_hours = movie['duration'] > 120
        movie_2_hours.head(10)
Out[4]: movie_title
        Avatar                                         True
        Pirates of the Caribbean: At World's End       True
        Spectre                                        True
        The Dark Knight Rises                          True
        Star Wars: Episode VII - The Force Awakens    False
        John Carter                                    True
        Spider-Man 3                                   True
        Tangled                                       False
        Avengers: Age of Ultron                        True
        Harry Potter and the Half-Blood Prince         True
        Name: duration, dtype: bool

#  Count the movies that run longer than two hours
 In[5]: movie_2_hours.sum()
Out[5]: 1039

#  Proportion of movies that run longer than two hours
 In[6]: movie_2_hours.mean()
Out[6]: 0.21135069161920261
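
Both sum() and mean() work on a boolean Series because True is counted as 1 and False as 0. A minimal sketch of that idea follows; the durations are made-up values for illustration and are not taken from movie.csv.

```python
import pandas as pd

# Hypothetical durations in minutes (illustrative only, not from movie.csv)
duration = pd.Series([178, 169, 148, 100, 136], name='duration')

# Comparing a Series with a scalar yields a boolean Series
over_2_hours = duration > 120

# True counts as 1 and False as 0, so sum() is the number of True values
count_true = over_2_hours.sum()

# mean() is that count divided by the number of rows, i.e. the share of True
share_true = over_2_hours.mean()
```

Here four of the five values exceed 120, so count_true is 4 and share_true is 0.8.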

#  Use describe() to print summary information about this boolean Series
 In[7]: movie_2_hours.describe()
Out[7]: count      4916
        unique        2
        top       False
        freq       3877
        Name: duration, dtype: object
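
For a boolean Series, describe() reports count, unique, top (the most frequent value) and freq (how often it occurs) rather than numeric statistics. In Out[7], freq / count = 3877 / 4916, which is roughly 0.7886, the share of False values. The sketch below shows the same relationship on a small made-up Series; the name over_2_hours and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical boolean Series (illustrative only)
flags = pd.Series([True, False, False, False], name='over_2_hours')

summary = flags.describe()

n_rows = summary['count']        # number of non-missing values
most_common = summary['top']     # the most frequent value
n_most_common = summary['freq']  # how many times it occurs

# When the most frequent value is False, the share of True
# is whatever remains after removing the False share
share_true = 1 - n_most_common / n_rows
```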

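#  The duration column actually contains missing values. A comparison against a missing value evaluates to False, so to get the true proportion of movies longer than two hours, drop the missing values first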
 In[8]: movie['duration'].dropna().gt(120).mean()
Out[8]: 0.21199755152009794
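
The two proportions differ because a comparison with NaN evaluates to False, so rows with a missing duration still count in the denominator unless they are dropped first. A small sketch with made-up values (not taken from movie.csv) makes the difference visible:

```python
import numpy as np
import pandas as pd

# Hypothetical durations with one missing value (illustrative only)
duration = pd.Series([150.0, 90.0, np.nan, 130.0], name='duration')

# NaN > 120 is False, so the missing row is counted as "not over two hours"
naive_share = (duration > 120).mean()

# Dropping missing values first removes that row from the denominator
clean_share = duration.dropna().gt(120).mean()
```

naive_share divides 2 by 4 rows, while clean_share divides 2 by the 3 rows that actually have a duration.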

How it works

#  Compute the proportions of False and True values
 In[9]: movie_2_hours.value_counts(normalize=True)
Out[9]: False    0.788649
        True     0.211351
        Name: duration, dtype: float64
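
The True share reported by value_counts(normalize=True) is the same number that mean() returns for the boolean Series. The sketch below checks this on a made-up Series; converting the result to a dict is only a convenience chosen here for looking up the True and False shares.

```python
import pandas as pd

# Hypothetical boolean Series (illustrative only)
flags = pd.Series([True, False, False, False])

# normalize=True returns relative frequencies instead of raw counts
shares = flags.value_counts(normalize=True).to_dict()

true_share = shares[True]     # same value as flags.mean()
false_share = shares[False]   # same value as 1 - flags.mean()
```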

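#  Compare two columns of the same DataFrame: after dropping rows with missing values, how often does actor 1 have more Facebook likes than actor 2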
 In[10]: actors = movie[['actor_1_facebook_likes', 'actor_2_facebook_likes']].dropna()
         (actors['actor_1_facebook_likes'] > actors['actor_2_facebook_likes']).mean()
Out[10]: 0.97776871303283708
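
Comparing two columns works row by row and again yields a boolean Series, so the same sum() and mean() tricks apply. The sketch below reuses the column names from movie.csv, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical like counts (illustrative only, not from movie.csv)
likes = pd.DataFrame({
    'actor_1_facebook_likes': [1000.0, 40000.0, 11000.0, np.nan, 500.0],
    'actor_2_facebook_likes': [936.0, 5000.0, 393.0, 200.0, 800.0],
})

# Drop rows where either column is missing so that NaN comparisons
# do not silently count as False
actors = likes.dropna()

# Row-wise comparison of the two columns gives a boolean Series
actor_1_ahead = actors['actor_1_facebook_likes'] > actors['actor_2_facebook_likes']

share_actor_1_ahead = actor_1_ahead.mean()
```

One row is dropped for its missing value, and actor 1 leads in three of the remaining four rows, so share_actor_1_ahead is 0.75.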
