概率统计方法

概率统计方法

简介

Python 中常用的统计工具有 Numpy, Pandas, PyMC, StatsModels 等。

Scipy 中的子库 scipy.stats 中包含很多统计上的方法。

导入 numpy 和 matplotlib：

In [1]:

%pylab inline

Populating the interactive namespace from numpy and matplotlib

In [2]:

heights = array([1.46, 1.79, 2.01, 1.75, 1.56, 1.69, 1.88, 1.76, 1.88, 1.78])

Numpy 自带简单的统计方法：

In [3]:

print 'mean, ', heights.mean()
print 'min, ', heights.min()
print 'max, ', heights.max()
print 'standard deviation, ', heights.std()

mean,  1.756
min,  1.46
max,  2.01
standard deviation,  0.150811140172

导入 Scipy 的统计模块：

In [4]:

import scipy.stats.stats as st

其他统计量：

In [5]:

print 'median, ', st.nanmedian(heights)    # 忽略nan值之后的中位数
print 'mode, ', st.mode(heights)           # 众数及其出现次数
print 'skewness, ', st.skew(heights)       # 偏度
print 'kurtosis, ', st.kurtosis(heights)   # 峰度
print 'and so many more...'

median,  1.77
mode,  (array([ 1.88]), array([ 2.]))
skewness,  -0.393524456473
kurtosis,  -0.330672097724
and so many more...

概率分布

常见的连续概率分布有：

均匀分布
正态分布
学生t分布
F分布
Gamma分布
… 离散概率分布：
伯努利分布
几何分布
… 这些都可以在 scipy.stats 中找到。

连续分布

正态分布

以正态分布为例，先导入正态分布：

In [6]:

from scipy.stats import norm

它包含四类常用的函数：

norm.cdf 返回对应的累计分布函数值
norm.pdf 返回对应的概率密度函数值
norm.rvs 产生指定参数的随机变量
norm.fit 返回给定数据下，各参数的最大似然估计（MLE）值从正态分布产生500个随机点：

In [7]:

x_norm = norm.rvs(size=500)
type(x_norm)

Out[7]:

numpy.ndarray

直方图：

In [8]:

h = hist(x_norm)
print 'counts, ', h[0]
print 'bin centers', h[1]

counts,  [   7.   21.   42.   97.  120.   91.   64.   38.   17.    3.]
bin centers [-2.68067801 -2.13266147 -1.58464494 -1.0366284  -0.48861186  0.05940467
  0.60742121  1.15543774  1.70345428  2.25147082  2.79948735]

归一化直方图（用出现频率代替次数），将划分区间变为 20（默认 10）：

In [9]:

h = hist(x_norm, normed=True, bins=20)

在这组数据下，正态分布参数的最大似然估计值为：

In [10]:

x_mean, x_std = norm.fit(x_norm)
 
print 'mean, ', x_mean
print 'x_std, ', x_std

mean,  -0.0426135499965
x_std,  0.950754110144

将真实的概率密度函数与直方图进行比较：

In [11]:

h = hist(x_norm, normed=True, bins=20)
 
x = linspace(-3,3,50)
p = plot(x, norm.pdf(x), 'r-')

导入积分函数：

In [12]:

from scipy.integrate import trapz

通过积分，计算落在某个区间的概率大小：

In [13]:

x1 = linspace(-2,2,108)
p = trapz(norm.pdf(x1), x1) 
print '{:.2%} of the values lie between -2 and 2'.format(p)
 
fill_between(x1, norm.pdf(x1), color = 'red')
plot(x, norm.pdf(x), 'k-')

95.45% of the values lie between -2 and 2

Out[13]:

[<matplotlib.lines.Line2D at 0x15cbb8d0>]

默认情况，正态分布的参数为均值0，标准差1，即标准正态分布。

可以通过 loc 和 scale 来调整这些参数，一种方法是调用相关函数时进行输入：

In [14]:

p = plot(x, norm.pdf(x, loc=0, scale=1))
p = plot(x, norm.pdf(x, loc=0.5, scale=2))
p = plot(x, norm.pdf(x, loc=-0.5, scale=.5))

另一种则是将 loc, scale 作为参数直接输给 norm 生成相应的分布：

In [15]:

p = plot(x, norm(loc=0, scale=1).pdf(x))
p = plot(x, norm(loc=0.5, scale=2).pdf(x))
p = plot(x, norm(loc=-0.5, scale=.5).pdf(x))

其他连续分布

In [16]:

from scipy.stats import lognorm, t, dweibull

支持与 norm 类似的操作，如概率密度函数等。

不同参数的对数正态分布：

In [17]:

x = linspace(0.01, 3, 100)
 
plot(x, lognorm.pdf(x, 1), label='s=1')
plot(x, lognorm.pdf(x, 2), label='s=2')
plot(x, lognorm.pdf(x, .1), label='s=0.1')
 
legend()

Out[17]:

<matplotlib.legend.Legend at 0x15781c88>

不同的韦氏分布：

In [18]:

x = linspace(0.01, 3, 100)
 
plot(x, dweibull.pdf(x, 1), label='s=1, constant failure rate')
plot(x, dweibull.pdf(x, 2), label='s>1, increasing failure rate')
plot(x, dweibull.pdf(x, .1), label='0<s<1, decreasing failure rate')
 
legend()

Out[18]:

<matplotlib.legend.Legend at 0xaa9bc50>

不同自由度的学生 t 分布：

In [19]:

x = linspace(-3, 3, 100)
 
plot(x, t.pdf(x, 1), label='df=1')
plot(x, t.pdf(x, 2), label='df=2')
plot(x, t.pdf(x, 100), label='df=100')
plot(x[::5], norm.pdf(x[::5]), 'kx', label='normal')
 
legend()

Out[19]:

<matplotlib.legend.Legend at 0x164582e8>

离散分布

导入离散分布：

In [20]:

from scipy.stats import binom, poisson, randint

离散分布没有概率密度函数，但是有概率质量函数。

离散均匀分布的概率质量函数（PMF）：

In [21]:

high = 10
low = -10
 
x = arange(low, high+1, 0.5)
p = stem(x, randint(low, high).pmf(x))  # 杆状图

二项分布：

In [22]:

num_trials = 60
x = arange(num_trials)
 
plot(x, binom(num_trials, 0.5).pmf(x), 'o-', label='p=0.5')
plot(x, binom(num_trials, 0.2).pmf(x), 'o-', label='p=0.2')
 
legend()

Out[22]:

<matplotlib.legend.Legend at 0x1738a198>

泊松分布：

In [23]:

x = arange(0,21)
 
plot(x, poisson(1).pmf(x), 'o-', label=r'$\lambda$=1')
plot(x, poisson(4).pmf(x), 'o-', label=r'$\lambda$=4')
plot(x, poisson(9).pmf(x), 'o-', label=r'$\lambda$=9')
 
legend()

Out[23]:

<matplotlib.legend.Legend at 0x1763e320>

自定义离散分布

导入要用的函数：

In [24]:

from scipy.stats import rv_discrete

一个不均匀的骰子对应的离散值及其概率：

In [25]:

xk = [1, 2, 3, 4, 5, 6]
pk = [.3, .35, .25, .05, .025, .025]

定义离散分布：

In [26]:

loaded = rv_discrete(values=(xk, pk))

此时， loaded 可以当作一个离散分布的模块来使用。

产生两个服从该分布的随机变量：

In [27]:

loaded.rvs(size=2)

Out[27]:

array([3, 1])

产生100个随机变量，将直方图与概率质量函数进行比较：

In [28]:

samples = loaded.rvs(size=100)
bins = linspace(.5,6.5,7)
 
hist(samples, bins=bins, normed=True)
stem(xk, loaded.pmf(xk), markerfmt='ro', linefmt='r-')

Out[28]:

<Container object of 3 artists>

假设检验

导入相关的函数：

正态分布
独立双样本 t 检验，配对样本 t 检验，单样本 t 检验
学生 t 分布 t 检验的相关内容请参考：
百度百科-t 检验：http://baike.baidu.com/view/557340.htm
维基百科-学生 t 检验：https://en.wikipedia.org/wiki/Student%27s_t-test

In [29]:

from scipy.stats import norm
from scipy.stats import ttest_ind, ttest_rel, ttest_1samp
from scipy.stats import t

独立样本 t 检验

两组参数不同的正态分布：

In [30]:

n1 = norm(loc=0.3, scale=1.0)
n2 = norm(loc=0, scale=1.0)

从分布中产生两组随机样本：

In [31]:

n1_samples = n1.rvs(size=100)
n2_samples = n2.rvs(size=100)

将两组样本混合在一起：

In [32]:

samples = hstack((n1_samples, n2_samples))

最大似然参数估计：

In [33]:

loc, scale = norm.fit(samples)
n = norm(loc=loc, scale=scale)

比较：

In [34]:

x = linspace(-3,3,100)
 
hist([samples, n1_samples, n2_samples], normed=True)
plot(x, n.pdf(x), 'b-')
plot(x, n1.pdf(x), 'g-')
plot(x, n2.pdf(x), 'r-')

Out[34]:

[<matplotlib.lines.Line2D at 0x17ca7278>]

独立双样本 t 检验的目的在于判断两组样本之间是否有显著差异：

In [35]:

t_val, p = ttest_ind(n1_samples, n2_samples)
 
print 't = {}'.format(t_val)
print 'p-value = {}'.format(p)

t = 0.868384594123
p-value = 0.386235148899

p 值小，说明这两个样本有显著性差异。

配对样本 t 检验

配对样本指的是两组样本之间的元素一一对应，例如，假设我们有一组病人的数据：

In [36]:

pop_size = 35
 
pre_treat = norm(loc=0, scale=1)
n0 = pre_treat.rvs(size=pop_size)

经过某种治疗后，对这组病人得到一组新的数据：

In [37]:

effect = norm(loc=0.05, scale=0.2)
eff = effect.rvs(size=pop_size)
 
n1 = n0 + eff

新数据的最大似然估计：

In [38]:

loc, scale = norm.fit(n1)
post_treat = norm(loc=loc, scale=scale)

画图：

In [39]:

fig = figure(figsize=(10,4))
 
ax1 = fig.add_subplot(1,2,1)
h = ax1.hist([n0, n1], normed=True)
p = ax1.plot(x, pre_treat.pdf(x), 'b-')
p = ax1.plot(x, post_treat.pdf(x), 'g-')
 
ax2 = fig.add_subplot(1,2,2)
h = ax2.hist(eff, normed=True)

独立 t 检验：

In [40]:

t_val, p = ttest_ind(n0, n1)
 
print 't = {}'.format(t_val)
print 'p-value = {}'.format(p)

t = -0.347904839913
p-value = 0.728986322039

高 p 值说明两组样本之间没有显著性差异。

配对 t 检验：

In [41]:

t_val, p = ttest_rel(n0, n1)
 
print 't = {}'.format(t_val)
print 'p-value = {}'.format(p)

t = -1.89564459709
p-value = 0.0665336223673

配对 t 检验的结果说明，配对样本之间存在显著性差异，说明治疗时有效的，符合我们的预期。

p 值计算原理

p 值对应的部分是下图中的红色区域，边界范围由 t 值决定。

In [42]:

my_t = t(pop_size) # 传入参数为自由度，这里自由度为50
 
p = plot(x, my_t.pdf(x), 'b-')
lower_x = x[x<= -abs(t_val)]
upper_x = x[x>= abs(t_val)]
 
p = fill_between(lower_x, my_t.pdf(lower_x), color='red')
p = fill_between(upper_x, my_t.pdf(upper_x), color='red')

原文: https://nbviewer.jupyter.org/github/lijin-THU/notes-python/blob/master/04-scipy/04.03-statistics-with-scipy.ipynb

04.03 概率统计方法