3.1.3.1 Using "formulas" to specify statistical models in Python
3.1.3.1.1 A simple linear regression
Given two sets of observations, x and y, we want to test the hypothesis that y is a linear function of x. In other words:
$y = x * coef + intercept + e$
where $e$ is observation noise. We will use the statsmodels module to:
- Fit a linear model. We will use the simplest strategy, ordinary least squares (OLS).
- Test that the coefficient is non-zero.
First, we generate simulated data according to the model:
In [9]:
import numpy as np
import pandas
x = np.linspace(-5, 5, 20)
np.random.seed(1)
# normally distributed noise
y = -5 + 3*x + 4 * np.random.normal(size=x.shape)
# Create a data frame containing all the relevant variables
data = pandas.DataFrame({'x': x, 'y': y})
"Formulas" for statistics in Python
Then we specify an OLS model and fit it:
In [10]:
from statsmodels.formula.api import ols
model = ols("y ~ x", data).fit()
We can inspect the various statistics derived from the fit:
In [26]:
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.804
Model: OLS Adj. R-squared: 0.794
Method: Least Squares F-statistic: 74.03
Date: Wed, 18 Nov 2015 Prob (F-statistic): 8.56e-08
Time: 17:10:03 Log-Likelihood: -57.988
No. Observations: 20 AIC: 120.0
Df Residuals: 18 BIC: 122.0
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------- ------------- ------------- ------------- ------------- -------------
Intercept -5.5335 1.036 -5.342 0.000 -7.710 -3.357
x 2.9369 0.341 8.604 0.000 2.220 3.654
==============================================================================
Omnibus: 0.100 Durbin-Watson: 2.956
Prob(Omnibus): 0.951 Jarque-Bera (JB): 0.322
Skew: -0.058 Prob(JB): 0.851
Kurtosis: 2.390 Cond. No. 3.03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Terminology:
Statsmodels uses statistical terminology: the y variable in statsmodels is called 'endogenous' while the x variable is called 'exogenous'. This is discussed in more detail here.
To simplify, y (endogenous) is the value you are trying to predict, while x (exogenous) represents the features you are using to make this prediction.
Exercise
Retrieve the estimated parameters from the model above. Hint: use tab-completion to find the relevant attribute.
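One possible solution sketch, reusing the simulated data from above (the `params` attribute is one of the relevant attributes that tab-completion on the fitted model will reveal):

```python
import numpy as np
import pandas
from statsmodels.formula.api import ols

# Recreate the simulated data from the section above
x = np.linspace(-5, 5, 20)
np.random.seed(1)
y = -5 + 3 * x + 4 * np.random.normal(size=x.shape)
data = pandas.DataFrame({'x': x, 'y': y})

model = ols("y ~ x", data).fit()
# `params` holds the estimated coefficients as a pandas Series,
# indexed by the names that appear in the summary table
print(model.params)
print(model.params['x'])  # a single coefficient can be indexed by name
```

Other fitted attributes, such as `bse` (standard errors) or `pvalues`, follow the same naming.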
3.1.3.1.2 Categorical variables: comparing groups or multiple categories
Let us go back to the brain size data:
In [27]:
data = pandas.read_csv('examples/brain_size.csv', sep=';', na_values=".")
We can write a comparison between the IQ of male and female using a linear model:
In [28]:
model = ols("VIQ ~ Gender + 1", data).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: VIQ R-squared: 0.015
Model: OLS Adj. R-squared: -0.010
Method: Least Squares F-statistic: 0.5969
Date: Wed, 18 Nov 2015 Prob (F-statistic): 0.445
Time: 17:34:10 Log-Likelihood: -182.42
No. Observations: 40 AIC: 368.8
Df Residuals: 38 BIC: 372.2
Df Model: 1
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------- ------------- ------------- ------------- ------------- -----------------
Intercept 109.4500 5.308 20.619 0.000 98.704 120.196
Gender[T.Male] 5.8000 7.507 0.773 0.445 -9.397 20.997
==============================================================================
Omnibus: 26.188 Durbin-Watson: 1.709
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3.703
Skew: 0.010 Prob(JB): 0.157
Kurtosis: 1.510 Cond. No. 2.62
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Tips on specifying models
Forcing categorical: 'Gender' is automatically detected as a categorical variable, and thus each of its different values is treated as a different entity. Using:
In [29]:
model = ols('VIQ ~ C(Gender)', data).fit()
an integer column can be forced to be treated as categorical.
Intercept: We can remove the intercept using - 1 in the formula, or force the use of an intercept using + 1.
By default, statsmodels treats a categorical variable with K possible values as K-1 'dummy' variables (the last level being absorbed into the intercept term). This is almost always a good default choice; however, it is possible to specify different encodings for categorical variables (http://statsmodels.sourceforge.net/devel/contrasts.html).
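These options can be seen on a small made-up dataset (the column names here are invented for illustration):

```python
import numpy as np
import pandas
from statsmodels.formula.api import ols

rng = np.random.RandomState(0)
# 'group' is an integer-coded category with K = 3 levels
df = pandas.DataFrame({
    'group': [0, 0, 1, 1, 2, 2, 0, 1, 2, 0],
    'score': rng.normal(size=10),
})

# C() forces the integer column to be treated as categorical:
# the fit has an intercept plus K - 1 = 2 dummy variables
m_default = ols('score ~ C(group)', df).fit()
print(m_default.params.index.tolist())

# '- 1' removes the intercept: each of the K levels gets its own term
m_no_intercept = ols('score ~ C(group) - 1', df).fit()
print(m_no_intercept.params.index.tolist())
```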
A t-test on the difference between FSIQ and PIQ
To compare different types of IQ, we need to create a "long-form" table, in which the IQ type is identified by a categorical variable:
In [30]:
data_fisq = pandas.DataFrame({'iq': data['FSIQ'], 'type': 'fsiq'})
data_piq = pandas.DataFrame({'iq': data['PIQ'], 'type': 'piq'})
data_long = pandas.concat((data_fisq, data_piq))
print(data_long)
iq type
0 133 fsiq
1 140 fsiq
2 139 fsiq
3 133 fsiq
4 137 fsiq
5 99 fsiq
6 138 fsiq
7 92 fsiq
8 89 fsiq
9 133 fsiq
10 132 fsiq
11 141 fsiq
12 135 fsiq
13 140 fsiq
14 96 fsiq
15 83 fsiq
16 132 fsiq
17 100 fsiq
18 101 fsiq
19 80 fsiq
20 83 fsiq
21 97 fsiq
22 135 fsiq
23 139 fsiq
24 91 fsiq
25 141 fsiq
26 85 fsiq
27 103 fsiq
28 77 fsiq
29 130 fsiq
.. ... ...
10 124 piq
11 128 piq
12 124 piq
13 147 piq
14 90 piq
15 96 piq
16 120 piq
17 102 piq
18 84 piq
19 86 piq
20 86 piq
21 84 piq
22 134 piq
23 128 piq
24 102 piq
25 131 piq
26 84 piq
27 110 piq
28 72 piq
29 124 piq
30 132 piq
31 137 piq
32 110 piq
33 86 piq
34 81 piq
35 128 piq
36 124 piq
37 94 piq
38 74 piq
39 89 piq
[80 rows x 2 columns]
In [31]:
model = ols("iq ~ type", data_long).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: iq R-squared: 0.003
Model: OLS Adj. R-squared: -0.010
Method: Least Squares F-statistic: 0.2168
Date: Wed, 18 Nov 2015 Prob (F-statistic): 0.643
Time: 18:16:40 Log-Likelihood: -364.35
No. Observations: 80 AIC: 732.7
Df Residuals: 78 BIC: 737.5
Df Model: 1
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------- ------------- ------------- ------------- ------------- --------------
Intercept 113.4500 3.683 30.807 0.000 106.119 120.781
type[T.piq] -2.4250 5.208 -0.466 0.643 -12.793 7.943
==============================================================================
Omnibus: 164.598 Durbin-Watson: 1.531
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8.062
Skew: -0.110 Prob(JB): 0.0178
Kurtosis: 1.461 Cond. No. 2.62
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We can see that we obtain the same value of the t-statistic, and the same corresponding p-value for the effect of iq type, as with the previous t-test:
In [32]:
from scipy import stats
stats.ttest_ind(data['FSIQ'], data['PIQ'])
Out[32]:
(0.46563759638096403, 0.64277250094148408)
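As a self-contained check (on synthetic data, since the brain-size file may not be at hand), the coefficient t-test from a two-group OLS fit and scipy's independent t-test give the same statistic:

```python
import numpy as np
import pandas
from scipy import stats
from statsmodels.formula.api import ols

rng = np.random.RandomState(42)
a = rng.normal(100, 15, size=40)  # plays the role of FSIQ
b = rng.normal(105, 15, size=40)  # plays the role of PIQ

# Long-form table, as in the section above
data_long = pandas.concat((
    pandas.DataFrame({'iq': a, 'type': 'fsiq'}),
    pandas.DataFrame({'iq': b, 'type': 'piq'}),
))
model = ols('iq ~ type', data_long).fit()

# The t statistic on the 'type' coefficient equals the pooled-variance
# two-sample t statistic (up to the direction of the comparison)
t_ols = model.tvalues['type[T.piq]']
t_scipy, p_scipy = stats.ttest_ind(b, a)
print(t_ols, t_scipy)
```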