3.1.3.1 Using "formulas" to specify statistical models in Python
3.1.3.1.1 A simple linear regression
Given two sets of observations, x and y, we want to test the hypothesis that y is a linear function of x. In other words:
$y = x * coef + intercept + e$
where $e$ is observation noise. We will use the statsmodels module to:
- Fit a linear model. We will use the simplest strategy, ordinary least squares (OLS).
- Test that the coefficient is non-zero.
First, we generate simulated data according to the model:
In [9]:
import numpy as np
import pandas
x = np.linspace(-5, 5, 20)
np.random.seed(1)
# normally distributed noise
y = -5 + 3*x + 4 * np.random.normal(size=x.shape)
# Create a data frame containing all the relevant variables
data = pandas.DataFrame({'x': x, 'y': y})
"Formulas" for statistics in Python
Then we specify an OLS model and fit it:
In [10]:
from statsmodels.formula.api import ols
model = ols("y ~ x", data).fit()
We can inspect the various statistics derived from the fit:
In [26]:
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.804
Model: OLS Adj. R-squared: 0.794
Method: Least Squares F-statistic: 74.03
Date: Wed, 18 Nov 2015 Prob (F-statistic): 8.56e-08
Time: 17:10:03 Log-Likelihood: -57.988
No. Observations: 20 AIC: 120.0
Df Residuals: 18 BIC: 122.0
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------- ------------- ------------- ------------- ------------- -------------
Intercept -5.5335 1.036 -5.342 0.000 -7.710 -3.357
x 2.9369 0.341 8.604 0.000 2.220 3.654
==============================================================================
Omnibus: 0.100 Durbin-Watson: 2.956
Prob(Omnibus): 0.951 Jarque-Bera (JB): 0.322
Skew: -0.058 Prob(JB): 0.851
Kurtosis: 2.390 Cond. No. 3.03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Terminology:
Statsmodels uses statistical terminology: the y variable in statsmodels is called 'endogenous' while the x variable is called 'exogenous'. This is discussed in more detail here.
To simplify, y (endogenous) is the value you are trying to predict, while x (exogenous) represents the features you are using to make this prediction.
Exercise
Retrieve the estimated parameters from the model above. Hint: use tab-completion to find the relevant attribute.
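One possible solution sketch, reusing the simulated data from above (the `params` attribute is one of the relevant attributes that tab-completion on the fitted model will reveal):

```python
import numpy as np
import pandas
from statsmodels.formula.api import ols

# Recreate the simulated data from the section above
x = np.linspace(-5, 5, 20)
np.random.seed(1)
y = -5 + 3 * x + 4 * np.random.normal(size=x.shape)
data = pandas.DataFrame({'x': x, 'y': y})

model = ols("y ~ x", data).fit()
# `params` holds the estimated coefficients as a pandas Series,
# indexed by the names that appear in the summary table
print(model.params)
print(model.params['x'])  # a single coefficient can be indexed by name
```

Other fitted attributes, such as `bse` (standard errors) or `pvalues`, follow the same naming.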
3.1.3.1.2 Categorical variables: comparing groups or multiple categories
Let us go back to the brain size data:
In [27]:
data = pandas.read_csv('examples/brain_size.csv', sep=';', na_values=".")
We can write a comparison between the IQ of male and female using a linear model:
In [28]:
model = ols("VIQ ~ Gender + 1", data).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: VIQ R-squared: 0.015
Model: OLS Adj. R-squared: -0.010
Method: Least Squares F-statistic: 0.5969
Date: Wed, 18 Nov 2015 Prob (F-statistic): 0.445
Time: 17:34:10 Log-Likelihood: -182.42
No. Observations: 40 AIC: 368.8
Df Residuals: 38 BIC: 372.2
Df Model: 1
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------- ------------- ------------- ------------- ------------- -----------------
Intercept 109.4500 5.308 20.619 0.000 98.704 120.196
Gender[T.Male] 5.8000 7.507 0.773 0.445 -9.397 20.997
==============================================================================
Omnibus: 26.188 Durbin-Watson: 1.709
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3.703
Skew: 0.010 Prob(JB): 0.157
Kurtosis: 1.510 Cond. No. 2.62
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Tips on specifying models
Forcing categorical: 'Gender' is automatically detected as a categorical variable, and thus each of its different values is treated as a different entity. Using:
In [29]:
model = ols('VIQ ~ C(Gender)', data).fit()
an integer column can be forced to be treated as categorical.
Intercept: We can remove the intercept using - 1 in the formula, or force the use of an intercept using + 1.
By default, statsmodels treats a categorical variable with K possible values as K-1 'dummy' variables (the last level being absorbed into the intercept term). This is almost always a good default choice; however, it is possible to specify different encodings for categorical variables (http://statsmodels.sourceforge.net/devel/contrasts.html).
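These options can be seen on a small made-up dataset (the column names here are invented for illustration):

```python
import numpy as np
import pandas
from statsmodels.formula.api import ols

rng = np.random.RandomState(0)
# 'group' is an integer-coded category with K = 3 levels
df = pandas.DataFrame({
    'group': [0, 0, 1, 1, 2, 2, 0, 1, 2, 0],
    'score': rng.normal(size=10),
})

# C() forces the integer column to be treated as categorical:
# the fit has an intercept plus K - 1 = 2 dummy variables
m_default = ols('score ~ C(group)', df).fit()
print(m_default.params.index.tolist())

# '- 1' removes the intercept: each of the K levels gets its own term
m_no_intercept = ols('score ~ C(group) - 1', df).fit()
print(m_no_intercept.params.index.tolist())
```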
A t-test on the difference between FSIQ and PIQ
To compare different types of IQ, we need to create a "long-form" table, in which the IQ type is identified by a categorical variable:
In [30]:
data_fisq = pandas.DataFrame({'iq': data['FSIQ'], 'type': 'fsiq'})
data_piq = pandas.DataFrame({'iq': data['PIQ'], 'type': 'piq'})
data_long = pandas.concat((data_fisq, data_piq))
print(data_long)
iq type
0 133 fsiq
1 140 fsiq
2 139 fsiq
3 133 fsiq
4 137 fsiq
5 99 fsiq
6 138 fsiq
7 92 fsiq
8 89 fsiq
9 133 fsiq
10 132 fsiq
11 141 fsiq
12 135 fsiq
13 140 fsiq
14 96 fsiq
15 83 fsiq
16 132 fsiq
17 100 fsiq
18 101 fsiq
19 80 fsiq
20 83 fsiq
21 97 fsiq
22 135 fsiq
23 139 fsiq
24 91 fsiq
25 141 fsiq
26 85 fsiq
27 103 fsiq
28 77 fsiq
29 130 fsiq
.. ... ...
10 124 piq
11 128 piq
12 124 piq
13 147 piq
14 90 piq
15 96 piq
16 120 piq
17 102 piq
18 84 piq
19 86 piq
20 86 piq
21 84 piq
22 134 piq
23 128 piq
24 102 piq
25 131 piq
26 84 piq
27 110 piq
28 72 piq
29 124 piq
30 132 piq
31 137 piq
32 110 piq
33 86 piq
34 81 piq
35 128 piq
36 124 piq
37 94 piq
38 74 piq
39 89 piq
[80 rows x 2 columns]
In [31]:
model = ols("iq ~ type", data_long).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: iq R-squared: 0.003
Model: OLS Adj. R-squared: -0.010
Method: Least Squares F-statistic: 0.2168
Date: Wed, 18 Nov 2015 Prob (F-statistic): 0.643
Time: 18:16:40 Log-Likelihood: -364.35
No. Observations: 80 AIC: 732.7
Df Residuals: 78 BIC: 737.5
Df Model: 1
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------- ------------- ------------- ------------- ------------- --------------
Intercept 113.4500 3.683 30.807 0.000 106.119 120.781
type[T.piq] -2.4250 5.208 -0.466 0.643 -12.793 7.943
==============================================================================
Omnibus: 164.598 Durbin-Watson: 1.531
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8.062
Skew: -0.110 Prob(JB): 0.0178
Kurtosis: 1.461 Cond. No. 2.62
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We can see that we obtain the same value of the t-statistic, and the same corresponding p-value for the effect of iq type, as with the previous t-test:
In [32]:
from scipy import stats
stats.ttest_ind(data['FSIQ'], data['PIQ'])
Out[32]:
(0.46563759638096403, 0.64277250094148408)
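As a self-contained check (on synthetic data, since the brain-size file may not be at hand), the coefficient t-test from a two-group OLS fit and scipy's independent t-test give the same statistic:

```python
import numpy as np
import pandas
from scipy import stats
from statsmodels.formula.api import ols

rng = np.random.RandomState(42)
a = rng.normal(100, 15, size=40)  # plays the role of FSIQ
b = rng.normal(105, 15, size=40)  # plays the role of PIQ

# Long-form table, as in the section above
data_long = pandas.concat((
    pandas.DataFrame({'iq': a, 'type': 'fsiq'}),
    pandas.DataFrame({'iq': b, 'type': 'piq'}),
))
model = ols('iq ~ type', data_long).fit()

# The t statistic on the 'type' coefficient equals the pooled-variance
# two-sample t statistic (up to the direction of the comparison)
t_ols = model.tvalues['type[T.piq]']
t_scipy, p_scipy = stats.ttest_ind(b, a)
print(t_ols, t_scipy)
```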