13.3 statsmodels介绍
- 估计线性模型
- 估计时间序列过程

13.3 statsmodels介绍

statsmodels是Python进行拟合多种统计模型、进行统计试验和数据探索可视化的库。Statsmodels包含许多经典的统计方法，但没有贝叶斯方法和机器学习模型。

statsmodels包含的模型有：

线性模型，广义线性模型和健壮线性模型
线性混合效应模型
方差（ANOVA）方法分析
时间序列过程和状态空间模型
广义矩估计

下面，我会使用一些基本的statsmodels工具，探索Patsy公式和pandasDataFrame对象如何使用模型接口。

估计线性模型

statsmodels有多种线性回归模型，包括从基本（比如普通最小二乘）到复杂（比如迭代加权最小二乘法）的。

statsmodels的线性模型有两种不同的接口：基于数组和基于公式。它们可以通过API模块引入：

import statsmodels.api as sm
import statsmodels.formula.api as smf

为了展示它们的使用方法，我们从一些随机数据生成一个线性模型：

def dnorm(mean, variance, size=1):
    if isinstance(size, int):
        size = size,
    return mean + np.sqrt(variance) * np.random.randn(*size)
# For reproducibility
np.random.seed(12345)
N = 100
X = np.c_[dnorm(0, 0.4, size=N),
          dnorm(0, 0.6, size=N),
          dnorm(0, 0.2, size=N)]
eps = dnorm(0, 0.1, size=N)
beta = [0.1, 0.3, 0.5]
y = np.dot(X, beta) + eps

这里，我使用了“真实”模型和可知参数beta。此时，dnorm可用来生成正态分布数据，带有特定均值和方差。现在有：

In [66]: X[:5]
Out[66]: 
array([[-0.1295, -1.2128,  0.5042],
       [ 0.3029, -0.4357, -0.2542],
       [-0.3285, -0.0253,  0.1384],
       [-0.3515, -0.7196, -0.2582],
       [ 1.2433, -0.3738, -0.5226]])
In [67]: y[:5]
Out[67]: array([ 0.4279, -0.6735, -0.0909, -0.4895,-0.1289])

像之前Patsy看到的，线性模型通常要拟合一个截距。sm.add_constant函数可以添加一个截距的列到现存的矩阵：

In [68]: X_model = sm.add_constant(X)
In [69]: X_model[:5]
Out[69]: 
array([[ 1.    , -0.1295, -1.2128,  0.5042],
       [ 1.    ,  0.3029, -0.4357, -0.2542],
       [ 1.    , -0.3285, -0.0253,  0.1384],
       [ 1.    , -0.3515, -0.7196, -0.2582],
       [ 1.    ,  1.2433, -0.3738, -0.5226]])

sm.OLS类可以拟合一个普通最小二乘回归：

In [70]: model = sm.OLS(y, X)

这个模型的fit方法返回了一个回归结果对象，它包含估计的模型参数和其它内容：

In [71]: results = model.fit()
In [72]: results.params
Out[72]: array([ 0.1783,  0.223 ,  0.501 ])

对结果使用summary方法可以打印模型的详细诊断结果：

In [73]: print(results.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.430
Model:                            OLS   Adj. R-squared:                  0.413
Method:                 Least Squares   F-statistic:                     24.42
Date:                Mon, 25 Sep 2017   Prob (F-statistic):           7.44e-12
Time:                        14:06:15   Log-Likelihood:                -34.305
No. Observations:                 100   AIC:                             74.61
Df Residuals:                      97   BIC:                             82.42
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.1783      0.053      3.364      0.001       0.073       0.283
x2             0.2230      0.046      4.818      0.000       0.131       0.315
x3             0.5010      0.080      6.237      0.000       0.342       0.660
==============================================================================
Omnibus:                        4.662   Durbin-Watson:                   2.201
Prob(Omnibus):                  0.097   Jarque-Bera (JB):                4.098
Skew:                           0.481   Prob(JB):                        0.129
Kurtosis:                       3.243   Cond. No.
1.74
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly 
specified.

这里的参数名为通用名x1, x2等等。假设所有的模型参数都在一个DataFrame中：

In [74]: data = pd.DataFrame(X, columns=['col0', 'col1', 'col2'])
In [75]: data['y'] = y
In [76]: data[:5]
Out[76]: 
       col0      col1      col2         y
0 -0.129468 -1.212753  0.504225  0.427863
1  0.302910 -0.435742 -0.254180 -0.673480
2 -0.328522 -0.025302  0.138351 -0.090878
3 -0.351475 -0.719605 -0.258215 -0.489494
4  1.243269 -0.373799 -0.522629 -0.128941

现在，我们使用statsmodels的公式API和Patsy的公式字符串：

In [77]: results = smf.ols('y ~ col0 + col1 + col2', data=data).fit()
In [78]: results.params
Out[78]: 
Intercept    0.033559
col0         0.176149
col1         0.224826
col2         0.514808
dtype: float64
In [79]: results.tvalues
Out[79]: 
Intercept    0.952188
col0         3.319754
col1         4.850730
col2         6.303971
dtype: float64

观察下statsmodels是如何返回Series结果的，附带有DataFrame的列名。当使用公式和pandas对象时，我们不需要使用add_constant。

给出一个样本外数据，你可以根据估计的模型参数计算预测值：

In [80]: results.predict(data[:5])
Out[80]: 
0   -0.002327
1   -0.141904
2    0.041226
3   -0.323070
4   -0.100535
dtype: float64

statsmodels的线性模型结果还有其它的分析、诊断和可视化工具。除了普通最小二乘模型，还有其它的线性模型。

估计时间序列过程

statsmodels的另一模型类是进行时间序列分析，包括自回归过程、卡尔曼滤波和其它态空间模型，和多元自回归模型。

用自回归结构和噪声来模拟一些时间序列数据：

init_x = 4
import random
values = [init_x, init_x]
N = 1000
b0 = 0.8
b1 = -0.4
noise = dnorm(0, 0.1, N)
for i in range(N):
    new_x = values[-1] * b0 + values[-2] * b1 + noise[i]
    values.append(new_x)

这个数据有AR(2)结构（两个延迟），参数是0.8和-0.4。拟合AR模型时，你可能不知道滞后项的个数，因此可以用较多的滞后量来拟合这个模型：

In [82]: MAXLAGS = 5
In [83]: model = sm.tsa.AR(values)
In [84]: results = model.fit(MAXLAGS)

结果中的估计参数首先是截距，其次是前两个参数的估计值：

In [85]: results.params
Out[85]: array([-0.0062,  0.7845, -0.4085, -0.0136,  0.015 ,  0.0143])

更多的细节以及如何解释结果超出了本书的范围，可以通过statsmodels文档学习更多。