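3.6 scikit-learn: Machine Learning in Python - 3.5.1 Loading an Example Dataset - SciPy Lecture Notes, Chinese Edition (a Chinese translation of an introduction to the scientific Python ecosystem)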

3.5.1 Loading an Example Dataset

First, we will load some data to play with. The data we will use is the well-known and very simple iris flower dataset.

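We have 150 observations of iris flowers, each described by four measurements: sepal length, sepal width, petal length and petal width, together with the corresponding subclass: Iris setosa, Iris versicolor or Iris virginica.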

Load the dataset as a Python object:

In [1]:

from sklearn import datasets
iris = datasets.load_iris()

The data is stored in the .data member, which is an (n_samples, n_features) array.

In [2]:

iris.data.shape

Out[2]:

(150, 4)

The class of each observation is stored in the .target attribute of the dataset. This is a 1D integer array of length n_samples:

In [3]:

iris.target.shape

Out[3]:

(150,)

In [4]:

import numpy as np
np.unique(iris.target)

Out[4]:

array([0, 1, 2])
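
The integer labels 0, 1 and 2 correspond to the three species. As a minimal sketch, assuming the feature_names and target_names attributes of the object returned by load_iris (they are not shown in the text above), the labels can be mapped back to names and counted per class:

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Names of the four measurements, in the column order of iris.data
print(iris.feature_names)

# Species names; index i corresponds to label i in iris.target
print(iris.target_names)

# Count how many observations fall into each class
labels, counts = np.unique(iris.target, return_counts=True)
class_counts = {str(iris.target_names[l]): int(c) for l, c in zip(labels, counts)}
print(class_counts)
```

The dictionary name class_counts is only illustrative; the point is that the dataset is balanced across the three classes.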

An example of reshaping data: the digits dataset

The digits dataset contains 1797 images, each an 8x8-pixel picture representing a handwritten digit.

In [15]:

digits = datasets.load_digits()
digits.images.shape

Out[15]:

(1797, 8, 8)

In [8]:

import matplotlib.pyplot as plt
plt.imshow(digits.images[0], cmap=plt.cm.gray_r)

Out[8]:

<matplotlib.image.AxesImage at 0x109abd990>

To use this dataset with scikit-learn, we transform each 8x8 image into a vector of length 64:

In [9]:

data = digits.images.reshape((digits.images.shape[0], -1))
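
The effect of the reshape can be checked directly. The sketch below assumes the .data attribute of the object returned by load_digits, which is not mentioned in the text above and holds an already flattened copy of the images; the variable names same and restored are illustrative.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten each 8x8 image into one row of 64 pixel values
data = digits.images.reshape((digits.images.shape[0], -1))
print(data.shape)

# Compare with the pre-flattened array shipped by the loader
same = np.array_equal(data, digits.data)
print(same)

# The reshape is reversible: row 0 folds back into the first image
restored = data[0].reshape(8, 8)
print(np.array_equal(restored, digits.images[0]))
```

Passing -1 as the last dimension lets NumPy infer the row length (8 * 8 = 64) from the total number of elements.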

3.5.1.1 Learning and Prediction

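Now that we have some data, we would like to learn from it and make predictions on new data. In scikit-learn, we learn from existing data by creating an estimator and calling its fit(X, Y) method.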

In [11]:

from sklearn import svm
clf = svm.LinearSVC()
clf.fit(iris.data, iris.target) # learn from the data

Out[11]:

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

Once we have learned from the data, we can use our model to predict the most likely outcome for unseen data:

In [12]:

clf.predict([[ 5.0,  3.6,  1.3,  0.25]])

Out[12]:

array([0])
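
The prediction is an integer label. A small sketch of turning it back into a species name follows; it assumes the target_names attribute of the iris object, and it sets max_iter and random_state, which the original example does not, only to keep the run reproducible and to reduce the chance of convergence warnings.

```python
from sklearn import datasets, svm

iris = datasets.load_iris()

# max_iter and random_state are assumptions added for this sketch
clf = svm.LinearSVC(max_iter=10000, random_state=0)
clf.fit(iris.data, iris.target)

# predict expects a 2D input: one row per sample, one column per feature
sample = [[5.0, 3.6, 1.3, 0.25]]
label = clf.predict(sample)

# Look up the species name for the predicted integer label
name = str(iris.target_names[label[0]])
print(label, name)
```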

Note: we can access the parameters of the model via attributes whose names end with an underscore:

In [13]:

clf.coef_

Out[13]:

array([[ 0.18424728,  0.45122657, -0.80794162, -0.45070597],
       [ 0.05691797, -0.89245895,  0.39682582, -0.92882381],
       [-0.85072494, -0.98678239,  1.38091241,  1.86550868]])
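
To see what these coefficients mean, the sketch below recomputes the class scores by hand. It assumes the intercept_, classes_ and decision_function members of a fitted LinearSVC, none of which appear in the text above, and again sets max_iter and random_state for reproducibility, so the fitted numbers may differ from the output shown.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
clf = svm.LinearSVC(max_iter=10000, random_state=0)
clf.fit(iris.data, iris.target)

# One row of weights per class, one column per feature
print(clf.coef_.shape)
print(clf.intercept_.shape)

# A linear model scores each class as X . w + b
scores = iris.data @ clf.coef_.T + clf.intercept_
matches_scores = np.allclose(scores, clf.decision_function(iris.data))
print(matches_scores)

# The predicted class is the one with the highest score
manual = clf.classes_[np.argmax(scores, axis=1)]
matches_predict = np.array_equal(manual, clf.predict(iris.data))
print(matches_predict)
```

Each of the three rows of coef_ belongs to one class because LinearSVC trains one classifier per class in the one-vs-rest scheme (multi_class='ovr' in the output above).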