
Author: PaddlePaddle

Date: 2021.01

Abstract: This tutorial demonstrates how to use linear regression to predict Boston housing prices.


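A classic linear regression model is mainly used to make predictions on datasets in which a linear relationship exists. A regression model can be understood as the process of fitting a curve to the distribution of a set of points. If the fitted curve is a straight line, it is called linear regression; if it is a quadratic curve, it is called quadratic regression. Linear regression is the simplest kind of regression model.

This example briefly shows how to predict Boston housing prices with the open-source PaddlePaddle framework. The idea is to assume that the relationship between the house attributes and the house prices in the uci-housing dataset can be described by a linear combination of the attributes. During training, the error between the model's predictions and the true values is made smaller and smaller. During prediction, the predictor loads the trained model and predicts prices for house attributes it has never seen.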
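To make the idea of fitting a straight line to a point set concrete, here is a minimal sketch using the closed-form least-squares solution for a single input variable. The helper `fit_line` and the sample points are illustrative assumptions, not part of the tutorial's code or dataset; the tutorial itself fits 13 attributes with an optimizer.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w * x + b for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


# Made-up points that lie exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(w, b)
```

Because the sample points lie exactly on a line, the fit recovers its slope and intercept; with noisy data the same formula returns the line that minimizes the squared error.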


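This tutorial is written for Paddle 2.0. If your environment uses a different version, please install Paddle 2.0 first by following the official website.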

    import paddle
    import numpy as np
    import os
    import matplotlib
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    print(paddle.__version__)

    2.0.0






3.1 Data processing

    # Download the data
    !wget https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data -O housing.data

    --2021-01-27 18:04:47-- https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data
    Resolving archive.ics.uci.edu (archive.ics.uci.edu)...
    Connecting to archive.ics.uci.edu (archive.ics.uci.edu)||:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 49082 (48K) [application/x-httpd-php]
    Saving to: housing.data
    housing.data 100%[===================>] 47.93K 157KB/s in 0.3s
    2021-01-27 18:04:48 (157 KB/s) - housing.data saved [49082/49082]

    # Load the data from the file
    datafile = './housing.data'
    housing_data = np.fromfile(datafile, sep=' ')
    feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
    feature_num = len(feature_names)
    # Reshape the raw data into the shape [N, 14]
    housing_data = housing_data.reshape([housing_data.shape[0] // feature_num, feature_num])
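`np.fromfile` with a text separator returns a flat one-dimensional array, which is why the reshape step is needed. A small sketch of the same reshape on made-up data (the array `flat` is a stand-in for two records, not real housing values):

```python
import numpy as np

feature_num = 14

# Stand-in for the flat array read from the file: two records of 14 values each.
flat = np.arange(2 * feature_num, dtype=np.float64)

# Same expression as in the tutorial: rows = total values // values per record.
table = flat.reshape([flat.shape[0] // feature_num, feature_num])
print(table.shape)
```

Each row of `table` now holds one record, with the 13 attributes first and the target value in the last column.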
    # Plot the pairwise relationships between variables
    # (linear or nonlinear, and whether any clear correlation exists)
    features_np = np.array([x[:13] for x in housing_data], np.float32)
    labels_np = np.array([x[-1] for x in housing_data], np.float32)
    # data_np = np.c_[features_np, labels_np]
    df = pd.DataFrame(housing_data, columns=feature_names)
    matplotlib.use('TkAgg')
    %matplotlib inline
    sns.pairplot(df.dropna(), y_vars=feature_names[-1], x_vars=feature_names[::-1], diag_kind='kde')
    plt.show()


    # Correlation analysis: correlation of every column with MEDV (the last row of the matrix)
    fig, ax = plt.subplots(figsize=(15, 1))
    corr_data = df.corr().iloc[-1]
    corr_data = np.asarray(corr_data).reshape(1, 14)
    ax = sns.heatmap(corr_data, cbar=True, annot=True)
    plt.show()
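The heatmap uses the last row of `df.corr()`, which by default holds the Pearson correlation coefficient between each column and `MEDV`. As a rough sketch of what that row contains, the hypothetical helper `corr_with_target` below computes the same quantity with NumPy on a toy table (the values are made up):

```python
import numpy as np


def corr_with_target(data):
    """Pearson correlation of every column with the last column."""
    centered = data - data.mean(axis=0)
    target = centered[:, -1]
    # Sum of products of deviations, one value per column.
    cov = centered.T @ target
    norms = np.sqrt((centered ** 2).sum(axis=0)) * np.sqrt((target ** 2).sum())
    return cov / norms


# Toy table: column 0 rises with the target, column 1 falls with it.
toy = np.array([[1.0, 4.0, 2.0],
                [2.0, 3.0, 4.0],
                [3.0, 2.0, 6.0],
                [4.0, 1.0, 8.0]])
r = corr_with_target(toy)
print(r)
```

Values near 1 or -1 indicate a strong linear relationship with the target, which is the property a linear model can exploit.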
3.2 Data normalization


    sns.boxplot(data=df.iloc[:, 0:13])

    <AxesSubplot:>



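There are at least two reasons to normalize the data (also called feature scaling):

  • A value range that is too large or too small can cause floating-point overflow or underflow during computation.

  • Different value ranges give different attributes different weight in the model (at least in the early stage of training), and this implicit assumption is often unjustified. It makes optimization harder and significantly lengthens training time.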
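The first point can be seen directly in 32-bit floats, the dtype this tutorial uses for its training data. In the following sketch (values chosen purely for illustration), squaring a large number overflows to infinity and squaring a tiny number underflows to zero:

```python
import numpy as np

big = np.array([1e20], dtype=np.float32)
small = np.array([1e-30], dtype=np.float32)

# Silence NumPy's overflow/underflow warnings for this demonstration.
with np.errstate(over="ignore", under="ignore"):
    big_sq = big * big        # 1e40 exceeds the float32 maximum (about 3.4e38)
    small_sq = small * small  # 1e-60 is below the smallest positive float32

print(big_sq, small_sq)
```

Squared terms like these appear in a mean-squared-error loss, so keeping inputs in a moderate range helps avoid such results.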

    features_max = housing_data.max(axis=0)
    features_min = housing_data.min(axis=0)
    features_avg = housing_data.sum(axis=0) / housing_data.shape[0]

    BATCH_SIZE = 20

    def feature_norm(input):
        # Mean-normalize each of the 13 attribute columns: (x - mean) / (max - min)
        f_size = input.shape
        output_features = np.zeros(f_size, np.float32)
        for batch_id in range(f_size[0]):
            for index in range(13):
                output_features[batch_id][index] = (input[batch_id][index] - features_avg[index]) / (features_max[index] - features_min[index])
        return output_features
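The double loop in `feature_norm` applies mean normalization, (x - mean) / (max - min), one element at a time. An equivalent vectorized form relies on NumPy broadcasting. The sketch below is an illustrative alternative, not the tutorial's code; unlike `feature_norm`, the hypothetical `feature_norm_vectorized` computes its statistics from the array it is given rather than from global variables:

```python
import numpy as np


def feature_norm_vectorized(features):
    """Mean normalization, (x - mean) / (max - min), applied per column."""
    f_max = features.max(axis=0)
    f_min = features.min(axis=0)
    f_avg = features.mean(axis=0)
    # Broadcasting applies the per-column statistics to every row at once.
    return ((features - f_avg) / (f_max - f_min)).astype(np.float32)


# Toy input with two columns on very different scales.
toy = np.array([[0.0, 10.0],
                [5.0, 20.0],
                [10.0, 60.0]])
normed = feature_norm_vectorized(toy)
print(normed)
```

After normalization every column has zero mean and a max-minus-min range of 1, so no attribute dominates only because of its units.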
    # Normalize only the attributes, not the price column
    housing_features = feature_norm(housing_data[:, :13])
    # print(housing_features.shape)
    housing_data = np.c_[housing_features, housing_data[:, -1]].astype(np.float32)
    # print(housing_data[0])

    # Inspect each attribute of the data after normalization
    features_np = np.array([x[:13] for x in housing_data], np.float32)
    labels_np = np.array([x[-1] for x in housing_data], np.float32)
    data_np = np.c_[features_np, labels_np]
    df = pd.DataFrame(data_np, columns=feature_names)
    sns.boxplot(data=df.iloc[:, 0:13])

    <AxesSubplot:>


    # Split the data into training and test sets at an 8:2 ratio
    ratio = 0.8
    offset = int(housing_data.shape[0] * ratio)
    train_data = housing_data[:offset]
    test_data = housing_data[offset:]
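A quick check of the resulting sizes, assuming the housing file holds 506 records. The placeholder array `demo` only mimics the table's shape, and `split_rows` is a hypothetical helper that repeats the slicing above:

```python
import numpy as np


def split_rows(data, ratio=0.8):
    """Slice the first `ratio` share of rows for training, the rest for testing."""
    offset = int(data.shape[0] * ratio)
    return data[:offset], data[offset:]


# Placeholder with the assumed shape of the housing table: 506 rows, 14 columns.
demo = np.zeros((506, 14), dtype=np.float32)
train_part, test_part = split_rows(demo)
print(train_part.shape, test_part.shape)
```

Because the rows are sliced in file order without shuffling, the test set is simply the last block of records.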




    class Regressor(paddle.nn.Layer):
        def __init__(self):
            super(Regressor, self).__init__()
            # A single fully connected layer: 13 input attributes, 1 output (the price)
            self.fc = paddle.nn.Linear(13, 1)

        def forward(self, inputs):
            pred = self.fc(inputs)
            return pred
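`paddle.nn.Linear(13, 1)` holds a weight of shape [13, 1] and a bias of shape [1], and computes `inputs @ weight + bias`. The NumPy sketch below mirrors that forward pass with randomly initialized stand-in parameters (the names `weight`, `bias`, and `linear_forward` are illustrative, not Paddle objects):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the layer's parameters; Paddle initializes its own values.
weight = rng.normal(size=(13, 1)).astype(np.float32)
bias = np.zeros((1,), dtype=np.float32)


def linear_forward(inputs):
    """Forward pass of a 13-to-1 linear layer: one weighted sum per sample."""
    return inputs @ weight + bias


# A batch of 20 samples with 13 attributes each, matching BATCH_SIZE above.
batch = rng.normal(size=(20, 13)).astype(np.float32)
pred = linear_forward(batch)
print(pred.shape)
```

The output has one column per sample, which is why the training code keeps the labels in shape [batch, 1] as well.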


    train_nums = []
    train_costs = []

    def draw_train_process(iters, train_costs):
        plt.title("training cost", fontsize=24)
        plt.xlabel("iter", fontsize=14)
        plt.ylabel("cost", fontsize=14)
        plt.plot(iters, train_costs, color='red', label='training cost')
        plt.show()


5.1 Model training




    import paddle.nn.functional as F

    y_preds = []
    labels_list = []

    def train(model):
        print('start training ... ')
        # Switch the model to training mode
        model.train()
        EPOCH_NUM = 500
        train_num = 0
        optimizer = paddle.optimizer.SGD(learning_rate=0.001, parameters=model.parameters())
        for epoch_id in range(EPOCH_NUM):
            # Shuffle the training data before each epoch
            np.random.shuffle(train_data)
            # Split the training data into mini-batches of 20 samples each
            mini_batches = [train_data[k: k + BATCH_SIZE] for k in range(0, len(train_data), BATCH_SIZE)]
            for batch_id, data in enumerate(mini_batches):
                features_np = np.array(data[:, :13], np.float32)
                labels_np = np.array(data[:, -1:], np.float32)
                features = paddle.to_tensor(features_np)
                labels = paddle.to_tensor(labels_np)
                # Forward pass
                y_pred = model(features)
                cost = F.mse_loss(y_pred, label=labels)
                train_cost = cost.numpy()[0]
                # Backward pass
                cost.backward()
                # Minimize the loss and update the parameters
                optimizer.step()
                # Clear the gradients
                optimizer.clear_grad()

                if batch_id % 30 == 0 and epoch_id % 50 == 0:
                    print("Pass:%d,Cost:%0.5f" % (epoch_id, train_cost))

                train_num = train_num + BATCH_SIZE
                train_nums.append(train_num)
                train_costs.append(train_cost)

    model = Regressor()
    train(model)

    start training ...
    Pass:0,Cost:724.19617
    Pass:50,Cost:62.97696
    Pass:100,Cost:96.54344
    Pass:150,Cost:49.87206
    Pass:200,Cost:32.18977
    Pass:250,Cost:30.61844
    Pass:300,Cost:42.43702
    Pass:350,Cost:63.68068
    Pass:400,Cost:31.93441
    Pass:450,Cost:18.98611

    matplotlib.use('TkAgg')
    %matplotlib inline
    draw_train_process(train_nums, train_costs)
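To show what `cost.backward()` and `optimizer.step()` accomplish for this model, the following sketch implements mini-batch SGD on a mean-squared-error loss by hand in NumPy. It runs on synthetic data with a known linear relationship; `sgd_step`, the data, and the learning rate are assumptions for illustration, not the tutorial's settings:

```python
import numpy as np


def sgd_step(w, b, x, y, lr):
    """One SGD update for pred = x @ w + b under mean squared error."""
    pred = x @ w + b
    err = pred - y                     # shape [batch, 1]
    loss = float((err ** 2).mean())
    grad_w = 2.0 * x.T @ err / len(x)  # gradient of the loss w.r.t. w
    grad_b = 2.0 * err.mean()          # gradient of the loss w.r.t. b
    return w - lr * grad_w, b - lr * grad_b, loss


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
true_w = np.array([[1.5], [-2.0], [0.5]])
y = x @ true_w + 0.7                   # noiseless targets with bias 0.7

w, b = np.zeros((3, 1)), 0.0
losses = []
for epoch in range(200):
    order = rng.permutation(len(x))    # reshuffle the samples every epoch
    for start in range(0, len(x), 20):
        idx = order[start:start + 20]
        w, b, loss = sgd_step(w, b, x[idx], y[idx], lr=0.05)
        losses.append(loss)

print(losses[0], losses[-1])
```

On this noiseless data the parameters converge to the values that generated it. On the housing data the relationship is only approximately linear, which is consistent with the cost in the log above falling overall while still fluctuating between batches.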



5.2 Model prediction

    # Prepare the data for prediction
    # Assumed value: this definition is missing from the listing; the output below runs through No.90
    INFER_BATCH_SIZE = 100
    infer_features_np = np.array([data[:13] for data in test_data]).astype("float32")
    infer_labels_np = np.array([data[-1] for data in test_data]).astype("float32")
    infer_features = paddle.to_tensor(infer_features_np)
    infer_labels = paddle.to_tensor(infer_labels_np)
    fetch_list = model(infer_features)
    sum_cost = 0
    for i in range(INFER_BATCH_SIZE):
        infer_result = fetch_list[i][0]
        ground_truth = infer_labels[i]
        if i % 10 == 0:
            print("No.%d: infer result is %.2f,ground truth is %.2f" % (i, infer_result, ground_truth))
        cost = paddle.pow(infer_result - ground_truth, 2)
        sum_cost += cost
    mean_loss = sum_cost / INFER_BATCH_SIZE
    print("Mean loss is:", mean_loss.numpy())

    No.0: infer result is 12.00,ground truth is 8.50
    No.10: infer result is 5.56,ground truth is 7.00
    No.20: infer result is 15.01,ground truth is 11.70
    No.30: infer result is 16.49,ground truth is 11.70
    No.40: infer result is 13.58,ground truth is 10.80
    No.50: infer result is 15.98,ground truth is 14.90
    No.60: infer result is 18.70,ground truth is 21.40
    No.70: infer result is 15.55,ground truth is 13.80
    No.80: infer result is 18.15,ground truth is 20.60
    No.90: infer result is 21.36,ground truth is 24.50
    Mean loss is: [12.574625]

    def plot_pred_ground(pred, ground):
        plt.figure()
        plt.title("Prediction v.s. Ground truth", fontsize=24)
        plt.xlabel("ground truth price(unit:$1000)", fontsize=14)
        plt.ylabel("predict price", fontsize=14)
        plt.scatter(ground, pred, alpha=0.5)  # scatter plot; alpha sets the transparency
        plt.plot(ground, ground, c='red')
        plt.show()

    plot_pred_ground(fetch_list, infer_labels_np)
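The mean loss computed in the prediction loop is the mean squared error over the evaluated samples. A vectorized sketch of the same quantity follows; `mean_squared_error` is a hypothetical helper, and the three value pairs are the No.0, No.10, and No.20 results from the prediction output:

```python
import numpy as np


def mean_squared_error(pred, truth):
    """Mean of squared differences after flattening both inputs to 1-D."""
    pred = np.asarray(pred, dtype=np.float64).reshape(-1)
    truth = np.asarray(truth, dtype=np.float64).reshape(-1)
    return float(np.mean((pred - truth) ** 2))


# Predictions keep the model's [N, 1] shape; labels are a flat list.
pred = [[12.00], [5.56], [15.01]]
truth = [8.50, 7.00, 11.70]
mse = mean_squared_error(pred, truth)
print(mse)
```

Flattening both arrays first matters: subtracting labels of shape [N] from predictions of shape [N, 1] would broadcast to an [N, N] matrix instead of N differences.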





    import paddle

    paddle.set_default_dtype("float64")

    # Step 1: define the datasets with the high-level API; it handles the data processing for you
    train_dataset = paddle.text.datasets.UCIHousing(mode='train')
    eval_dataset = paddle.text.datasets.UCIHousing(mode='test')

    # Step 2: define the model
    class UCIHousing(paddle.nn.Layer):
        def __init__(self):
            super(UCIHousing, self).__init__()
            self.fc = paddle.nn.Linear(13, 1, None)

        def forward(self, input):
            pred = self.fc(input)
            return pred

    # Step 3: train the model
    model = paddle.Model(UCIHousing())
    model.prepare(paddle.optimizer.Adam(parameters=model.parameters()),
                  paddle.nn.MSELoss())
    model.fit(train_dataset, eval_dataset, epochs=5, batch_size=8, verbose=1)

    The loss value printed in the log is the current step, and the metric is the average value of previous step.
    Epoch 1/5
    step 51/51 [==============================] - loss: 628.4189 - 2ms/step
    Eval begin...
    The loss value printed in the log is the current batch, and the metric is the average value of previous step.
    step 13/13 [==============================] - loss: 385.1105 - 990us/step
    Eval samples: 102
    Epoch 2/5
    step 51/51 [==============================] - loss: 416.6072 - 2ms/step
    Eval begin...
    The loss value printed in the log is the current batch, and the metric is the average value of previous step.
    step 13/13 [==============================] - loss: 382.5877 - 1ms/step
    Eval samples: 102
    Epoch 3/5
    step 51/51 [==============================] - loss: 417.1789 - 1ms/step
    Eval begin...
    The loss value printed in the log is the current batch, and the metric is the average value of previous step.
    step 13/13 [==============================] - loss: 380.1073 - 1ms/step
    Eval samples: 102
    Epoch 4/5
    step 51/51 [==============================] - loss: 424.5966 - 1ms/step
    Eval begin...
    The loss value printed in the log is the current batch, and the metric is the average value of previous step.
    step 13/13 [==============================] - loss: 377.6421 - 972us/step
    Eval samples: 102
    Epoch 5/5
    step 51/51 [==============================] - loss: 466.6127 - 1ms/step
    Eval begin...
    The loss value printed in the log is the current batch, and the metric is the average value of previous step.
    step 13/13 [==============================] - loss: 375.1613 - 925us/step
    Eval samples: 102
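The step counts in the log follow from the dataset sizes and `batch_size=8`: each epoch needs ceil(N / batch_size) steps. The sketch below checks this, taking the 102 evaluation samples from the log and assuming 404 training samples (an inference from an 8:2 split of 506 records, not a number the log reports). The helper `steps_per_epoch` is hypothetical:

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover every sample once."""
    return math.ceil(num_samples / batch_size)


train_steps = steps_per_epoch(404, 8)  # 404 is an assumed training-set size
eval_steps = steps_per_epoch(102, 8)   # 102 is the "Eval samples" value in the log
print(train_steps, eval_steps)
```

Under these assumptions the results match the `step 51/51` and `step 13/13` lines in the log, with the last batch of each pass holding fewer than 8 samples.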