1-1, A Modeling Workflow Example for Structured Data

I. Prepare the Data

The goal of the Titanic dataset is to predict, from passenger information, whether a passenger survived after the Titanic struck an iceberg and sank.

Structured data is usually preprocessed with a Pandas DataFrame.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import models, layers

dftrain_raw = pd.read_csv('./data/titanic/train.csv')
dftest_raw = pd.read_csv('./data/titanic/test.csv')
dftrain_raw.head(10)
```

(Figure 1: the first 10 rows of dftrain_raw)

Field descriptions:

  • Survived: 0 = died, 1 = survived [the y label]
  • Pclass: ticket class, one of (1, 2, 3) [convert to one-hot encoding]
  • Name: passenger name [drop]
  • Sex: passenger gender [convert to a bool feature]
  • Age: passenger age (has missing values) [numeric feature; add an "is Age missing" auxiliary feature]
  • SibSp: number of siblings/spouses aboard (integer) [numeric feature]
  • Parch: number of parents/children aboard (integer) [numeric feature]
  • Ticket: ticket number (string) [drop]
  • Fare: ticket fare (float, roughly 0-500) [numeric feature]
  • Cabin: cabin (has missing values) [add an "is Cabin missing" auxiliary feature]
  • Embarked: port of embarkation, S/C/Q (has missing values) [convert to one-hot encoding with four dimensions: S, C, Q, nan; see the toy check below]
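As a quick sanity check (a toy example, not part of the original workflow), `pd.get_dummies` with `dummy_na=True` is what produces that fourth NaN dimension:

```python
# Toy check: dummy_na=True adds an extra indicator column for missing values
s = pd.Series(['S', 'C', 'Q', None])
print(pd.get_dummies(s, dummy_na=True))
# -> columns C, Q, S, NaN: one indicator per port plus one for missing
```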

Pandas' built-in plotting makes it easy to do some simple exploratory data analysis (EDA).

Label distribution:

```python
%matplotlib inline
%config InlineBackend.figure_format = 'png'
ax = dftrain_raw['Survived'].value_counts().plot(kind='bar',
     figsize=(12, 8), fontsize=15, rot=0)
ax.set_ylabel('Counts', fontsize=15)
ax.set_xlabel('Survived', fontsize=15)
plt.show()
```

(Figure 2: bar chart of Survived label counts)

Age distribution:

```python
%matplotlib inline
%config InlineBackend.figure_format = 'png'
ax = dftrain_raw['Age'].plot(kind='hist', bins=20, color='purple',
     figsize=(12, 8), fontsize=15)
ax.set_ylabel('Frequency', fontsize=15)
ax.set_xlabel('Age', fontsize=15)
plt.show()
```

(Figure 3: histogram of passenger ages)

Correlation between age and the label:

```python
%matplotlib inline
%config InlineBackend.figure_format = 'png'
ax = dftrain_raw.query('Survived == 0')['Age'].plot(kind='density',
     figsize=(12, 8), fontsize=15)
dftrain_raw.query('Survived == 1')['Age'].plot(kind='density',
     figsize=(12, 8), fontsize=15)
ax.legend(['Survived==0', 'Survived==1'], fontsize=12)
ax.set_ylabel('Density', fontsize=15)
ax.set_xlabel('Age', fontsize=15)
plt.show()
```

(Figure 4: age density curves for Survived==0 vs Survived==1)

Below is the formal data preprocessing.

```python
def preprocessing(dfdata):
    dfresult = pd.DataFrame()

    # Pclass
    dfPclass = pd.get_dummies(dfdata['Pclass'])
    dfPclass.columns = ['Pclass_' + str(x) for x in dfPclass.columns]
    dfresult = pd.concat([dfresult, dfPclass], axis=1)

    # Sex
    dfSex = pd.get_dummies(dfdata['Sex'])
    dfresult = pd.concat([dfresult, dfSex], axis=1)

    # Age
    dfresult['Age'] = dfdata['Age'].fillna(0)
    dfresult['Age_null'] = pd.isna(dfdata['Age']).astype('int32')

    # SibSp, Parch, Fare
    dfresult['SibSp'] = dfdata['SibSp']
    dfresult['Parch'] = dfdata['Parch']
    dfresult['Fare'] = dfdata['Fare']

    # Cabin
    dfresult['Cabin_null'] = pd.isna(dfdata['Cabin']).astype('int32')

    # Embarked
    dfEmbarked = pd.get_dummies(dfdata['Embarked'], dummy_na=True)
    dfEmbarked.columns = ['Embarked_' + str(x) for x in dfEmbarked.columns]
    dfresult = pd.concat([dfresult, dfEmbarked], axis=1)

    return dfresult

x_train = preprocessing(dftrain_raw)
y_train = dftrain_raw['Survived'].values

x_test = preprocessing(dftest_raw)
y_test = dftest_raw['Survived'].values

print("x_train.shape =", x_train.shape)
print("x_test.shape =", x_test.shape)
```
```
x_train.shape = (712, 15)
x_test.shape = (179, 15)
```

The 15 feature columns break down as 3 (Pclass) + 2 (Sex) + 2 (Age, Age_null) + 3 (SibSp, Parch, Fare) + 1 (Cabin_null) + 4 (Embarked).

II. Define the Model

The Keras API offers three ways to build a model: Sequential, which stacks layers in order; the functional API, which can express models of arbitrary structure; and subclassing the Model base class to build a custom model.

Here we choose the simplest: a Sequential, layer-by-layer model.
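For comparison, here is a minimal sketch (not part of the original example) of the same 20-10-1 network written in the other two styles:

```python
# Functional API: wire layers together as a graph of tensors
inputs = layers.Input(shape=(15,))
x = layers.Dense(20, activation='relu')(inputs)
x = layers.Dense(10, activation='relu')(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
functional_model = models.Model(inputs=inputs, outputs=outputs)

# Subclassing: override __init__ and call on the Model base class
class TitanicModel(models.Model):
    def __init__(self):
        super().__init__()
        self.dense1 = layers.Dense(20, activation='relu')
        self.dense2 = layers.Dense(10, activation='relu')
        self.out = layers.Dense(1, activation='sigmoid')

    def call(self, inputs):
        return self.out(self.dense2(self.dense1(inputs)))
```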

```python
tf.keras.backend.clear_session()

model = models.Sequential()
model.add(layers.Dense(20, activation='relu', input_shape=(15,)))
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.summary()
```
```
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 20)                320
_________________________________________________________________
dense_1 (Dense)              (None, 10)                210
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 11
=================================================================
Total params: 541
Trainable params: 541
Non-trainable params: 0
_________________________________________________________________
```

III. Train the Model

There are usually three ways to train a model: the built-in fit method, the built-in train_on_batch method, or a custom training loop. Here we use fit, the most common and simplest of the three.
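For reference, a minimal sketch (not the author's training code) of what one step of the same training would look like as a custom loop with tf.GradientTape:

```python
# A custom training step, equivalent in spirit to what model.fit does internally
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(features, labels):
    with tf.GradientTape() as tape:
        preds = model(features, training=True)
        loss = loss_fn(labels, preds)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```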

```python
# For a binary classification problem, use the binary cross-entropy loss
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['AUC'])

history = model.fit(x_train, y_train,
                    batch_size=64,
                    epochs=30,
                    validation_split=0.2  # hold out part of the training data for validation
                    )
```
```
Train on 569 samples, validate on 143 samples
Epoch 1/30
569/569 [==============================] - 1s 2ms/sample - loss: 3.5841 - AUC: 0.4079 - val_loss: 3.4429 - val_AUC: 0.4129
Epoch 2/30
569/569 [==============================] - 0s 102us/sample - loss: 2.6093 - AUC: 0.3967 - val_loss: 2.4886 - val_AUC: 0.4139
Epoch 3/30
569/569 [==============================] - 0s 68us/sample - loss: 1.8375 - AUC: 0.4003 - val_loss: 1.7383 - val_AUC: 0.4223
Epoch 4/30
569/569 [==============================] - 0s 83us/sample - loss: 1.2545 - AUC: 0.4390 - val_loss: 1.1936 - val_AUC: 0.4765
Epoch 5/30
569/569 [==============================] - ETA: 0s - loss: 1.4435 - AUC: 0.375 - 0s 90us/sample - loss: 0.9141 - AUC: 0.5192 - val_loss: 0.8274 - val_AUC: 0.5584
Epoch 6/30
569/569 [==============================] - 0s 110us/sample - loss: 0.7052 - AUC: 0.6290 - val_loss: 0.6596 - val_AUC: 0.6880
Epoch 7/30
569/569 [==============================] - 0s 90us/sample - loss: 0.6410 - AUC: 0.7086 - val_loss: 0.6519 - val_AUC: 0.6845
Epoch 8/30
569/569 [==============================] - 0s 93us/sample - loss: 0.6246 - AUC: 0.7080 - val_loss: 0.6480 - val_AUC: 0.6846
Epoch 9/30
569/569 [==============================] - 0s 73us/sample - loss: 0.6088 - AUC: 0.7113 - val_loss: 0.6497 - val_AUC: 0.6838
Epoch 10/30
569/569 [==============================] - 0s 79us/sample - loss: 0.6051 - AUC: 0.7117 - val_loss: 0.6454 - val_AUC: 0.6873
Epoch 11/30
569/569 [==============================] - 0s 96us/sample - loss: 0.5972 - AUC: 0.7218 - val_loss: 0.6369 - val_AUC: 0.6888
Epoch 12/30
569/569 [==============================] - 0s 92us/sample - loss: 0.5918 - AUC: 0.7294 - val_loss: 0.6330 - val_AUC: 0.6908
Epoch 13/30
569/569 [==============================] - 0s 75us/sample - loss: 0.5864 - AUC: 0.7363 - val_loss: 0.6281 - val_AUC: 0.6948
Epoch 14/30
569/569 [==============================] - 0s 104us/sample - loss: 0.5832 - AUC: 0.7426 - val_loss: 0.6240 - val_AUC: 0.7030
Epoch 15/30
569/569 [==============================] - 0s 74us/sample - loss: 0.5777 - AUC: 0.7507 - val_loss: 0.6200 - val_AUC: 0.7066
Epoch 16/30
569/569 [==============================] - 0s 79us/sample - loss: 0.5726 - AUC: 0.7569 - val_loss: 0.6155 - val_AUC: 0.7132
Epoch 17/30
569/569 [==============================] - 0s 99us/sample - loss: 0.5674 - AUC: 0.7643 - val_loss: 0.6070 - val_AUC: 0.7255
Epoch 18/30
569/569 [==============================] - 0s 97us/sample - loss: 0.5631 - AUC: 0.7721 - val_loss: 0.6061 - val_AUC: 0.7305
Epoch 19/30
569/569 [==============================] - 0s 73us/sample - loss: 0.5580 - AUC: 0.7792 - val_loss: 0.6027 - val_AUC: 0.7332
Epoch 20/30
569/569 [==============================] - 0s 85us/sample - loss: 0.5533 - AUC: 0.7861 - val_loss: 0.5997 - val_AUC: 0.7366
Epoch 21/30
569/569 [==============================] - 0s 87us/sample - loss: 0.5497 - AUC: 0.7926 - val_loss: 0.5961 - val_AUC: 0.7433
Epoch 22/30
569/569 [==============================] - 0s 101us/sample - loss: 0.5454 - AUC: 0.7987 - val_loss: 0.5943 - val_AUC: 0.7438
Epoch 23/30
569/569 [==============================] - 0s 100us/sample - loss: 0.5398 - AUC: 0.8057 - val_loss: 0.5926 - val_AUC: 0.7492
Epoch 24/30
569/569 [==============================] - 0s 79us/sample - loss: 0.5328 - AUC: 0.8122 - val_loss: 0.5912 - val_AUC: 0.7493
Epoch 25/30
569/569 [==============================] - 0s 86us/sample - loss: 0.5283 - AUC: 0.8147 - val_loss: 0.5902 - val_AUC: 0.7509
Epoch 26/30
569/569 [==============================] - 0s 67us/sample - loss: 0.5246 - AUC: 0.8196 - val_loss: 0.5845 - val_AUC: 0.7552
Epoch 27/30
569/569 [==============================] - 0s 72us/sample - loss: 0.5205 - AUC: 0.8271 - val_loss: 0.5837 - val_AUC: 0.7584
Epoch 28/30
569/569 [==============================] - 0s 74us/sample - loss: 0.5144 - AUC: 0.8302 - val_loss: 0.5848 - val_AUC: 0.7561
Epoch 29/30
569/569 [==============================] - 0s 77us/sample - loss: 0.5099 - AUC: 0.8326 - val_loss: 0.5809 - val_AUC: 0.7583
Epoch 30/30
569/569 [==============================] - 0s 80us/sample - loss: 0.5071 - AUC: 0.8349 - val_loss: 0.5816 - val_AUC: 0.7605
```

IV. Evaluate the Model

Let's first evaluate the model's performance on the training and validation sets.

```python
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import matplotlib.pyplot as plt

def plot_metric(history, metric):
    train_metrics = history.history[metric]
    val_metrics = history.history['val_' + metric]
    epochs = range(1, len(train_metrics) + 1)
    plt.plot(epochs, train_metrics, 'bo--')
    plt.plot(epochs, val_metrics, 'ro-')
    plt.title('Training and validation ' + metric)
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend(["train_" + metric, 'val_' + metric])
    plt.show()
```
```python
plot_metric(history, "loss")
```

(Figure 5: training and validation loss curves)

```python
plot_metric(history, "AUC")
```

(Figure 6: training and validation AUC curves)

Now let's look at the model's performance on the test set.

```python
model.evaluate(x=x_test, y=y_test)
```

```
[0.5191367897907448, 0.8122605]
```

The two numbers are the test loss and the test AUC, in that order.

V. Use the Model

```python
# predicted probabilities
model.predict(x_test[0:10])
# model(tf.constant(x_test[0:10].values, dtype=tf.float32))  # equivalent call
```
```
array([[0.26501188],
       [0.40970832],
       [0.44285864],
       [0.78408605],
       [0.47650957],
       [0.43849158],
       [0.27426785],
       [0.5962582 ],
       [0.59476686],
       [0.17882936]], dtype=float32)
```
```python
# predicted classes
model.predict_classes(x_test[0:10])
```
```
array([[0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0]], dtype=int32)
```
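Note that in newer TensorFlow releases predict_classes was deprecated and later removed from Sequential models; a version-independent equivalent (an alternative, not the original code) is to threshold the sigmoid output yourself:

```python
# Equivalent of predict_classes for a sigmoid (binary) output
(model.predict(x_test[0:10]) > 0.5).astype('int32')
```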

VI. Save the Model

A model can be saved the Keras way or the native TensorFlow way. The former only supports restoring the model from a Python environment; the latter supports cross-platform model deployment.

The latter is the recommended way to save a model.

1. Saving in the Keras format

```python
# save the model architecture and weights
model.save('./data/keras_model.h5')

del model  # delete the existing model

# identical to the previous one
model = models.load_model('./data/keras_model.h5')
model.evaluate(x_test, y_test)
```

```
[0.5191367897907448, 0.8122605]
```
```python
# save the model architecture
json_str = model.to_json()

# restore the model architecture
model_json = models.model_from_json(json_str)
```
```python
# save the model weights
model.save_weights('./data/keras_model_weight.h5')

# restore the model architecture
model_json = models.model_from_json(json_str)
model_json.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['AUC']
)

# load the weights
model_json.load_weights('./data/keras_model_weight.h5')
model_json.evaluate(x_test, y_test)
```

```
[0.5191367897907448, 0.8122605]
```

2. Saving in the native TensorFlow format

```python
# save the weights as a checkpoint; this only saves the weight tensors
model.save_weights('./data/tf_model_weights.ckpt', save_format="tf")
```
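The original does not show these weights being restored; a minimal sketch (not in the original) would rebuild the same architecture and load the checkpoint into it:

```python
# Rebuild the same 20-10-1 architecture, then restore the checkpointed weights
model_ckpt = models.Sequential([
    layers.Dense(20, activation='relu', input_shape=(15,)),
    layers.Dense(10, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model_ckpt.load_weights('./data/tf_model_weights.ckpt')
```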
```python
# Save the architecture and parameters to a file; models saved this way
# are cross-platform and convenient to deploy
model.save('./data/tf_model_savedmodel', save_format="tf")
print('export saved model.')

model_loaded = tf.keras.models.load_model('./data/tf_model_savedmodel')
model_loaded.evaluate(x_test, y_test)
```

```
[0.5191365896656527, 0.8122605]
```
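Because a SavedModel carries its serving signatures, it can also be loaded without any Keras code via the low-level tf.saved_model API, which is what non-Python runtimes build on. A minimal sketch (the signature input name is looked up rather than hard-coded, since it depends on the layer names):

```python
# Load the SavedModel without Keras and call its serving signature
mdl = tf.saved_model.load('./data/tf_model_savedmodel')
infer = mdl.signatures["serving_default"]

# Look up the signature's input name instead of assuming it
input_name = list(infer.structured_input_signature[1].keys())[0]
preds = infer(**{input_name: tf.constant(x_test[0:10].values, dtype=tf.float32)})
print(preds)  # a dict mapping output names to tensors
```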

If you would like to discuss the contents of this book further with the author, feel free to leave a message under the WeChat public account "Python与算法之美" (Python and the Beauty of Algorithms). The author's time and energy are limited, so replies will be made as circumstances allow.

You can also reply with the keyword 加群 (join group) in the public account backend to join the reader discussion group.
