1-1 Example: Modeling Procedure for Structured Data

1. Data Preparation

The goal of the Titanic dataset is to predict, from each passenger's personal information, whether that passenger survived after the Titanic hit the iceberg.

We usually use DataFrame from the pandas library to pre-process the structured data.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import tensorflow as tf
    from tensorflow.keras import models, layers

    dftrain_raw = pd.read_csv('../data/titanic/train.csv')
    dftest_raw = pd.read_csv('../data/titanic/test.csv')
    dftrain_raw.head(10)

Figure 1: the first 10 rows of dftrain_raw

Description of each field:

  • Survived: 0 for died, 1 for survived [the label y]
  • Pclass: ticket class, with three possible values (1, 2, 3) [converted to one-hot encoding]
  • Name: name of the passenger [discarded]
  • Sex: gender of the passenger [converted to bool type]
  • Age: age of the passenger (partly missing) [numerical feature; add "whether age is missing" as an auxiliary feature]
  • SibSp: number of siblings and spouses of the passenger (integer) [numerical feature]
  • Parch: number of parents/children of the passenger (integer) [numerical feature]
  • Ticket: ticket number (string) [discarded]
  • Fare: ticket price paid by the passenger (float, between 0 and 500) [numerical feature]
  • Cabin: cabin in which the passenger stayed (partly missing) [add "whether cabin is missing" as an auxiliary feature]
  • Embarked: port at which the passenger embarked; possible values are S, C, Q (partly missing) [converted to one-hot encoding with four dimensions: S, C, Q, nan]
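The bracketed notes in the list describe two encodings: one-hot columns with an extra slot for missing values, and a 0/1 "is missing" flag stored next to a filled-in numeric value. The following is a minimal sketch of both ideas in plain Python, independent of pandas. The helper names `is_missing`, `one_hot_with_nan` and `fill_with_flag` are hypothetical and exist only for illustration; the category order C, Q, S is an assumption.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def one_hot_with_nan(value, categories):
    """One-hot encode `value` over `categories`, plus one trailing slot for missing."""
    vector = [0] * (len(categories) + 1)
    if is_missing(value):
        vector[-1] = 1
    else:
        vector[categories.index(value)] = 1
    return vector

def fill_with_flag(value, fill=0.0):
    """Return (filled value, missing flag) for a numeric field such as Age."""
    if is_missing(value):
        return fill, 1
    return float(value), 0

embarked_categories = ["C", "Q", "S"]  # assumed order, for illustration
print(one_hot_with_nan("S", embarked_categories))   # [0, 0, 1, 0]
print(one_hot_with_nan(None, embarked_categories))  # [0, 0, 0, 1]
print(fill_with_flag(22))                           # (22.0, 0)
print(fill_with_flag(float("nan")))                 # (0.0, 1)
```

Keeping the flag next to the filled value lets a model tell a real 0 apart from a placeholder 0.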

Use the plotting functions of the pandas library for an initial EDA (Exploratory Data Analysis).

Survival label distribution:

    %matplotlib inline
    %config InlineBackend.figure_format = 'png'
    ax = dftrain_raw['Survived'].value_counts().plot(kind='bar',
        figsize=(12, 8), fontsize=15, rot=0)
    ax.set_ylabel('Counts', fontsize=15)
    ax.set_xlabel('Survived', fontsize=15)
    plt.show()

Figure 2: bar chart of the Survived label counts

Age distribution:

    %matplotlib inline
    %config InlineBackend.figure_format = 'png'
    ax = dftrain_raw['Age'].plot(kind='hist', bins=20, color='purple',
        figsize=(12, 8), fontsize=15)
    ax.set_ylabel('Frequency', fontsize=15)
    ax.set_xlabel('Age', fontsize=15)
    plt.show()

Figure 3: histogram of Age

Correlation between age and survival label:

    %matplotlib inline
    %config InlineBackend.figure_format = 'png'
    ax = dftrain_raw.query('Survived == 0')['Age'].plot(kind='density',
        figsize=(12, 8), fontsize=15)
    dftrain_raw.query('Survived == 1')['Age'].plot(kind='density',
        figsize=(12, 8), fontsize=15)
    ax.legend(['Survived==0', 'Survived==1'], fontsize=12)
    ax.set_ylabel('Density', fontsize=15)
    ax.set_xlabel('Age', fontsize=15)
    plt.show()

Figure 4: density of Age for Survived==0 and Survived==1

Below is the code for the formal data preprocessing:

    def preprocessing(dfdata):

        dfresult = pd.DataFrame()

        # Pclass: one-hot encode the ticket class
        dfPclass = pd.get_dummies(dfdata['Pclass'])
        dfPclass.columns = ['Pclass_' + str(x) for x in dfPclass.columns]
        dfresult = pd.concat([dfresult, dfPclass], axis=1)

        # Sex: one indicator column per value
        dfSex = pd.get_dummies(dfdata['Sex'])
        dfresult = pd.concat([dfresult, dfSex], axis=1)

        # Age: fill missing values with 0 and keep a "missing" flag
        dfresult['Age'] = dfdata['Age'].fillna(0)
        dfresult['Age_null'] = pd.isna(dfdata['Age']).astype('int32')

        # SibSp, Parch, Fare: numerical features, used as-is
        dfresult['SibSp'] = dfdata['SibSp']
        dfresult['Parch'] = dfdata['Parch']
        dfresult['Fare'] = dfdata['Fare']

        # Cabin: only keep whether the value is missing
        dfresult['Cabin_null'] = pd.isna(dfdata['Cabin']).astype('int32')

        # Embarked: one-hot encode, with an extra column for missing values
        dfEmbarked = pd.get_dummies(dfdata['Embarked'], dummy_na=True)
        dfEmbarked.columns = ['Embarked_' + str(x) for x in dfEmbarked.columns]
        dfresult = pd.concat([dfresult, dfEmbarked], axis=1)

        return dfresult

    x_train = preprocessing(dftrain_raw)
    y_train = dftrain_raw['Survived'].values

    x_test = preprocessing(dftest_raw)
    y_test = dftest_raw['Survived'].values

    print("x_train.shape =", x_train.shape)
    print("x_test.shape =", x_test.shape)

    x_train.shape = (712, 15)
    x_test.shape = (179, 15)

2. Model Definition

Keras usually offers three ways to build a model: stacking layers in order with the Sequential() function, building arbitrary structures with the functional API, and defining a customized model by inheriting from the base class Model.

Here we take the simplest way: sequential modeling with the Sequential() function.

    tf.keras.backend.clear_session()

    model = models.Sequential()
    model.add(layers.Dense(20, activation='relu', input_shape=(15,)))
    model.add(layers.Dense(10, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))

    model.summary()

    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    dense (Dense)                (None, 20)                320
    _________________________________________________________________
    dense_1 (Dense)              (None, 10)                210
    _________________________________________________________________
    dense_2 (Dense)              (None, 1)                 11
    =================================================================
    Total params: 541
    Trainable params: 541
    Non-trainable params: 0
    _________________________________________________________________
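The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The sketch below reproduces the numbers; `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input width (15 features) followed by the sizes of the three Dense layers
layer_sizes = [15, 20, 10, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [320, 210, 11]
print(sum(counts))  # 541
```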

3. Model Training

There are three usual ways to train a model: the built-in fit method, the built-in train_on_batch method, and a customized training loop. Here we introduce the simplest way: the built-in fit method.

    # Use the binary cross-entropy loss function for binary classification
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['AUC'])

    history = model.fit(x_train, y_train,
                        batch_size=64,
                        epochs=30,
                        validation_split=0.2  # Split off part of the training data for validation
                        )

    Train on 569 samples, validate on 143 samples
    Epoch 1/30
    569/569 [==============================] - 1s 2ms/sample - loss: 3.5841 - AUC: 0.4079 - val_loss: 3.4429 - val_AUC: 0.4129
    Epoch 2/30
    569/569 [==============================] - 0s 102us/sample - loss: 2.6093 - AUC: 0.3967 - val_loss: 2.4886 - val_AUC: 0.4139
    Epoch 3/30
    569/569 [==============================] - 0s 68us/sample - loss: 1.8375 - AUC: 0.4003 - val_loss: 1.7383 - val_AUC: 0.4223
    Epoch 4/30
    569/569 [==============================] - 0s 83us/sample - loss: 1.2545 - AUC: 0.4390 - val_loss: 1.1936 - val_AUC: 0.4765
    Epoch 5/30
    569/569 [==============================] - 0s 90us/sample - loss: 0.9141 - AUC: 0.5192 - val_loss: 0.8274 - val_AUC: 0.5584
    Epoch 6/30
    569/569 [==============================] - 0s 110us/sample - loss: 0.7052 - AUC: 0.6290 - val_loss: 0.6596 - val_AUC: 0.6880
    Epoch 7/30
    569/569 [==============================] - 0s 90us/sample - loss: 0.6410 - AUC: 0.7086 - val_loss: 0.6519 - val_AUC: 0.6845
    Epoch 8/30
    569/569 [==============================] - 0s 93us/sample - loss: 0.6246 - AUC: 0.7080 - val_loss: 0.6480 - val_AUC: 0.6846
    Epoch 9/30
    569/569 [==============================] - 0s 73us/sample - loss: 0.6088 - AUC: 0.7113 - val_loss: 0.6497 - val_AUC: 0.6838
    Epoch 10/30
    569/569 [==============================] - 0s 79us/sample - loss: 0.6051 - AUC: 0.7117 - val_loss: 0.6454 - val_AUC: 0.6873
    Epoch 11/30
    569/569 [==============================] - 0s 96us/sample - loss: 0.5972 - AUC: 0.7218 - val_loss: 0.6369 - val_AUC: 0.6888
    Epoch 12/30
    569/569 [==============================] - 0s 92us/sample - loss: 0.5918 - AUC: 0.7294 - val_loss: 0.6330 - val_AUC: 0.6908
    Epoch 13/30
    569/569 [==============================] - 0s 75us/sample - loss: 0.5864 - AUC: 0.7363 - val_loss: 0.6281 - val_AUC: 0.6948
    Epoch 14/30
    569/569 [==============================] - 0s 104us/sample - loss: 0.5832 - AUC: 0.7426 - val_loss: 0.6240 - val_AUC: 0.7030
    Epoch 15/30
    569/569 [==============================] - 0s 74us/sample - loss: 0.5777 - AUC: 0.7507 - val_loss: 0.6200 - val_AUC: 0.7066
    Epoch 16/30
    569/569 [==============================] - 0s 79us/sample - loss: 0.5726 - AUC: 0.7569 - val_loss: 0.6155 - val_AUC: 0.7132
    Epoch 17/30
    569/569 [==============================] - 0s 99us/sample - loss: 0.5674 - AUC: 0.7643 - val_loss: 0.6070 - val_AUC: 0.7255
    Epoch 18/30
    569/569 [==============================] - 0s 97us/sample - loss: 0.5631 - AUC: 0.7721 - val_loss: 0.6061 - val_AUC: 0.7305
    Epoch 19/30
    569/569 [==============================] - 0s 73us/sample - loss: 0.5580 - AUC: 0.7792 - val_loss: 0.6027 - val_AUC: 0.7332
    Epoch 20/30
    569/569 [==============================] - 0s 85us/sample - loss: 0.5533 - AUC: 0.7861 - val_loss: 0.5997 - val_AUC: 0.7366
    Epoch 21/30
    569/569 [==============================] - 0s 87us/sample - loss: 0.5497 - AUC: 0.7926 - val_loss: 0.5961 - val_AUC: 0.7433
    Epoch 22/30
    569/569 [==============================] - 0s 101us/sample - loss: 0.5454 - AUC: 0.7987 - val_loss: 0.5943 - val_AUC: 0.7438
    Epoch 23/30
    569/569 [==============================] - 0s 100us/sample - loss: 0.5398 - AUC: 0.8057 - val_loss: 0.5926 - val_AUC: 0.7492
    Epoch 24/30
    569/569 [==============================] - 0s 79us/sample - loss: 0.5328 - AUC: 0.8122 - val_loss: 0.5912 - val_AUC: 0.7493
    Epoch 25/30
    569/569 [==============================] - 0s 86us/sample - loss: 0.5283 - AUC: 0.8147 - val_loss: 0.5902 - val_AUC: 0.7509
    Epoch 26/30
    569/569 [==============================] - 0s 67us/sample - loss: 0.5246 - AUC: 0.8196 - val_loss: 0.5845 - val_AUC: 0.7552
    Epoch 27/30
    569/569 [==============================] - 0s 72us/sample - loss: 0.5205 - AUC: 0.8271 - val_loss: 0.5837 - val_AUC: 0.7584
    Epoch 28/30
    569/569 [==============================] - 0s 74us/sample - loss: 0.5144 - AUC: 0.8302 - val_loss: 0.5848 - val_AUC: 0.7561
    Epoch 29/30
    569/569 [==============================] - 0s 77us/sample - loss: 0.5099 - AUC: 0.8326 - val_loss: 0.5809 - val_AUC: 0.7583
    Epoch 30/30
    569/569 [==============================] - 0s 80us/sample - loss: 0.5071 - AUC: 0.8349 - val_loss: 0.5816 - val_AUC: 0.7605
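The "Train on 569 samples, validate on 143 samples" line follows from validation_split=0.2 applied to the 712 training rows. Keras documents that the validation part is taken from the end of the arrays, before any shuffling. The sketch below reproduces the two counts; rounding the split index down is an assumption that happens to match the log, and the variable names are illustrative only.

```python
import math

n_samples = 712          # rows in x_train
validation_split = 0.2   # fraction held out for validation
batch_size = 64

# Assumed rule: the first `split_at` rows train, the remaining rows validate
split_at = math.floor(n_samples * (1.0 - validation_split))
n_train = split_at
n_val = n_samples - split_at

# Number of gradient updates per epoch with the last batch partially filled
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val)   # 569 143
print(steps_per_epoch)  # 9
```

Because the held-out rows are simply the last ones, it is worth shuffling the data beforehand if the file is sorted in any meaningful way.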

4. Model Evaluation

First, we evaluate the model performance on the training and validation datasets.

    %matplotlib inline
    %config InlineBackend.figure_format = 'svg'

    import matplotlib.pyplot as plt

    def plot_metric(history, metric):
        train_metrics = history.history[metric]
        val_metrics = history.history['val_' + metric]
        epochs = range(1, len(train_metrics) + 1)
        plt.plot(epochs, train_metrics, 'bo--')
        plt.plot(epochs, val_metrics, 'ro-')
        plt.title('Training and validation ' + metric)
        plt.xlabel("Epochs")
        plt.ylabel(metric)
        plt.legend(["train_" + metric, 'val_' + metric])
        plt.show()

    plot_metric(history, "loss")

Figure 5: training and validation loss per epoch

    plot_metric(history, "AUC")

Figure 6: training and validation AUC per epoch
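Besides plotting, the per-epoch lists can be scanned for the best epoch. The sketch below does this for the last five validation values printed in the training log (epochs 26 to 30); `best_epoch` is a hypothetical helper that would work the same way on a full list such as history.history['val_loss'].

```python
def best_epoch(values, mode="min"):
    """Return (1-based position, value) of the best entry in a metric list."""
    pick = min if mode == "min" else max
    index = pick(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]

# val_loss and val_AUC for epochs 26-30, copied from the training log
val_loss_tail = [0.5845, 0.5837, 0.5848, 0.5809, 0.5816]
val_auc_tail = [0.7552, 0.7584, 0.7561, 0.7583, 0.7605]

print(best_epoch(val_loss_tail, "min"))  # (4, 0.5809), i.e. epoch 29
print(best_epoch(val_auc_tail, "max"))   # (5, 0.7605), i.e. epoch 30
```

The two metrics do not have to agree on the best epoch, as this excerpt shows.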

Let’s take a look at the performance on the testing dataset.

    model.evaluate(x=x_test, y=y_test)

    [0.5191367897907448, 0.8122605]
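The two numbers returned by evaluate are the loss (binary cross-entropy) and the metric (AUC) chosen in compile. As a rough illustration of what they measure, here is a minimal plain-Python sketch. The labels and probabilities are made up, the function names are hypothetical, and the AUC is computed exactly from pairwise rankings, whereas the Keras AUC metric approximates the area with a fixed set of thresholds, so the values will be close but not necessarily identical.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def roc_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

print(binary_crossentropy(y_true, y_prob))
print(roc_auc(y_true, y_prob))  # 0.75
```

A lower loss means better-calibrated probabilities, while a higher AUC means positives are ranked above negatives more often; the two can move independently.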

5. Model Application

    # Predict the probabilities
    model.predict(x_test[0:10])
    # model(tf.constant(x_test[0:10].values, dtype=tf.float32))  # Equivalent way

    array([[0.26501188],
           [0.40970832],
           [0.44285864],
           [0.78408605],
           [0.47650957],
           [0.43849158],
           [0.27426785],
           [0.5962582 ],
           [0.59476686],
           [0.17882936]], dtype=float32)

    # Predict the classes
    # Note: predict_classes was removed in TensorFlow 2.6; on newer versions use
    # (model.predict(x_test[0:10]) > 0.5).astype('int32') instead
    model.predict_classes(x_test[0:10])

    array([[0],
           [0],
           [0],
           [1],
           [0],
           [0],
           [0],
           [1],
           [1],
           [0]], dtype=int32)
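For a single sigmoid output, the predicted class is just the probability compared with a cut-off. The sketch below applies an assumed threshold of 0.5 to the ten probabilities printed above and arrives at the same classes.

```python
# Probabilities copied from the model.predict output
probabilities = [0.26501188, 0.40970832, 0.44285864, 0.78408605, 0.47650957,
                 0.43849158, 0.27426785, 0.5962582, 0.59476686, 0.17882936]

threshold = 0.5  # assumed cut-off for a single sigmoid output
classes = [1 if p > threshold else 0 for p in probabilities]

print(classes)  # [0, 0, 0, 1, 0, 0, 0, 1, 1, 0]
```

Moving the threshold trades false positives against false negatives without retraining the model.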

6. Model Saving

The trained model can be saved either in the Keras way or in the native TensorFlow way. The former only allows the model to be restored with Python, while the latter allows cross-platform deployment.

The latter is the recommended way to save the model.

(1) Model Saving with Keras

    # Save the model structure and parameters
    model.save('../data/keras_model.h5')

    del model  # Delete the current model

    # The reloaded model is identical to the previous one
    model = models.load_model('../data/keras_model.h5')
    model.evaluate(x_test, y_test)

    [0.5191367897907448, 0.8122605]

    # Save the model structure
    json_str = model.to_json()

    # Restore the model structure
    model_json = models.model_from_json(json_str)

    # Save the weights of the model
    model.save_weights('../data/keras_model_weight.h5')

    # Restore the model structure
    model_json = models.model_from_json(json_str)
    model_json.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['AUC']
    )

    # Load the weights
    model_json.load_weights('../data/keras_model_weight.h5')
    model_json.evaluate(x_test, y_test)

    [0.5191367897907448, 0.8122605]

(2) Model Saving with the Native Way of TensorFlow

    # Save the weights; this way only saves the weight tensors
    model.save_weights('../data/tf_model_weights.ckpt', save_format="tf")

    # Save the model structure and parameters to a file, so that the model can be deployed cross-platform
    model.save('../data/tf_model_savedmodel', save_format="tf")
    print('export saved model.')

    model_loaded = tf.keras.models.load_model('../data/tf_model_savedmodel')
    model_loaded.evaluate(x_test, y_test)

    [0.5191365896656527, 0.8122605]

