5-2 feature_column

Feature columns are usually applied in feature engineering for structured data and are rarely used for image or text data.

1. How to use feature columns

Feature columns are used to convert categorical features into one-hot encodings, to create bucketized features from continuous features, to generate crossed features from multiple features, and so on.

To create feature columns, call the functions in the module tf.feature_column. The nine most frequently used functions in this module are shown in the figure below. Each of these functions returns either a Categorical-Column or a Dense-Column object, except bucketized_column, whose class inherits from both of these classes.

Be careful: every Categorical-Column must be converted into a Dense-Column through indicator_column (or embedding_column) before it is fed into the model.
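
To illustrate why this conversion is needed, here is a minimal pure-Python sketch of the idea. A categorical column only yields integer ids, while a dense layer needs a fixed-width numeric vector. The helper names `vocabulary_lookup` and `to_indicator` are hypothetical, and mapping unknown values to -1 (encoded as all zeros) is a convention assumed by this sketch, not a statement about TensorFlow internals.

```python
def vocabulary_lookup(value, vocabulary):
    """Map a raw category value to its integer id (its position in the vocabulary).

    Values that are not in the vocabulary get the id -1 (assumption of this sketch).
    """
    try:
        return vocabulary.index(value)
    except ValueError:
        return -1


def to_indicator(category_id, num_categories):
    """Turn an integer id into a one-hot vector; an id of -1 gives all zeros."""
    return [1.0 if i == category_id else 0.0 for i in range(num_categories)]


sex_vocabulary = ["male", "female"]

sex_id = vocabulary_lookup("female", sex_vocabulary)      # categorical step: id 1
sex_dense = to_indicator(sex_id, len(sex_vocabulary))     # dense step: [0.0, 1.0]

unknown_dense = to_indicator(
    vocabulary_lookup("unknown", sex_vocabulary), len(sex_vocabulary))  # [0.0, 0.0]

print(sex_id, sex_dense, unknown_dense)
```

The first step corresponds conceptually to a column such as categorical_column_with_vocabulary_list, the second to wrapping it in indicator_column.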

[Figure 1: the most frequently used functions in tf.feature_column]

  • numeric_column, the most frequently used function.
  • bucketized_column, generated from a numeric column; it turns one numeric column into multiple features, one per bucket, and is one-hot encoded.
  • categorical_column_with_identity, one-hot encoded; equivalent to a bucketized column in which each bucket is a single integer.
  • categorical_column_with_vocabulary_list, one-hot encoded; the vocabulary is specified by a list.
  • categorical_column_with_vocabulary_file, one-hot encoded; the vocabulary is specified by a file.
  • categorical_column_with_hash_bucket, used when the range of integers or the vocabulary is large.
  • indicator_column, generated from a Categorical-Column; one-hot encoded.
  • embedding_column, generated from a Categorical-Column; the embedding vectors are trainable parameters that have to be learned. The recommended dimension of the embedding vector is the fourth root of the number of categories.
  • crossed_column, can be built from any categorical columns except categorical_column_with_hash_bucket.
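
Two of the ideas in this list can be sketched without TensorFlow: how a list of boundaries splits a numeric feature into buckets, and the fourth-root rule of thumb for the embedding dimension. The helper names `bucketize` and `suggested_embedding_dim` are hypothetical, and rounding the fourth root to the nearest integer is an assumption of this sketch. The boundaries are the ones used for age in the demonstration in section 2.

```python
import bisect


def bucketize(value, boundaries):
    """Return the bucket index of `value` for a sorted list of boundaries.

    n boundaries define n + 1 buckets:
    (-inf, b0), [b0, b1), ..., [b_{n-1}, +inf)
    """
    return bisect.bisect_right(boundaries, value)


def suggested_embedding_dim(num_categories):
    """Fourth root of the number of categories, rounded, and at least 1."""
    return max(1, round(num_categories ** 0.25))


age_boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

# Each age falls into exactly one of the 11 buckets.
bucket_ids = [bucketize(age, age_boundaries) for age in (5, 18, 22, 64, 65, 80)]
print(bucket_ids)                       # [0, 1, 1, 9, 10, 10]

print(suggested_embedding_dim(32))      # 2
print(suggested_embedding_dim(10000))   # 10
```

A bucket index produced this way can then be one-hot encoded, which is why a bucketized column with 10 boundaries contributes 11 features.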

2. Demonstration of feature columns

Here is a complete example that solves the Titanic survival prediction problem using feature columns.

    import datetime
    import numpy as np
    import pandas as pd
    from matplotlib import pyplot as plt
    import tensorflow as tf
    from tensorflow.keras import layers, models

    # Print a timestamped log message
    def printlog(info):
        nowtime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        print("\n" + "==========" * 8 + "%s" % nowtime)
        print(info + '...\n\n')

    #================================================================================
    # 1. Constructing data pipeline
    #================================================================================
    printlog("step1: prepare dataset...")

    dftrain_raw = pd.read_csv("../data/titanic/train.csv")
    dftest_raw = pd.read_csv("../data/titanic/test.csv")

    dfraw = pd.concat([dftrain_raw, dftest_raw], ignore_index=True)

    def prepare_dfdata(dfraw):
        dfdata = dfraw.copy()
        dfdata.columns = [x.lower() for x in dfdata.columns]
        dfdata = dfdata.rename(columns={'survived': 'label'})
        dfdata = dfdata.drop(['passengerid', 'name'], axis=1)
        for col, dtype in dict(dfdata.dtypes).items():
            # Check whether the column has missing values
            if dfdata[col].hasnans:
                # Add an indicator column that flags the missing entries
                dfdata[col + '_nan'] = pd.isna(dfdata[col]).astype('int32')
                # Fill numeric columns with the mean, all other columns with ''
                if pd.api.types.is_numeric_dtype(dtype):
                    dfdata[col] = dfdata[col].fillna(dfdata[col].mean())
                else:
                    dfdata[col] = dfdata[col].fillna('')
        return dfdata

    dfdata = prepare_dfdata(dfraw)
    dftrain = dfdata.iloc[0:len(dftrain_raw), :]
    dftest = dfdata.iloc[len(dftrain_raw):, :]

    # Build a tf.data.Dataset from a DataFrame
    def df_to_dataset(df, shuffle=True, batch_size=32):
        dfdata = df.copy()
        if 'label' not in dfdata.columns:
            ds = tf.data.Dataset.from_tensor_slices(dfdata.to_dict(orient='list'))
        else:
            labels = dfdata.pop('label')
            ds = tf.data.Dataset.from_tensor_slices((dfdata.to_dict(orient='list'), labels))
        if shuffle:
            ds = ds.shuffle(buffer_size=len(dfdata))
        ds = ds.batch(batch_size)
        return ds

    ds_train = df_to_dataset(dftrain)
    ds_test = df_to_dataset(dftest)
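
The missing-value handling in the data pipeline follows a simple pattern: record where values were missing in a separate flag column, then fill the gaps (with the mean for numeric columns, with an empty string otherwise). As a rough illustration of that pattern only, here is a pure-Python sketch on a made-up toy column. The helper name `impute_with_flag` and the sample values are hypothetical and do not come from the Titanic data.

```python
def impute_with_flag(values, fill=None):
    """Fill missing entries (None) and return (filled_values, missing_flags).

    If `fill` is None, the mean of the present values is used, which only
    makes sense for numeric columns. Pass fill='' for string columns.
    """
    flags = [1 if v is None else 0 for v in values]
    if fill is None:
        present = [v for v in values if v is not None]
        fill = sum(present) / len(present)
    filled = [fill if v is None else v for v in values]
    return filled, flags


# Toy numeric column: the mean of 22, 26 and 36 is 28.
ages, ages_nan = impute_with_flag([22.0, None, 26.0, None, 36.0])
print(ages)      # [22.0, 28.0, 26.0, 28.0, 36.0]
print(ages_nan)  # [0, 1, 0, 1, 0]

# Toy string column, filled with the empty string.
cabins, cabins_nan = impute_with_flag(["C85", None], fill="")
print(cabins, cabins_nan)  # ['C85', ''] [0, 1]
```

Keeping the flag alongside the filled value lets the model distinguish a genuinely average value from one that was imputed.
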

    #================================================================================
    # 2. Defining the feature columns
    #================================================================================
    printlog("step2: make feature columns...")

    feature_columns = []

    # Numeric columns
    for col in ['age', 'fare', 'parch', 'sibsp'] + [
            c for c in dfdata.columns if c.endswith('_nan')]:
        feature_columns.append(tf.feature_column.numeric_column(col))

    # Bucketized column
    age = tf.feature_column.numeric_column('age')
    age_buckets = tf.feature_column.bucketized_column(
        age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
    feature_columns.append(age_buckets)

    # Categorical columns
    # NOTE: every Categorical-Column must be converted into a Dense-Column
    # through `indicator_column` before it is fed into the model.
    sex = tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            key='sex', vocabulary_list=["male", "female"]))
    feature_columns.append(sex)

    pclass = tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            key='pclass', vocabulary_list=[1, 2, 3]))
    feature_columns.append(pclass)

    ticket = tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_hash_bucket('ticket', 3))
    feature_columns.append(ticket)

    # Ports of embarkation in the Titanic data are S, C and Q
    embarked = tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            key='embarked', vocabulary_list=['S', 'C', 'Q']))
    feature_columns.append(embarked)

    # Embedding column
    cabin = tf.feature_column.embedding_column(
        tf.feature_column.categorical_column_with_hash_bucket('cabin', 32), 2)
    feature_columns.append(cabin)

    # Crossed column
    pclass_cate = tf.feature_column.categorical_column_with_vocabulary_list(
        key='pclass', vocabulary_list=[1, 2, 3])

    crossed_feature = tf.feature_column.indicator_column(
        tf.feature_column.crossed_column([age_buckets, pclass_cate], hash_bucket_size=15))
    feature_columns.append(crossed_feature)
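
The hash-bucket and crossed columns above both rely on hashing a value into a fixed number of buckets. The following pure-Python sketch shows the idea only. It is not how TensorFlow implements these columns: the helper names `hash_bucket` and `cross`, the use of SHA-256, and the `_X_` separator are assumptions made for illustration, so the bucket ids will differ from the ones TensorFlow produces.

```python
import hashlib


def hash_bucket(value, num_buckets):
    """Deterministically map any value to an integer id in [0, num_buckets)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def cross(values, num_buckets):
    """Combine several categorical values into one feature and hash the combination."""
    combined = "_X_".join(str(v) for v in values)
    return hash_bucket(combined, num_buckets)


# A free-text feature such as the ticket number, squeezed into 3 buckets.
ticket_id = hash_bucket("A/5 21171", 3)

# A cross of an age bucket index and a passenger class, squeezed into 15 buckets.
age_x_pclass_id = cross([1, 3], 15)

print(ticket_id, age_x_pclass_id)
```

Because the number of buckets is fixed in advance, different inputs can land in the same bucket. With only 3 buckets for the ticket feature such collisions are unavoidable, which is the price paid for not having to store a vocabulary.
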

    #================================================================================
    # 3. Defining the model
    #================================================================================
    printlog("step3: define model...")

    tf.keras.backend.clear_session()
    model = tf.keras.Sequential([
        layers.DenseFeatures(feature_columns),  # Pass the feature columns to tf.keras.layers.DenseFeatures
        layers.Dense(64, activation='relu'),
        layers.Dense(64, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ])

    #================================================================================
    # 4. Training the model
    #================================================================================
    printlog("step4: train model...")

    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    history = model.fit(ds_train,
                        validation_data=ds_test,
                        epochs=10)

    #================================================================================
    # 5. Evaluating the model
    #================================================================================
    printlog("step5: eval model...")

    model.summary()

    %matplotlib inline
    %config InlineBackend.figure_format = 'svg'

    import matplotlib.pyplot as plt

    def plot_metric(history, metric):
        train_metrics = history.history[metric]
        val_metrics = history.history['val_' + metric]
        epochs = range(1, len(train_metrics) + 1)
        plt.plot(epochs, train_metrics, 'bo--')
        plt.plot(epochs, val_metrics, 'ro-')
        plt.title('Training and validation ' + metric)
        plt.xlabel("Epochs")
        plt.ylabel(metric)
        plt.legend(["train_" + metric, 'val_' + metric])
        plt.show()

    plot_metric(history, "accuracy")

    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    dense_features (DenseFeature multiple                  64
    _________________________________________________________________
    dense (Dense)                multiple                  3008
    _________________________________________________________________
    dense_1 (Dense)              multiple                  4160
    _________________________________________________________________
    dense_2 (Dense)              multiple                  65
    =================================================================
    Total params: 7,297
    Trainable params: 7,297
    Non-trainable params: 0
    _________________________________________________________________
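
The parameter counts in this summary can be checked by hand from the feature column definitions. The sketch below assumes that the data contain exactly three `_nan` flag columns. That number is not stated in the text; it is inferred from the 3008 parameters reported for the first Dense layer. All other widths follow from the feature columns defined in step 2.

```python
# Output width contributed by each feature column in the demo.
feature_widths = {
    "numeric: age, fare, parch, sibsp": 4,
    "numeric: '_nan' flags (ASSUMED to be 3 columns)": 3,
    "age_buckets: 10 boundaries give 11 buckets": 11,
    "sex indicator: 2 vocabulary entries": 2,
    "pclass indicator: 3 vocabulary entries": 3,
    "ticket indicator: 3 hash buckets": 3,
    "embarked indicator: 3 vocabulary entries": 3,
    "cabin embedding: dimension 2": 2,
    "age x pclass crossed indicator: 15 hash buckets": 15,
}
input_width = sum(feature_widths.values())


def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


param_counts = {
    # The only trainable part of the feature layer is the cabin embedding
    # table: 32 hash buckets times 2 dimensions.
    "dense_features": 32 * 2,
    "dense": dense_params(input_width, 64),
    "dense_1": dense_params(64, 64),
    "dense_2": dense_params(64, 1),
}
total_params = sum(param_counts.values())

print(input_width)    # 46
print(param_counts)   # {'dense_features': 64, 'dense': 3008, 'dense_1': 4160, 'dense_2': 65}
print(total_params)   # 7297
```

Under that assumption the feature layer outputs a vector of width 46, and the per-layer counts add up to the 7,297 parameters reported above.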

[Figure 2: training and validation accuracy]

If you would like to discuss the content with the author, please leave a comment in the WeChat official account “Python与算法之美” (Elegance of Python and Algorithms). The author will do their best to reply within the limited time available.

You are also welcome to join the readers' group chat by replying 加群 (join group) in the WeChat official account.
