1-3, Example of a Text Data Modeling Workflow

1. Preparing the Data

The goal of the imdb dataset is to predict the sentiment label of a movie review from its text content.

The training set contains 20,000 movie review texts and the test set contains 5,000, with positive and negative reviews each making up half.

Preprocessing text data is fairly involved: it includes word segmentation for Chinese (not needed in this example), building a vocabulary, encoding tokens, padding sequences to a fixed length, building a data pipeline, and so on.

There are two common approaches to text preprocessing in TensorFlow. The first uses the Tokenizer vocabulary-building tool in tf.keras.preprocessing together with tf.keras.utils.Sequence to build a text data generator pipeline.

The second uses tf.data.Dataset together with the tf.keras.layers.experimental.preprocessing.TextVectorization preprocessing layer.

The first approach is relatively involved; a usage example can be found in the following article.

https://zhuanlan.zhihu.com/p/67697840
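For a quick taste of the first approach, here is a minimal sketch on a made-up toy corpus (the generator part built with tf.keras.utils.Sequence is omitted; the corpus and parameters are illustrative assumptions, not part of this example):

```python
# Minimal sketch of the Tokenizer route (toy corpus; generator part omitted)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

toy_texts = ["the movie was great", "the movie was terrible"]
tokenizer = Tokenizer(num_words=10000)          # keep only the most frequent words
tokenizer.fit_on_texts(toy_texts)               # build the word -> index vocabulary
seqs = tokenizer.texts_to_sequences(toy_texts)  # encode words as integer ids
padded = pad_sequences(seqs, maxlen=200)        # pad/truncate to a fixed length
print(padded.shape)  # (2, 200)
```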

The second approach is the TensorFlow-native way and is comparatively simpler.

We use the second approach here.
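Before building the full pipeline, the following toy sketch shows what TextVectorization does: adapt learns a vocabulary from a corpus, and calling the layer maps raw text to fixed-length integer sequences (the sentences and parameters here are made up for illustration):

```python
# Toy sketch of TextVectorization (made-up sentences and parameters)
import numpy as np
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

toy_texts = np.array(["the movie was great", "a terrible movie"])
demo_layer = TextVectorization(max_tokens=20, output_mode='int',
                               output_sequence_length=6)
demo_layer.adapt(toy_texts)         # build the vocabulary from the corpus
print(demo_layer.get_vocabulary())  # inspect the learned vocabulary
print(demo_layer(np.array([["the movie was great"]])))  # words -> padded int ids
```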


```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import tensorflow as tf
from tensorflow.keras import models, layers, preprocessing, optimizers, losses, metrics
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import re, string

train_data_path = "./data/imdb/train.csv"
test_data_path = "./data/imdb/test.csv"

MAX_WORDS = 10000  # Consider only the 10,000 most frequent words
MAX_LEN = 200      # Keep each sample to a length of 200 words
BATCH_SIZE = 20

# Build the data pipeline
def split_line(line):
    arr = tf.strings.split(line, "\t")
    label = tf.expand_dims(tf.cast(tf.strings.to_number(arr[0]), tf.int32), axis=0)
    text = tf.expand_dims(arr[1], axis=0)
    return (text, label)

ds_train_raw = tf.data.TextLineDataset(filenames=[train_data_path]) \
    .map(split_line, num_parallel_calls=tf.data.experimental.AUTOTUNE) \
    .shuffle(buffer_size=1000).batch(BATCH_SIZE) \
    .prefetch(tf.data.experimental.AUTOTUNE)

ds_test_raw = tf.data.TextLineDataset(filenames=[test_data_path]) \
    .map(split_line, num_parallel_calls=tf.data.experimental.AUTOTUNE) \
    .batch(BATCH_SIZE) \
    .prefetch(tf.data.experimental.AUTOTUNE)

# Build the vocabulary
def clean_text(text):
    lowercase = tf.strings.lower(text)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    cleaned_punctuation = tf.strings.regex_replace(stripped_html,
        '[%s]' % re.escape(string.punctuation), '')
    return cleaned_punctuation

vectorize_layer = TextVectorization(
    standardize=clean_text,
    split='whitespace',
    max_tokens=MAX_WORDS - 1,  # One token is reserved for the placeholder
    output_mode='int',
    output_sequence_length=MAX_LEN)

ds_text = ds_train_raw.map(lambda text, label: text)
vectorize_layer.adapt(ds_text)
print(vectorize_layer.get_vocabulary()[0:100])

# Encode the words
ds_train = ds_train_raw.map(lambda text, label: (vectorize_layer(text), label)) \
    .prefetch(tf.data.experimental.AUTOTUNE)
ds_test = ds_test_raw.map(lambda text, label: (vectorize_layer(text), label)) \
    .prefetch(tf.data.experimental.AUTOTUNE)
```
```
[b'the', b'and', b'a', b'of', b'to', b'is', b'in', b'it', b'i', b'this', b'that', b'was', b'as', b'for', b'with', b'movie', b'but', b'film', b'on', b'not', b'you', b'his', b'are', b'have', b'be', b'he', b'one', b'its', b'at', b'all', b'by', b'an', b'they', b'from', b'who', b'so', b'like', b'her', b'just', b'or', b'about', b'has', b'if', b'out', b'some', b'there', b'what', b'good', b'more', b'when', b'very', b'she', b'even', b'my', b'no', b'would', b'up', b'time', b'only', b'which', b'story', b'really', b'their', b'were', b'had', b'see', b'can', b'me', b'than', b'we', b'much', b'well', b'get', b'been', b'will', b'into', b'people', b'also', b'other', b'do', b'bad', b'because', b'great', b'first', b'how', b'him', b'most', b'dont', b'made', b'then', b'them', b'films', b'movies', b'way', b'make', b'could', b'too', b'any', b'after', b'characters']
```

2. Defining the Model

The Keras API offers three ways to build a model: stacking layers in order with Sequential, building models of arbitrary structure with the functional API, and subclassing the Model base class to build a custom model.

Here we build a custom model by subclassing the Model base class.
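For comparison, a rough Sequential sketch of a comparable architecture might look like the one below (for illustration only; it is not used in this example, and simply mirrors the layer sizes of the custom model that follows):

```python
# Rough Sequential sketch of a comparable architecture (comparison only)
alt_model = models.Sequential([
    layers.Embedding(MAX_WORDS, 7, input_length=MAX_LEN),
    layers.Conv1D(16, kernel_size=5, activation="relu"),
    layers.MaxPool1D(),
    layers.Conv1D(128, kernel_size=2, activation="relu"),
    layers.MaxPool1D(),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),
])
alt_model.summary()
```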

```python
# Demonstrates a custom model; in practice, prefer Sequential or the functional API
tf.keras.backend.clear_session()

class CnnModel(models.Model):
    def __init__(self):
        super(CnnModel, self).__init__()

    def build(self, input_shape):
        self.embedding = layers.Embedding(MAX_WORDS, 7, input_length=MAX_LEN)
        self.conv_1 = layers.Conv1D(16, kernel_size=5, name="conv_1", activation="relu")
        self.pool = layers.MaxPool1D()
        self.conv_2 = layers.Conv1D(128, kernel_size=2, name="conv_2", activation="relu")
        self.flatten = layers.Flatten()
        self.dense = layers.Dense(1, activation="sigmoid")
        super(CnnModel, self).build(input_shape)

    def call(self, x):
        x = self.embedding(x)
        x = self.conv_1(x)
        x = self.pool(x)
        x = self.conv_2(x)
        x = self.pool(x)
        x = self.flatten(x)
        x = self.dense(x)
        return x

model = CnnModel()
model.build(input_shape=(None, MAX_LEN))
model.summary()
```
```
Model: "cnn_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        multiple                  70000
_________________________________________________________________
conv_1 (Conv1D)              multiple                  576
_________________________________________________________________
max_pooling1d (MaxPooling1D) multiple                  0
_________________________________________________________________
conv_2 (Conv1D)              multiple                  4224
_________________________________________________________________
flatten (Flatten)            multiple                  0
_________________________________________________________________
dense (Dense)                multiple                  6145
=================================================================
Total params: 80,945
Trainable params: 80,945
Non-trainable params: 0
_________________________________________________________________
```

3. Training the Model

There are generally three ways to train a model: the built-in fit method, the built-in train_on_batch method, and a custom training loop. Here we train the model with a custom training loop.
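Had we used the built-in fit method instead, the whole training step would reduce to roughly the following sketch (shown only for comparison and not executed in this example):

```python
# Sketch of the built-in fit alternative (not executed in this example)
model.compile(optimizer=optimizers.Nadam(),
              loss=losses.BinaryCrossentropy(),
              metrics=[metrics.BinaryAccuracy(name="accuracy")])
history = model.fit(ds_train, validation_data=ds_test, epochs=6)
```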

```python
# Print a time-stamped divider line
@tf.function
def printbar():
    today_ts = tf.timestamp() % (24 * 60 * 60)
    hour = tf.cast(today_ts // 3600 + 8, tf.int32) % tf.constant(24)  # UTC+8
    minute = tf.cast((today_ts % 3600) // 60, tf.int32)
    second = tf.cast(tf.floor(today_ts % 60), tf.int32)

    def timeformat(m):
        if tf.strings.length(tf.strings.format("{}", m)) == 1:
            return tf.strings.format("0{}", m)
        else:
            return tf.strings.format("{}", m)

    timestring = tf.strings.join([timeformat(hour), timeformat(minute),
                                  timeformat(second)], separator=":")
    tf.print("==========" * 8 + timestring)
```
```python
optimizer = optimizers.Nadam()
loss_func = losses.BinaryCrossentropy()

train_loss = metrics.Mean(name='train_loss')
train_metric = metrics.BinaryAccuracy(name='train_accuracy')
valid_loss = metrics.Mean(name='valid_loss')
valid_metric = metrics.BinaryAccuracy(name='valid_accuracy')

@tf.function
def train_step(model, features, labels):
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        loss = loss_func(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss.update_state(loss)
    train_metric.update_state(labels, predictions)

@tf.function
def valid_step(model, features, labels):
    predictions = model(features, training=False)
    batch_loss = loss_func(labels, predictions)
    valid_loss.update_state(batch_loss)
    valid_metric.update_state(labels, predictions)

def train_model(model, ds_train, ds_valid, epochs):
    for epoch in tf.range(1, epochs + 1):
        for features, labels in ds_train:
            train_step(model, features, labels)
        for features, labels in ds_valid:
            valid_step(model, features, labels)

        # The logs template should be adjusted to match the metrics in use
        logs = 'Epoch={},Loss:{},Accuracy:{},Valid Loss:{},Valid Accuracy:{}'

        if epoch % 1 == 0:
            printbar()
            tf.print(tf.strings.format(logs,
                (epoch, train_loss.result(), train_metric.result(),
                 valid_loss.result(), valid_metric.result())))
            tf.print("")

        train_loss.reset_states()
        valid_loss.reset_states()
        train_metric.reset_states()
        valid_metric.reset_states()

train_model(model, ds_train, ds_test, epochs=6)
```
```
================================================================================13:54:08
Epoch=1,Loss:0.442317516,Accuracy:0.7695,Valid Loss:0.323672801,Valid Accuracy:0.8614
================================================================================13:54:20
Epoch=2,Loss:0.245737702,Accuracy:0.90215,Valid Loss:0.356488883,Valid Accuracy:0.8554
================================================================================13:54:32
Epoch=3,Loss:0.17360799,Accuracy:0.93455,Valid Loss:0.361132562,Valid Accuracy:0.8674
================================================================================13:54:44
Epoch=4,Loss:0.113476314,Accuracy:0.95975,Valid Loss:0.483677238,Valid Accuracy:0.856
================================================================================13:54:57
Epoch=5,Loss:0.0698405355,Accuracy:0.9768,Valid Loss:0.607856631,Valid Accuracy:0.857
================================================================================13:55:15
Epoch=6,Loss:0.0366807655,Accuracy:0.98825,Valid Loss:0.745884955,Valid Accuracy:0.854
```

4. Evaluating the Model

A model trained with a custom training loop has never been compiled, so model.evaluate(ds_valid) cannot be called on it directly. We instead reuse valid_step, as shown after the sketch below.
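One workaround, sketched here for reference, is to compile the already-trained model after the fact, which makes the built-in evaluate method available:

```python
# Sketch: compiling the trained model makes model.evaluate available
model.compile(optimizer=optimizers.Nadam(),
              loss=losses.BinaryCrossentropy(),
              metrics=[metrics.BinaryAccuracy()])
model.evaluate(ds_test)  # returns [loss, accuracy] over the dataset
```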

```python
def evaluate_model(model, ds_valid):
    for features, labels in ds_valid:
        valid_step(model, features, labels)
    logs = 'Valid Loss:{},Valid Accuracy:{}'
    tf.print(tf.strings.format(logs, (valid_loss.result(), valid_metric.result())))

    valid_loss.reset_states()
    train_metric.reset_states()
    valid_metric.reset_states()
```

```python
evaluate_model(model, ds_test)
```

```
Valid Loss:0.745884418,Valid Accuracy:0.854
```

5. Using the Model

Any of the following methods can be used for inference:

  • model.predict(ds_test)
  • model(x_test)
  • model.call(x_test)
  • model.predict_on_batch(x_test)

model.predict(ds_test) is the recommended first choice, since it works on both a Dataset and a Tensor.

```python
model.predict(ds_test)
```

```
array([[0.7864823 ],
       [0.9999901 ],
       [0.99944776],
       ...,
       [0.8498302 ],
       [0.13382755],
       [1.        ]], dtype=float32)
```
```python
for x_test, _ in ds_test.take(1):
    print(model(x_test))

# The following methods are equivalent:
# print(model.call(x_test))
# print(model.predict_on_batch(x_test))
```
```
tf.Tensor(
[[7.8648227e-01]
 [9.9999011e-01]
 [9.9944776e-01]
 [3.7153201e-09]
 [9.4462049e-01]
 [2.3522753e-04]
 [1.2044354e-04]
 [9.3752089e-07]
 [9.9996352e-01]
 [9.3435925e-01]
 [9.8746723e-01]
 [9.9908626e-01]
 [4.1563155e-08]
 [4.1808244e-03]
 [8.0184749e-05]
 [8.3910513e-01]
 [3.5167937e-05]
 [7.2113985e-01]
 [4.5228912e-03]
 [9.9942589e-01]], shape=(20, 1), dtype=float32)
```

6. Saving the Model

Saving the model in the TensorFlow-native way (the SavedModel format) is recommended.

```python
model.save('./data/tf_model_savedmodel', save_format="tf")
print('export saved model.')

model_loaded = tf.keras.models.load_model('./data/tf_model_savedmodel')
model_loaded.predict(ds_test)
```

```
array([[0.7864823 ],
       [0.9999901 ],
       [0.99944776],
       ...,
       [0.8498302 ],
       [0.13382755],
       [1.        ]], dtype=float32)
```

If you would like to discuss the contents of this book further with the author, feel free to leave a message under the WeChat public account "Python与算法之美" (Elegance of Python and Algorithms). The author's time and energy are limited, so replies will be made as circumstances allow.

You can also reply with the keyword 加群 ("join group") in the public account's backend to join the reader group for discussion.
