Overview

In the previous section we studied how to optimize resource deployment, using single-GPU and distributed setups to make model training more efficient. In this section we continue to unfold the "horizontal-and-vertical" teaching approach horizontally, as shown in Figure 1, and explore ways to debug and optimize the model during the training stage of the handwritten digit recognition task, so as to ensure the model performs well in practice.

[Handwritten Digit Recognition] Training Debugging and Optimization - Figure 1

Figure 1: The "horizontal-and-vertical" teaching approach (training stage)

There are five key steps for optimizing the training process:

1. Compute the classification accuracy to observe the training effect.

The cross-entropy loss can only serve as the optimization objective; it does not directly measure how well the model is trained. Accuracy measures the training effect directly, but because it is discrete it is not suitable as a loss function for optimizing a neural network.

2. Inspect the training process to identify potential problems.

If the model's loss or evaluation metric behaves abnormally, it is usually necessary to print the inputs and outputs of each layer to locate the problem, and to analyze each layer's contents to find the cause of the error.

3. Add validation or testing to evaluate the model more reliably.

Ideally, a trained model reaches high accuracy on both the training set and the validation set. If accuracy is low on both, the network is undertrained (underfitting); if accuracy on the training set is clearly higher than on the validation set, overfitting has probably occurred. Overfitting can be addressed by adding a regularization term to the optimization objective.

4. Add a regularization term to prevent overfitting.

The PaddlePaddle framework supports adding a regularization term over all parameters at once, which is the usual practice. It also supports adding a regularization term to a single layer or part of the network, for finer-grained control of parameter training.

5. Visual analysis.

Besides printing values or plotting with the matplotlib library, users can also turn to tb-paddle, a more specialized third-party plotting library integrated with PaddlePaddle, for convenient visual analysis.

Computing the model's classification accuracy

Accuracy is an intuitive measure of a classification model's quality. Because it is discrete, it is not suitable as a loss to optimize; in general, though, a model with lower cross-entropy loss also achieves higher classification accuracy. Classification accuracy gives us a fair way to compare the merits of two loss functions, such as the comparison between mean squared error and cross-entropy in Section 2-5.
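
As a minimal, framework-independent sketch of what accuracy computes (the probabilities and labels below are made-up values for illustration), it is simply the fraction of samples whose highest-probability class matches the ground-truth label:

import numpy as np

# Predicted class probabilities for three samples (made-up values)
probs = np.array([[0.1, 0.8, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.2, 0.6]])
labels = np.array([1, 2, 2])  # ground-truth classes

# Fraction of samples whose argmax matches the label: 2 of 3 -> 0.666...
acc = (probs.argmax(axis=1) == labels).mean()
print(acc)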

PaddlePaddle provides an API for this: fluid.layers.accuracy computes the classification accuracy directly, taking the predicted classification result input and the corresponding label label as inputs.

In the code below, we compute the classification accuracy inside the model's forward function and print each batch's accuracy during training.

# Load the required libraries
import os
import random
import paddle
import paddle.fluid as fluid
from paddle.fluid.dygraph.nn import Conv2D, Pool2D, Linear
import numpy as np
from PIL import Image
import gzip
import json

# Define the dataset reader
def load_data(mode='train'):
    # Read the data file
    datafile = './work/mnist.json.gz'
    print('loading mnist dataset from {} ......'.format(datafile))
    data = json.load(gzip.open(datafile))
    # Unpack the training, validation, and test sets
    train_set, val_set, eval_set = data
    # Dataset parameters: image height IMG_ROWS and width IMG_COLS
    IMG_ROWS = 28
    IMG_COLS = 28
    # Choose the training, validation, or test set based on mode
    if mode == 'train':
        imgs = train_set[0]
        labels = train_set[1]
    elif mode == 'valid':
        imgs = val_set[0]
        labels = val_set[1]
    elif mode == 'eval':
        imgs = eval_set[0]
        labels = eval_set[1]
    # Total number of images
    imgs_length = len(imgs)
    # Check that the numbers of images and labels match
    assert len(imgs) == len(labels), \
        "length of train_imgs({}) should be the same as train_labels({})".format(
            len(imgs), len(labels))
    index_list = list(range(imgs_length))
    # Batch size used when reading the data
    BATCHSIZE = 100

    # Define the data generator
    def data_generator():
        # In training mode, shuffle the training data
        if mode == 'train':
            random.shuffle(index_list)
        imgs_list = []
        labels_list = []
        # Read samples by index
        for i in index_list:
            # Read an image and its label; convert their shape and type
            img = np.reshape(imgs[i], [1, IMG_ROWS, IMG_COLS]).astype('float32')
            label = np.reshape(labels[i], [1]).astype('int64')
            imgs_list.append(img)
            labels_list.append(label)
            # Once the buffer reaches the batch size, yield one batch
            if len(imgs_list) == BATCHSIZE:
                yield np.array(imgs_list), np.array(labels_list)
                # Clear the buffers
                imgs_list = []
                labels_list = []
        # If fewer than BATCHSIZE samples remain, they form one final,
        # smaller mini-batch of size len(imgs_list)
        if len(imgs_list) > 0:
            yield np.array(imgs_list), np.array(labels_list)
    return data_generator

# Define the model structure
class MNIST(fluid.dygraph.Layer):
    def __init__(self, name_scope):
        super(MNIST, self).__init__(name_scope)
        name_scope = self.full_name()
        # A convolution layer with relu activation
        self.conv1 = Conv2D(num_channels=1, num_filters=20, filter_size=5, stride=1, padding=2, act='relu')
        # A max-pooling layer with kernel size 2 and stride 2
        self.pool1 = Pool2D(pool_size=2, pool_stride=2, pool_type='max')
        # A convolution layer with relu activation
        self.conv2 = Conv2D(num_channels=20, num_filters=20, filter_size=5, stride=1, padding=2, act='relu')
        # A max-pooling layer with kernel size 2 and stride 2
        self.pool2 = Pool2D(pool_size=2, pool_stride=2, pool_type='max')
        # A fully connected layer with 10 output nodes
        self.fc = Linear(input_dim=980, output_dim=10, act='softmax')

    # Define the network's forward pass
    def forward(self, inputs, label):
        x = self.conv1(inputs)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.pool2(x)
        x = fluid.layers.reshape(x, [x.shape[0], 980])
        x = self.fc(x)
        if label is not None:
            acc = fluid.layers.accuracy(input=x, label=label)
            return x, acc
        else:
            return x

# Call the data-loading function
train_loader = load_data('train')
# On a GPU machine, set use_gpu to True
use_gpu = False
place = fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace()
with fluid.dygraph.guard(place):
    model = MNIST("mnist")
    model.train()
    # Four optimizer choices; try each to compare their effect
    optimizer = fluid.optimizer.SGDOptimizer(learning_rate=0.01, parameter_list=model.parameters())
    #optimizer = fluid.optimizer.MomentumOptimizer(learning_rate=0.01, momentum=0.9, parameter_list=model.parameters())
    #optimizer = fluid.optimizer.AdagradOptimizer(learning_rate=0.01, parameter_list=model.parameters())
    #optimizer = fluid.optimizer.AdamOptimizer(learning_rate=0.01, parameter_list=model.parameters())
    EPOCH_NUM = 5
    for epoch_id in range(EPOCH_NUM):
        for batch_id, data in enumerate(train_loader()):
            # Prepare the data
            image_data, label_data = data
            image = fluid.dygraph.to_variable(image_data)
            label = fluid.dygraph.to_variable(label_data)
            # Forward pass: get both the model output and the accuracy
            predict, acc = model(image, label)
            # Compute the loss, averaged over the batch
            loss = fluid.layers.cross_entropy(predict, label)
            avg_loss = fluid.layers.mean(loss)
            # Print the current loss every 200 batches
            if batch_id % 200 == 0:
                print("epoch: {}, batch: {}, loss is: {}, acc is {}".format(epoch_id, batch_id, avg_loss.numpy(), acc.numpy()))
            # Backward pass and parameter update
            avg_loss.backward()
            optimizer.minimize(avg_loss)
            model.clear_gradients()
    # Save the model parameters
    fluid.save_dygraph(model.state_dict(), 'mnist')
loading mnist dataset from ./work/mnist.json.gz ......
epoch: 0, batch: 0, loss is: [2.796657], acc is [0.04]
epoch: 0, batch: 200, loss is: [0.50403804], acc is [0.88]
epoch: 0, batch: 400, loss is: [0.2659506], acc is [0.92]
epoch: 1, batch: 0, loss is: [0.22079289], acc is [0.92]
epoch: 1, batch: 200, loss is: [0.23240374], acc is [0.92]
epoch: 1, batch: 400, loss is: [0.16370663], acc is [0.95]
epoch: 2, batch: 0, loss is: [0.37291032], acc is [0.92]
epoch: 2, batch: 200, loss is: [0.23772442], acc is [0.92]
epoch: 2, batch: 400, loss is: [0.18071894], acc is [0.95]
epoch: 3, batch: 0, loss is: [0.15938215], acc is [0.95]
epoch: 3, batch: 200, loss is: [0.21112804], acc is [0.92]
epoch: 3, batch: 400, loss is: [0.05794979], acc is [0.99]
epoch: 4, batch: 0, loss is: [0.24466723], acc is [0.93]
epoch: 4, batch: 200, loss is: [0.14045799], acc is [0.96]
epoch: 4, batch: 400, loss is: [0.12366832], acc is [0.94]

Inspecting the training process to identify potential problems

Unlike the high-level APIs of some deep learning frameworks, PaddlePaddle's dynamic-graph programming makes it easy to inspect and debug what happens during training. In the network's forward function you can print the shapes of each layer's inputs and outputs, as well as each layer's parameters. Examining this information not only gives a better understanding of how training executes, but can also expose potential problems or suggest ideas for further optimization.

In the program below, the check_shape flag controls whether shapes are printed, to verify that the network structure is correct; the check_content flag controls whether values are printed, to verify that the data distributions are reasonable. If, say, part of an intermediate layer's output stays at zero throughout training, that part of the network is badly designed and is not being used effectively.
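
Before the full listing, here is a quick, framework-independent sketch of that "stuck at zero" check (the activations below are stand-in random data, not output from the model): measure the fraction of zeros per channel and flag channels that are almost always zero.

import numpy as np

# Stand-in activations shaped [batch, channels, height, width];
# ReLU has already zeroed out the negative values
acts = np.maximum(np.random.randn(100, 20, 28, 28), 0)

# Fraction of zero outputs per channel, over the batch and spatial dims
zero_fraction = (acts == 0).mean(axis=(0, 2, 3))
print("suspect channels:", np.where(zero_fraction > 0.99)[0])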

# Define the model structure
class MNIST(fluid.dygraph.Layer):
    def __init__(self, name_scope):
        super(MNIST, self).__init__(name_scope)
        name_scope = self.full_name()
        # A convolution layer with relu activation
        self.conv1 = Conv2D(num_channels=1, num_filters=20, filter_size=5, stride=1, padding=2, act='relu')
        # A max-pooling layer with kernel size 2 and stride 2
        self.pool1 = Pool2D(pool_size=2, pool_stride=2, pool_type='max')
        # A convolution layer with relu activation
        self.conv2 = Conv2D(num_channels=20, num_filters=20, filter_size=5, stride=1, padding=2, act='relu')
        # A max-pooling layer with kernel size 2 and stride 2
        self.pool2 = Pool2D(pool_size=2, pool_stride=2, pool_type='max')
        # A fully connected layer with 10 output nodes
        self.fc = Linear(input_dim=980, output_dim=10, act='softmax')

    # Add printing of the shapes and contents of every layer's inputs and
    # outputs; the check flags decide whether to print
    def forward(self, inputs, label=None, check_shape=False, check_content=False):
        # Give each layer's output its own name for easier debugging
        outputs1 = self.conv1(inputs)
        outputs2 = self.pool1(outputs1)
        outputs3 = self.conv2(outputs2)
        outputs4 = self.pool2(outputs3)
        _outputs4 = fluid.layers.reshape(outputs4, [outputs4.shape[0], -1])
        outputs5 = self.fc(_outputs4)
        # Optionally print each layer's parameter and output shapes, to
        # verify that the network structure is set up correctly
        if check_shape:
            # Print each layer's hyperparameters: kernel size, stride,
            # padding, and pooling size
            print("\n########## print network layer's superparams ##############")
            print("conv1-- kernel_size:{}, padding:{}, stride:{}".format(self.conv1.weight.shape, self.conv1._padding, self.conv1._stride))
            print("conv2-- kernel_size:{}, padding:{}, stride:{}".format(self.conv2.weight.shape, self.conv2._padding, self.conv2._stride))
            print("pool1-- pool_type:{}, pool_size:{}, pool_stride:{}".format(self.pool1._pool_type, self.pool1._pool_size, self.pool1._pool_stride))
            print("pool2-- pool_type:{}, pool_size:{}, pool_stride:{}".format(self.pool2._pool_type, self.pool2._pool_size, self.pool2._pool_stride))
            print("fc-- weight_size:{}, bias_size:{}, activation:{}".format(self.fc.weight.shape, self.fc.bias.shape, self.fc._act))
            # Print each layer's output shape
            print("\n########## print shape of features of every layer ###############")
            print("inputs_shape: {}".format(inputs.shape))
            print("outputs1_shape: {}".format(outputs1.shape))
            print("outputs2_shape: {}".format(outputs2.shape))
            print("outputs3_shape: {}".format(outputs3.shape))
            print("outputs4_shape: {}".format(outputs4.shape))
            print("outputs5_shape: {}".format(outputs5.shape))
        # Optionally print parameters and output values during training,
        # which can be used for debugging
        if check_content:
            # Print the convolution kernels; the weights are numerous, so
            # only part of them is printed here
            print("\n########## print convolution layer's kernel ###############")
            print("conv1 params -- kernel weights:", self.conv1.weight[0][0])
            print("conv2 params -- kernel weights:", self.conv2.weight[0][0])
            # Draw random indices and print the output of one channel
            idx1 = np.random.randint(0, outputs1.shape[1])
            idx2 = np.random.randint(0, outputs3.shape[1])
            # Print the post-convolution/pooling features, only for the
            # first image in the batch
            print("\nThe {}th channel of conv1 layer: ".format(idx1), outputs1[0][idx1])
            print("The {}th channel of conv2 layer: ".format(idx2), outputs3[0][idx2])
            print("The output of last layer:", outputs5[0], '\n')
        # If a label is given, compute and return the classification
        # accuracy as well
        if label is not None:
            acc = fluid.layers.accuracy(input=outputs5, label=label)
            return outputs5, acc
        else:
            return outputs5

# On a GPU machine, set use_gpu to True
use_gpu = False
place = fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace()
with fluid.dygraph.guard(place):
    model = MNIST("mnist")
    model.train()
    # Four optimizer choices; try each to compare their effect
    optimizer = fluid.optimizer.SGDOptimizer(learning_rate=0.01, parameter_list=model.parameters())
    #optimizer = fluid.optimizer.MomentumOptimizer(learning_rate=0.01, momentum=0.9, parameter_list=model.parameters())
    #optimizer = fluid.optimizer.AdagradOptimizer(learning_rate=0.01, parameter_list=model.parameters())
    #optimizer = fluid.optimizer.AdamOptimizer(learning_rate=0.01, parameter_list=model.parameters())
    EPOCH_NUM = 1
    for epoch_id in range(EPOCH_NUM):
        for batch_id, data in enumerate(train_loader()):
            # Prepare the data
            image_data, label_data = data
            image = fluid.dygraph.to_variable(image_data)
            label = fluid.dygraph.to_variable(label_data)
            # Forward pass: get both the model output and the accuracy
            if batch_id == 0 and epoch_id == 0:
                # Print the model parameters and each layer's output shape
                predict, acc = model(image, label, check_shape=True, check_content=False)
            elif batch_id == 401:
                # Print the model parameters and each layer's output values
                predict, acc = model(image, label, check_shape=False, check_content=True)
            else:
                predict, acc = model(image, label)
            # Compute the loss, averaged over the batch
            loss = fluid.layers.cross_entropy(predict, label)
            avg_loss = fluid.layers.mean(loss)
            # Print the current loss every 200 batches
            if batch_id % 200 == 0:
                print("epoch: {}, batch: {}, loss is: {}, acc is {}".format(epoch_id, batch_id, avg_loss.numpy(), acc.numpy()))
            # Backward pass and parameter update
            avg_loss.backward()
            optimizer.minimize(avg_loss)
            model.clear_gradients()
    # Save the model parameters
    fluid.save_dygraph(model.state_dict(), 'mnist')
########## print network layer's superparams ##############
conv1-- kernel_size:[20, 1, 5, 5], padding:[2, 2], stride:[1, 1]
conv2-- kernel_size:[20, 20, 5, 5], padding:[2, 2], stride:[1, 1]
pool1-- pool_type:max, pool_size:[2, 2], pool_stride:[2, 2]
pool2-- pool_type:max, pool_size:[2, 2], pool_stride:[2, 2]
fc-- weight_size:[980, 10], bias_size:[10], activation:softmax

########## print shape of features of every layer ###############
inputs_shape: [100, 1, 28, 28]
outputs1_shape: [100, 20, 28, 28]
outputs2_shape: [100, 20, 14, 14]
outputs3_shape: [100, 20, 14, 14]
outputs4_shape: [100, 20, 7, 7]
outputs5_shape: [100, 10]
epoch: 0, batch: 0, loss is: [2.7973385], acc is [0.11]
epoch: 0, batch: 200, loss is: [0.39955115], acc is [0.91]
epoch: 0, batch: 400, loss is: [0.33634004], acc is [0.88]

########## print convolution layer's kernel ###############
conv1 params -- kernel weights: name tmp_9640, dtype: VarType.FP32 shape: [5, 5] lod: {}
    dim: 5, 5
    layout: NCHW
    dtype: float
    data: [0.130294 0.141945 0.277612 -0.0341354 0.274604 0.768543 -0.27499 0.251671 -0.295482 0.165828 0.0257012 0.426327 -0.181795 -0.18254 -0.0629882 0.229182 0.582581 -0.0630182 -0.425838 0.0604263 0.0152962 0.421092 0.461917 0.157516 0.175732]

conv2 params -- kernel weights: name tmp_9642, dtype: VarType.FP32 shape: [5, 5] lod: {}
    dim: 5, 5
    layout: NCHW
    dtype: float
    data: [0.0617335 0.0566256 0.0415355 0.0576892 0.0206998 -0.0190645 -0.0968885 -0.0496854 -0.0412527 -0.120323 0.0807334 0.00299106 -0.137337 0.0409209 -0.0233036 0.0249631 0.0875957 0.0634051 -0.0977861 -0.00262945 -0.0182598 0.158969 -0.0510049 -0.0173583 -0.0254823]

The 12th channel of conv1 layer: name tmp_9644, dtype: VarType.FP32 shape: [28, 28] lod: {}
    dim: 28, 28
    layout: NCHW
    dtype: float
    data: [0.000886112 0.000886112 0.000886112 0.000886112 0.000886112 0.000886112 0.000886112 0.000886112 0.000886112 0.000886112 0.000886112 0.000886112 0.000886112 0.000886112 ...]

The 13th channel of conv2 layer: name tmp_9646, dtype: VarType.FP32 shape: [14, 14] lod: {}
    dim: 14, 14
    layout: NCHW
    dtype: float
    data: [0 0 0 0 0 0 0 0 0.982631 2.07982 2.10505 1.07689 0.196902 0 0 0 0 0 0 0 0 0.270001 2.19605 3.58746 3.24089 2.11906 0.702639 0.00795837 ...]

The output of last layer: name tmp_9647, dtype: VarType.FP32 shape: [10] lod: {}
    dim: 10
    layout: NCHW
    dtype: float
    data: [0.996018 2.24414e-08 0.0012224 3.51778e-05 1.104e-06 0.000856923 0.000374143 1.57007e-05 0.00130874 0.000167662]

Adding validation or testing to better evaluate the model

During training we see the model's loss on the training samples keep decreasing. But does that mean the model will still be effective in its future application? To verify the model's effectiveness, the samples are usually split into three sets: a training set, a validation set, and a test set.

  • Training set: used to train the model's parameters, the main work done during training.
  • Validation set: used to select the model's hyperparameters, such as adjustments to the network structure or the weight of the regularization term.
  • Test set: used to simulate the model's real-world performance after deployment. Because the test set takes no part in model optimization or parameter training, its samples are completely unseen by the model. When the validation data is not used to tune the network structure or hyperparameters, results on the validation and test data are similar: both reflect the model's true performance. A minimal sketch of such a three-way split is shown below.
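
The split below is only an illustrative sketch, assuming the samples are already shuffled; the 8:1:1 ratio is a common but arbitrary choice, not something fixed by the MNIST reader used in this section.

# Stand-in for a shuffled dataset
samples = list(range(100))

n = len(samples)
train_set = samples[: int(0.8 * n)]            # 80%: train the parameters
val_set = samples[int(0.8 * n): int(0.9 * n)]  # 10%: tune hyperparameters
test_set = samples[int(0.9 * n):]              # 10%: held out, fully unseen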

The program below loads the model parameters saved by the previous training step, reads the test dataset (mode 'eval' of the reader defined earlier), and measures the model's performance on it.

with fluid.dygraph.guard():
    print('start evaluation .......')
    # Load the model parameters
    model = MNIST("mnist")
    model_state_dict, _ = fluid.load_dygraph('mnist')
    model.load_dict(model_state_dict)
    model.eval()
    eval_loader = load_data('eval')
    acc_set = []
    avg_loss_set = []
    for batch_id, data in enumerate(eval_loader()):
        x_data, y_data = data
        img = fluid.dygraph.to_variable(x_data)
        label = fluid.dygraph.to_variable(y_data)
        prediction, acc = model(img, label)
        loss = fluid.layers.cross_entropy(input=prediction, label=label)
        avg_loss = fluid.layers.mean(loss)
        acc_set.append(float(acc.numpy()))
        avg_loss_set.append(float(avg_loss.numpy()))
    # Average the loss and accuracy over all batches
    acc_val_mean = np.array(acc_set).mean()
    avg_loss_val_mean = np.array(avg_loss_set).mean()
    print('loss={}, acc={}'.format(avg_loss_val_mean, acc_val_mean))
start evaluation .......
loading mnist dataset from ./work/mnist.json.gz ......
loss=0.2429321705363691, acc=0.9323000007867813

Judging from these results, the model still achieves 93% accuracy on data it has never seen before, which shows that it has genuine predictive power.

Adding a regularization term to prevent overfitting

The overfitting phenomenon

For complex tasks that require a powerful model but offer only a limited number of samples, the model easily overfits: its loss is small on the training set but noticeably larger on the validation or test set, as shown in Figure 2.

[Handwritten Digit Recognition] Training Debugging and Optimization - Figure 2

Figure 2: In overfitting, the training error keeps falling while the test error first falls and then rises

Conversely, if the model's loss is large on both the training set and the test set, it is said to underfit. Overfitting means the model is too sensitive: it has learned some of the noise in the training data, and that noise is not part of the true, generalizable pattern (the pattern that carries over to the test set). Underfitting means the model is not yet powerful enough: it has not even fit the known training samples well, let alone the test samples. Underfitting is easy to observe and remedy: as long as the training loss is unsatisfactory, keep switching to a more powerful model. In practice, therefore, overfitting is the problem we really need to handle.

Causes of overfitting

Overfitting happens when the model is too sensitive while the training data is too scarce or too noisy.

As Figure 3 shows, the ideal regression model is a gently sloped parabola. The underfit model fits only a straight line, clearly failing to capture the true pattern; the overfit model fits a curve with many inflection points, which is overly sensitive and also fails to express the true pattern.

[Handwritten Digit Recognition] Training Debugging and Optimization - Figure 3

Figure 3: A regression model in the overfit, ideal, and underfit states

As Figure 4 shows, the ideal classification model has a semicircular decision boundary. The underfit model uses a straight line as the boundary, clearly missing the true one; the overfit model produces a highly contorted boundary that classifies all the training data correctly, but the concessions it makes to a few outlier samples are very unlikely to reflect the true pattern.

[Handwritten Digit Recognition] Training Debugging and Optimization - Figure 4

Figure 4: A classification model in the underfit, ideal, and overfit states

Causes and prevention of overfitting

To better understand how overfitting arises, consider the analogy of a detective trying to identify a culprit, as shown in Figure 5.

[Handwritten Digit Recognition] Training Debugging and Optimization - Figure 5

Figure 5: The detective-identifies-culprit analogy for model hypotheses

Suppose the detective can get it wrong. Analysis suggests two possible causes:

  • Case 1: the evidence is wrong; searching for the culprit from faulty evidence is a fool's errand.

  • Case 2: the search scope is too broad while the evidence is too scarce, so too many candidates (suspects) fit the evidence and the culprit cannot be pinned down.

The detective then has two remedies: narrow the search (for instance, assume the crime could only have been committed by an acquaintance), or gather more evidence.

Carrying this over to deep learning, suppose the model can also get it wrong. Analysis suggests two analogous causes:

  • Case 1: the training data contains noise, and the model learns the noise rather than the true pattern.

  • Case 2: a powerful model (one with a large representation space) is combined with too little training data, so too many candidate hypotheses perform well on the training data, and the model locks onto a "spuriously correct" one.

Case 1 is handled by cleaning and correcting the data. Case 2 is handled either by limiting the model's representational capacity or by collecting more training data.

But cleaning the errors out of the training data, or collecting more of it, is often a truism: at any time we would like more and higher-quality data. In a real project, the faster, cheaper, and more controllable way to rein in overfitting is to limit the model's representational capacity.

Regularization term

To prevent overfitting when the sample size cannot be increased, the only option is to reduce the model's complexity, which can be done by limiting the number of parameters or their possible values (keeping parameter values as small as possible).

Concretely, a penalty on the scale of the parameters is added by hand to the model's optimization objective (the loss). The more parameters there are, or the larger their values, the larger the penalty. By adjusting the weight coefficient of the penalty, the model can balance "minimizing the training loss" against "preserving its generalization ability", where generalization ability means that the model remains effective on samples it has never seen. The presence of the regularization term therefore increases the model's loss on the training set.
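
The arithmetic below is a minimal numpy sketch of such an L2 penalty; the weights, data loss, and coefficient are all made-up values, chosen only to show how the penalty enters the objective.

import numpy as np

weights = np.array([0.5, -1.2, 3.0])   # hypothetical model parameters
data_loss = 0.8                        # hypothetical cross-entropy on a batch
reg_coeff = 0.1                        # weight of the penalty term

# Total objective = data loss + coefficient * sum of squared weights
total_loss = data_loss + reg_coeff * np.sum(weights ** 2)
print(total_loss)  # 0.8 + 0.1 * (0.25 + 1.44 + 9.0) = 1.869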

The PaddlePaddle framework supports adding one uniform regularization term over all parameters, and also supports adding regularization terms to specific parameters. The former is implemented in the code below simply by setting the optimizer's regularization argument. The regularization_coeff parameter adjusts the weight of the regularization term: the larger the weight, the heavier the penalty on model complexity. A sketch of the per-layer form follows the training output below.

with fluid.dygraph.guard():
    model = MNIST("mnist")
    model.train()
    # Four optimizer choices; try each to compare their effect
    #optimizer = fluid.optimizer.SGDOptimizer(learning_rate=0.01, parameter_list=model.parameters())
    #optimizer = fluid.optimizer.MomentumOptimizer(learning_rate=0.01, momentum=0.9, parameter_list=model.parameters())
    #optimizer = fluid.optimizer.AdagradOptimizer(learning_rate=0.01, parameter_list=model.parameters())
    #optimizer = fluid.optimizer.AdamOptimizer(learning_rate=0.01, parameter_list=model.parameters())
    # Each optimizer can take a regularization term to prevent overfitting;
    # regularization_coeff adjusts the term's weight
    #optimizer = fluid.optimizer.SGDOptimizer(learning_rate=0.01, regularization=fluid.regularizer.L2Decay(regularization_coeff=0.1), parameter_list=model.parameters())
    optimizer = fluid.optimizer.AdamOptimizer(learning_rate=0.01, regularization=fluid.regularizer.L2Decay(regularization_coeff=0.1),
                                              parameter_list=model.parameters())
    EPOCH_NUM = 10
    for epoch_id in range(EPOCH_NUM):
        for batch_id, data in enumerate(train_loader()):
            # Prepare the data
            image_data, label_data = data
            image = fluid.dygraph.to_variable(image_data)
            label = fluid.dygraph.to_variable(label_data)
            # Forward pass: get both the model output and the accuracy
            predict, acc = model(image, label)
            # Compute the loss, averaged over the batch
            loss = fluid.layers.cross_entropy(predict, label)
            avg_loss = fluid.layers.mean(loss)
            # Print the current loss every 100 batches
            if batch_id % 100 == 0:
                print("epoch: {}, batch: {}, loss is: {}, acc is {}".format(epoch_id, batch_id, avg_loss.numpy(), acc.numpy()))
            # Backward pass and parameter update
            avg_loss.backward()
            optimizer.minimize(avg_loss)
            model.clear_gradients()
    # Save the model parameters
    fluid.save_dygraph(model.state_dict(), 'mnist')
epoch: 0, batch: 0, loss is: [2.983608], acc is [0.08]
epoch: 0, batch: 100, loss is: [0.31696996], acc is [0.93]
epoch: 0, batch: 200, loss is: [0.23579603], acc is [0.95]
epoch: 0, batch: 300, loss is: [0.30947688], acc is [0.94]
epoch: 0, batch: 400, loss is: [0.34581575], acc is [0.92]
epoch: 1, batch: 0, loss is: [0.36142498], acc is [0.88]
epoch: 1, batch: 100, loss is: [0.42593586], acc is [0.86]
epoch: 1, batch: 200, loss is: [0.39669245], acc is [0.86]
epoch: 1, batch: 300, loss is: [0.305471], acc is [0.95]
epoch: 1, batch: 400, loss is: [0.34134996], acc is [0.93]
epoch: 2, batch: 0, loss is: [0.23979886], acc is [0.97]
epoch: 2, batch: 100, loss is: [0.38038406], acc is [0.87]
epoch: 2, batch: 200, loss is: [0.37446725], acc is [0.9]
epoch: 2, batch: 300, loss is: [0.28386426], acc is [0.94]
epoch: 2, batch: 400, loss is: [0.25222763], acc is [0.95]
epoch: 3, batch: 0, loss is: [0.2882554], acc is [0.93]
epoch: 3, batch: 100, loss is: [0.29589745], acc is [0.93]
epoch: 3, batch: 200, loss is: [0.45023206], acc is [0.86]
epoch: 3, batch: 300, loss is: [0.32502505], acc is [0.94]
epoch: 3, batch: 400, loss is: [0.30494598], acc is [0.94]
epoch: 4, batch: 0, loss is: [0.24612927], acc is [0.95]
epoch: 4, batch: 100, loss is: [0.27296785], acc is [0.95]
epoch: 4, batch: 200, loss is: [0.28979322], acc is [0.91]
epoch: 4, batch: 300, loss is: [0.33546743], acc is [0.92]
epoch: 4, batch: 400, loss is: [0.3345272], acc is [0.93]
epoch: 5, batch: 0, loss is: [0.44861904], acc is [0.88]
epoch: 5, batch: 100, loss is: [0.34376755], acc is [0.91]
epoch: 5, batch: 200, loss is: [0.2697858], acc is [0.92]
epoch: 5, batch: 300, loss is: [0.32049623], acc is [0.94]
epoch: 5, batch: 400, loss is: [0.30303395], acc is [0.94]
epoch: 6, batch: 0, loss is: [0.33023044], acc is [0.93]
epoch: 6, batch: 100, loss is: [0.3181182], acc is [0.94]
epoch: 6, batch: 200, loss is: [0.38028607], acc is [0.92]
epoch: 6, batch: 300, loss is: [0.28176853], acc is [0.94]
epoch: 6, batch: 400, loss is: [0.48122022], acc is [0.87]
epoch: 7, batch: 0, loss is: [0.45256972], acc is [0.87]
epoch: 7, batch: 100, loss is: [0.31777906], acc is [0.91]
epoch: 7, batch: 200, loss is: [0.23670812], acc is [0.96]
epoch: 7, batch: 300, loss is: [0.33085632], acc is [0.91]
epoch: 7, batch: 400, loss is: [0.34183657], acc is [0.93]
epoch: 8, batch: 0, loss is: [0.3252389], acc is [0.9]
epoch: 8, batch: 100, loss is: [0.37204933], acc is [0.88]
epoch: 8, batch: 200, loss is: [0.24252795], acc is [0.94]
epoch: 8, batch: 300, loss is: [0.46807605], acc is [0.84]
epoch: 8, batch: 400, loss is: [0.46782267], acc is [0.84]
epoch: 9, batch: 0, loss is: [0.27448088], acc is [0.92]
epoch: 9, batch: 100, loss is: [0.36983737], acc is [0.9]
epoch: 9, batch: 200, loss is: [0.40327495], acc is [0.88]
epoch: 9, batch: 300, loss is: [0.349224], acc is [0.92]
epoch: 9, batch: 400, loss is: [0.3286616], acc is [0.89]
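
For the per-parameter case mentioned above, a minimal sketch is to attach a regularizer to one layer through fluid.ParamAttr when the layer is constructed. The layer sizes and coefficient below are illustrative, and regularizing only the fully connected layer is a design choice rather than anything fixed by the framework; in the fluid API, a regularizer set in ParamAttr takes precedence over the optimizer-level regularization for that parameter.

# Attach an L2 penalty to this layer's weights only; parameters without
# their own regularizer are left to the optimizer-level setting (if any)
fc = Linear(input_dim=980, output_dim=10, act='softmax',
            param_attr=fluid.ParamAttr(
                regularizer=fluid.regularizer.L2Decay(regularization_coeff=0.1)))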

Visual analysis

When training a model, we often need to watch its evaluation metrics and analyze how the optimization progresses, to make sure the training is effective. Two tools are available for visual analysis: the Matplotlib library and tb-paddle.

  • Matplotlib: Matplotlib is the most widely used 2D plotting library in Python. It provides a plotting interface closely modeled on MATLAB's functions, and drawing with its lightweight pyplot interface (imported as plt) is very simple.
  • tb-paddle: for a more specialized plotting tool, try tb-paddle. It can display the computation graph built while PaddlePaddle runs, the trends of various metrics over time, and the data used during training.

Plotting the loss curve during training with Matplotlib

Use the index of the training batch as the X coordinate and that batch's training loss as the Y coordinate.

  • Before training starts, declare two list variables to hold the batch indices (iters=[]) and the training losses (losses=[]).

  • As training proceeds, fill the iters and losses lists.

  • After training finishes, label the two axes:

plt.xlabel("iter", fontsize=14), plt.ylabel("loss", fontsize=14)

  • Finally, call plt.plot() with the two lists to complete the plot.
# Import the matplotlib library
import matplotlib.pyplot as plt

with fluid.dygraph.guard(place):
    model = MNIST("mnist")
    model.train()
    # Four optimizer choices; try each to compare their effect
    optimizer = fluid.optimizer.SGDOptimizer(learning_rate=0.01, parameter_list=model.parameters())
    EPOCH_NUM = 10
    iter = 0
    iters = []
    losses = []
    for epoch_id in range(EPOCH_NUM):
        for batch_id, data in enumerate(train_loader()):
            # Prepare the data
            image_data, label_data = data
            image = fluid.dygraph.to_variable(image_data)
            label = fluid.dygraph.to_variable(label_data)
            # Forward pass: get both the model output and the accuracy
            predict, acc = model(image, label)
            # Compute the loss, averaged over the batch
            loss = fluid.layers.cross_entropy(predict, label)
            avg_loss = fluid.layers.mean(loss)
            # Every 100 batches, print the current loss and record a data point
            if batch_id % 100 == 0:
                print("epoch: {}, batch: {}, loss is: {}, acc is {}".format(epoch_id, batch_id, avg_loss.numpy(), acc.numpy()))
                iters.append(iter)
                losses.append(avg_loss.numpy())
                iter = iter + 100
            # Backward pass and parameter update
            avg_loss.backward()
            optimizer.minimize(avg_loss)
            model.clear_gradients()
    # Save the model parameters
    fluid.save_dygraph(model.state_dict(), 'mnist')
epoch: 0, batch: 0, loss is: [2.4138677], acc is [0.05]
epoch: 0, batch: 100, loss is: [0.7772682], acc is [0.81]
epoch: 0, batch: 200, loss is: [0.6286463], acc is [0.84]
epoch: 0, batch: 300, loss is: [0.2760252], acc is [0.92]
epoch: 0, batch: 400, loss is: [0.30962068], acc is [0.88]
epoch: 1, batch: 0, loss is: [0.23597308], acc is [0.94]
epoch: 1, batch: 100, loss is: [0.19408596], acc is [0.94]
epoch: 1, batch: 200, loss is: [0.26218292], acc is [0.92]
epoch: 1, batch: 300, loss is: [0.17599595], acc is [0.95]
epoch: 1, batch: 400, loss is: [0.20694718], acc is [0.95]
epoch: 2, batch: 0, loss is: [0.18106407], acc is [0.94]
epoch: 2, batch: 100, loss is: [0.2439658], acc is [0.94]
epoch: 2, batch: 200, loss is: [0.17587942], acc is [0.92]
epoch: 2, batch: 300, loss is: [0.13686633], acc is [0.95]
epoch: 2, batch: 400, loss is: [0.14796607], acc is [0.96]
epoch: 3, batch: 0, loss is: [0.12020161], acc is [0.97]
epoch: 3, batch: 100, loss is: [0.09394793], acc is [0.98]
epoch: 3, batch: 200, loss is: [0.15918276], acc is [0.97]
epoch: 3, batch: 300, loss is: [0.06426619], acc is [0.97]
epoch: 3, batch: 400, loss is: [0.13216533], acc is [0.95]
epoch: 4, batch: 0, loss is: [0.17594793], acc is [0.93]
epoch: 4, batch: 100, loss is: [0.15788814], acc is [0.95]
epoch: 4, batch: 200, loss is: [0.0793974], acc is [0.98]
epoch: 4, batch: 300, loss is: [0.1386601], acc is [0.97]
epoch: 4, batch: 400, loss is: [0.20907125], acc is [0.94]
epoch: 5, batch: 0, loss is: [0.11960445], acc is [0.96]
epoch: 5, batch: 100, loss is: [0.1305021], acc is [0.95]
epoch: 5, batch: 200, loss is: [0.07436194], acc is [0.96]
epoch: 5, batch: 300, loss is: [0.06267592], acc is [0.99]
epoch: 5, batch: 400, loss is: [0.08205643], acc is [0.99]
epoch: 6, batch: 0, loss is: [0.10441803], acc is [0.98]
epoch: 6, batch: 100, loss is: [0.11585644], acc is [0.96]
epoch: 6, batch: 200, loss is: [0.10197936], acc is [0.97]
epoch: 6, batch: 300, loss is: [0.15867928], acc is [0.98]
epoch: 6, batch: 400, loss is: [0.12354293], acc is [0.95]
epoch: 7, batch: 0, loss is: [0.08421096], acc is [0.96]
epoch: 7, batch: 100, loss is: [0.04428976], acc is [0.98]
epoch: 7, batch: 200, loss is: [0.03700006], acc is [1.]
epoch: 7, batch: 300, loss is: [0.03976982], acc is [1.]
epoch: 7, batch: 400, loss is: [0.07824476], acc is [0.98]
epoch: 8, batch: 0, loss is: [0.09281676], acc is [0.96]
epoch: 8, batch: 100, loss is: [0.16775486], acc is [0.96]
epoch: 8, batch: 200, loss is: [0.03005493], acc is [0.99]
epoch: 8, batch: 300, loss is: [0.08346601], acc is [0.97]
epoch: 8, batch: 400, loss is: [0.09771224], acc is [0.96]
epoch: 9, batch: 0, loss is: [0.06451213], acc is [0.98]
epoch: 9, batch: 100, loss is: [0.05508567], acc is [0.98]
epoch: 9, batch: 200, loss is: [0.08154609], acc is [0.98]
epoch: 9, batch: 300, loss is: [0.18399465], acc is [0.95]
epoch: 9, batch: 400, loss is: [0.04534116], acc is [0.98]
# Plot how the loss changes over the course of training
plt.figure()
plt.title("train loss", fontsize=24)
plt.xlabel("iter", fontsize=14)
plt.ylabel("loss", fontsize=14)
plt.plot(iters, losses, color='red', label='train loss')
plt.grid()
plt.show()

[Handwritten Digit Recognition] Training Debugging and Optimization - Figure 6


Visual analysis with tb-paddle

tb-paddle was developed in the third-party ecosystem and is integrated with PaddlePaddle. Using it is straightforward and involves four steps.


Note:

This example cannot be run on AI Studio; please try it on a local installation of PaddlePaddle.


  • Step 1: import the tb_paddle library and define where the plot data is stored (this path is used again in step 3); in this example it is "log/data".

from tb_paddle import SummaryWriter
data_writer = SummaryWriter(logdir="log/data")

  • Step 2: insert plotting statements into the training loop. After every 100 batches, store the current loss as a new data point (a mapping from scalar_x to the loss) in the file configured in step 1. The variable scalar_x (named iter in the full listing below) counts the batches already trained and serves as the X coordinate.

data_writer.add_scalar("train/loss", avg_loss.numpy(), scalar_x)
data_writer.add_scalar("train/accuracy", avg_acc.numpy(), scalar_x)
scalar_x = scalar_x + 100
# Import the tb_paddle library and set where the plot data files are stored
from tb_paddle import SummaryWriter
data_writer = SummaryWriter(logdir="log/data")

with fluid.dygraph.guard(place):
    model = MNIST("mnist")
    model.train()
    # Four optimizer choices; try each to compare their effect
    optimizer = fluid.optimizer.SGDOptimizer(learning_rate=0.01, parameter_list=model.parameters())
    EPOCH_NUM = 10
    iter = 0
    for epoch_id in range(EPOCH_NUM):
        for batch_id, data in enumerate(train_loader()):
            # Prepare the data
            image_data, label_data = data
            image = fluid.dygraph.to_variable(image_data)
            label = fluid.dygraph.to_variable(label_data)
            # Forward pass: get both the model output and the accuracy
            predict, avg_acc = model(image, label)
            # Compute the loss, averaged over the batch
            loss = fluid.layers.cross_entropy(predict, label)
            avg_loss = fluid.layers.mean(loss)
            # Every 100 batches, print the current loss and log a data point
            if batch_id % 100 == 0:
                print("epoch: {}, batch: {}, loss is: {}, acc is {}".format(epoch_id, batch_id, avg_loss.numpy(), avg_acc.numpy()))
                data_writer.add_scalar("train/loss", avg_loss.numpy(), iter)
                data_writer.add_scalar("train/accuracy", avg_acc.numpy(), iter)
                iter = iter + 100
            # Backward pass and parameter update
            avg_loss.backward()
            optimizer.minimize(avg_loss)
            model.clear_gradients()
    # Save the model parameters
    fluid.save_dygraph(model.state_dict(), 'mnist')
epoch: 0, batch: 0, loss is: [3.5700877], acc is [0.12]
epoch: 0, batch: 100, loss is: [0.771968], acc is [0.78]
epoch: 0, batch: 200, loss is: [0.4784001], acc is [0.84]
epoch: 0, batch: 300, loss is: [0.4138908], acc is [0.86]
epoch: 0, batch: 400, loss is: [0.21223931], acc is [0.94]
epoch: 1, batch: 0, loss is: [0.21852568], acc is [0.94]
epoch: 1, batch: 100, loss is: [0.23175766], acc is [0.93]
epoch: 1, batch: 200, loss is: [0.17594911], acc is [0.94]
epoch: 1, batch: 300, loss is: [0.21336608], acc is [0.94]
epoch: 1, batch: 400, loss is: [0.22092941], acc is [0.93]
epoch: 2, batch: 0, loss is: [0.17480257], acc is [0.95]
epoch: 2, batch: 100, loss is: [0.20100929], acc is [0.96]
epoch: 2, batch: 200, loss is: [0.22063646], acc is [0.95]
epoch: 2, batch: 300, loss is: [0.20135267], acc is [0.97]
epoch: 2, batch: 400, loss is: [0.14431885], acc is [0.96]
epoch: 3, batch: 0, loss is: [0.13094752], acc is [0.96]
epoch: 3, batch: 100, loss is: [0.14637549], acc is [0.97]
epoch: 3, batch: 200, loss is: [0.1629239], acc is [0.94]
epoch: 3, batch: 300, loss is: [0.12996376], acc is [0.96]
epoch: 3, batch: 400, loss is: [0.13180453], acc is [0.97]
epoch: 4, batch: 0, loss is: [0.07111011], acc is [0.99]
epoch: 4, batch: 100, loss is: [0.14352968], acc is [0.95]
epoch: 4, batch: 200, loss is: [0.069472], acc is [0.98]
epoch: 4, batch: 300, loss is: [0.09640435], acc is [0.96]
epoch: 4, batch: 400, loss is: [0.06323731], acc is [0.99]
epoch: 5, batch: 0, loss is: [0.1092354], acc is [0.96]
epoch: 5, batch: 100, loss is: [0.17129269], acc is [0.93]
epoch: 5, batch: 200, loss is: [0.15895134], acc is [0.95]
epoch: 5, batch: 300, loss is: [0.10550124], acc is [0.97]
epoch: 5, batch: 400, loss is: [0.13810474], acc is [0.97]
epoch: 6, batch: 0, loss is: [0.08459067], acc is [0.97]
epoch: 6, batch: 100, loss is: [0.16261597], acc is [0.95]
epoch: 6, batch: 200, loss is: [0.08734676], acc is [0.99]
epoch: 6, batch: 300, loss is: [0.04732138], acc is [0.99]
epoch: 6, batch: 400, loss is: [0.05048716], acc is [1.]
epoch: 7, batch: 0, loss is: [0.07575837], acc is [0.99]
epoch: 7, batch: 100, loss is: [0.0630436], acc is [0.97]
epoch: 7, batch: 200, loss is: [0.0532419], acc is [0.98]
epoch: 7, batch: 300, loss is: [0.04099752], acc is [0.99]
epoch: 7, batch: 400, loss is: [0.04924869], acc is [0.99]
epoch: 8, batch: 0, loss is: [0.09146225], acc is [0.98]
epoch: 8, batch: 100, loss is: [0.11323112], acc is [0.97]
epoch: 8, batch: 200, loss is: [0.08802282], acc is [0.97]
epoch: 8, batch: 300, loss is: [0.09727297], acc is [0.96]
epoch: 8, batch: 400, loss is: [0.04420049], acc is [0.99]
epoch: 9, batch: 0, loss is: [0.15196942], acc is [0.95]
epoch: 9, batch: 100, loss is: [0.08296236], acc is [0.97]
epoch: 9, batch: 200, loss is: [0.0836976], acc is [0.98]
epoch: 9, batch: 300, loss is: [0.08159852], acc is [0.97]
epoch: 9, batch: 400, loss is: [0.09775886], acc is [0.96]
  • Step 3: launch TensorBoard from the command line.

Start TensorBoard with "tensorboard --logdir [path of the folder containing the data files]". Once TensorBoard is running, the command line prints the URL where the plots can be viewed in a browser.

$ tensorboard --logdir log/data

  • Step 4: open the browser and view the plots, as shown in Figure 6.

The URL to visit is printed after the launch command in step 3 (for example, "TensorBoard 2.0.0 at http://localhost:6006/"). Entering that URL in the browser's address bar produces the page shown below: besides the plot of the data points on the right, there is a control panel on the left for adjusting many details of the plot.

[Handwritten Digit Recognition] Training Debugging and Optimization - Figure 7

Figure 6: Example of a tb-paddle plot

Exercise 2-4

  • Print the outputs of every layer of the plain neural network model and inspect their contents.
  • Plot the classification accuracy metric with the plt library.
  • Use classification accuracy to judge which of several loss functions trains the better model.
  • Plot, as training proceeds, the model's loss curves on the training set and the test set.
  • Adjust the regularization weight, observe how the curves from the previous item change, and analyze why.