Training Process
The computation above described how to build the neural network and use it to produce predictions and evaluate the loss function. Next we introduce how to determine the values of the parameters $w$ and $b$; this procedure is also called the model training process. Training is one of the key elements of a deep learning model. Its goal is to make the defined loss function $L$ as small as possible, that is, to find parameter values $w$ and $b$ at which the loss function reaches a minimum.
Let us start with a small test. As shown in Figure 5, basic calculus tells us that the slope of a curve at a point equals the value of the function's derivative at that point. Now think about it: when we are at an extreme point of the curve, what is the slope there?
Figure 5: The slope of a curve equals its derivative
This question is not hard to answer: at an extreme point of the curve the slope is 0, that is, the derivative of the function at an extreme point is 0. Therefore, the $w$ and $b$ that minimize the loss function should be the solution of the following system of equations:

$$\frac{\partial L}{\partial w} = 0$$

$$\frac{\partial L}{\partial b} = 0$$
Substituting the sample data $(x, y)$ into the system above, we could in principle solve for the values of $w$ and $b$, but this approach only works for tasks as simple as linear regression. If the model contains nonlinear transformations, or the loss function is not a simple form such as mean squared error, it is very hard to solve these equations directly. To address this, we will introduce a more universally applicable numerical method: gradient descent.
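For the linear regression case specifically, setting the two partial derivatives to zero leads to the familiar least-squares normal equations, which can be solved in closed form. The following sketch illustrates that route; it assumes the feature matrix x and target y loaded earlier in this chapter, appends a column of ones so the bias b becomes the last weight, and is not part of the training pipeline used in the rest of this section.
import numpy as np
# Closed-form least-squares solution, valid for linear regression only.
X_aug = np.hstack([x, np.ones((x.shape[0], 1))])   # absorb b as the last column
# Solve the normal equations X^T X theta = X^T y; lstsq is numerically safer
# than forming the matrix inverse explicitly.
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_closed_form, b_closed_form = theta[:-1], theta[-1]
print('closed-form w shape', w_closed_form.shape, ', b', b_closed_form)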
Gradient Descent
In the real world there are many functions that are easy to evaluate in the forward direction but hard to invert; these are called one-way functions and are widely used in cryptography. A combination lock, for instance, can quickly verify whether a key is correct (given $x$, computing $y$ is easy), yet even with access to the lock mechanism it is infeasible to recover the correct key (given $y$, finding $x$ is hard).
This situation is much like a blind person who wants to walk from a mountain peak down into the valley. He cannot see where the valley is (he cannot solve in reverse for the parameter values at which the derivative of $Loss$ equals 0), but he can probe the slope around him with his foot (the derivative at the current point, also called the gradient). Minimizing the Loss function can therefore be done by "starting from the current parameter values and stepping downhill, step by step, until reaching the lowest point". The author privately calls this the "blind person walking downhill" method. Or, by its more formal name, "gradient descent".
The key to training is to find a set of $(w, b)$ at which the loss function $L$ reaches a minimum. To build some intuition about how to search for the solution, we first look at the simple case where the loss function $L$ varies with only two parameters, $w_5$ and $w_9$:

$$L = L(w_5, w_9)$$

Here we fix all the parameters in $w_0, w_1, \ldots, w_{12}$ other than $w_5$ and $w_9$, together with $b$, so that $L(w_5, w_9)$ can be plotted as a surface.
net = Network(13)
losses = []
# Only plot the part of the surface with w5 and w9 in the range [-160, 160], which contains the minimum of the loss function
w5 = np.arange(-160.0, 160.0, 1.0)
w9 = np.arange(-160.0, 160.0, 1.0)
losses = np.zeros([len(w5), len(w9)])
# Compute the Loss for every parameter value in the chosen region
for i in range(len(w5)):
for j in range(len(w9)):
net.w[5] = w5[i]
net.w[9] = w9[j]
z = net.forward(x)
loss = net.loss(z, y)
losses[i, j] = loss
# Use matplotlib to draw a 3D surface of the Loss over the two parameters
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = Axes3D(fig)
w5, w9 = np.meshgrid(w5, w9)
ax.plot_surface(w5, w9, losses, rstride=1, cstride=1, cmap='rainbow')
plt.show()
- <Figure size 432x288 with 1 Axes>
For this simple case, the program above lets us draw, in three-dimensional space, the surface of the loss function as the two parameters vary. The plot shows that in some regions the function values are clearly smaller than at the surrounding points.
A note on why we chose $w_5$ and $w_9$ for the plot: with these two parameters, the existence of an extreme point can be seen fairly intuitively on the loss surface. With other parameter combinations, the extreme points of the loss function are much harder to observe from the plot.
Notice that the surface above shows a "smooth" slope; this is one of the reasons we chose mean squared error as the loss function. Figure 6 shows the loss curves of mean squared error and absolute error (where each sample's error is simply accumulated, without squaring) when there is only one parameter dimension.
Figure 6: Loss curves of mean squared error and absolute error
As can be seen, the "smooth" slope exhibited by mean squared error has two benefits:
- The lowest point of the curve is differentiable.
- The closer we get to the lowest point, the gentler the slope becomes, which helps us judge from the current gradient how close we are to the minimum (and whether to gradually shrink the step size so as not to overshoot it).
The absolute error has neither of these properties, which is why the design of a loss function should pursue not only "reasonableness" but also "solvability".
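As a rough, self-contained illustration of Figure 6, the sketch below plots the mean squared error and the absolute error of a toy one-parameter model z = w * x against a small made-up dataset; the toy data is invented purely for the picture and has nothing to do with the housing data.
import numpy as np
import matplotlib.pyplot as plt
# Toy 1-D data, used only to visualize the shapes of the two losses.
x_toy = np.linspace(-1.0, 1.0, 50)
y_toy = 3.0 * x_toy
w_range = np.linspace(-5.0, 11.0, 200)
mse = [np.mean((w * x_toy - y_toy) ** 2) for w in w_range]
mae = [np.mean(np.abs(w * x_toy - y_toy)) for w in w_range]
plt.plot(w_range, mse, label='mean squared error')
plt.plot(w_range, mae, label='absolute error')
plt.xlabel('w')
plt.ylabel('loss')
plt.legend()
plt.show()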
Now we want to find a pair of values $[w_5, w_9]$ that minimizes the loss function. The gradient descent scheme is as follows:
Step 1: Randomly pick a set of initial values, for example $[w_5, w_9] = [-100.0, -100.0]$.
Step 2: Pick the next point $[w_5', w_9']$ such that $L(w_5', w_9') < L(w_5, w_9)$.
Step 3: Repeat Step 2 until the loss function barely decreases any more.
How to choose $[w_5', w_9']$ is crucial: first, $L$ must decrease; second, it should decrease as fast as possible. Basic calculus tells us that the direction opposite to the gradient is the direction in which the function value drops fastest, as shown in Figure 7. Put simply, the gradient at a point is the direction of steepest ascent; since the gradient points uphill, the fastest way down is the direction opposite to the gradient.
Figure 7: Illustration of the gradient descent direction
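To make the "step against the gradient" idea concrete before deriving the full gradient, here is a minimal one-parameter sketch: repeatedly stepping opposite to the derivative of the made-up quadratic loss L(w) = (w - 3)^2 drives w toward its minimum at 3. The function, starting point and step count are chosen only for illustration.
# Minimal 1-D gradient descent on L(w) = (w - 3)^2, whose derivative is 2*(w - 3).
w = -10.0      # arbitrary starting point
eta = 0.1      # step size (learning rate)
for step in range(50):
    grad = 2.0 * (w - 3.0)   # derivative of the loss at the current w
    w = w - eta * grad       # move against the gradient
print('w after 50 steps:', w)  # close to 3, the minimizer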
Computing the Gradient
We described above how the loss function is computed; here we rewrite it slightly. To make the gradient computation cleaner, we introduce a factor of $\frac{1}{2}$ and define the loss function as:

$$L = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{2}\left(y_i - z_i\right)^2$$

where $z_i$ is the network's prediction for the $i$-th sample:

$$z_i = \sum_{j=0}^{12} x_i^{j} \cdot w_j + b$$

The gradient is defined as:

$$\text{gradient} = \left(\frac{\partial L}{\partial w_0}, \frac{\partial L}{\partial w_1}, \ldots, \frac{\partial L}{\partial w_{12}}, \frac{\partial L}{\partial b}\right)$$

We can compute the partial derivatives of $L$ with respect to $w$ and $b$:

$$\frac{\partial L}{\partial w_j} = \frac{1}{N}\sum_{i=1}^{N}\left(z_i - y_i\right)\frac{\partial z_i}{\partial w_j} = \frac{1}{N}\sum_{i=1}^{N}\left(z_i - y_i\right)x_i^{j}$$

$$\frac{\partial L}{\partial b} = \frac{1}{N}\sum_{i=1}^{N}\left(z_i - y_i\right)\frac{\partial z_i}{\partial b} = \frac{1}{N}\sum_{i=1}^{N}\left(z_i - y_i\right)$$

As the derivation shows, the factor $\frac{1}{2}$ is cancelled out, because differentiating the quadratic term produces a factor of $2$; this is exactly why we rewrote the loss function.
Now consider the case of a single sample and compute the gradient:

$$L = \frac{1}{2}\left(y_1 - z_1\right)^2$$

$$z_1 = x_1^{0}\cdot w_0 + x_1^{1}\cdot w_1 + \ldots + x_1^{12}\cdot w_{12} + b$$

Substituting, we obtain:

$$L = \frac{1}{2}\left(x_1^{0}\cdot w_0 + x_1^{1}\cdot w_1 + \ldots + x_1^{12}\cdot w_{12} + b - y_1\right)^2$$

from which the partial derivatives of $L$ with respect to $w$ and $b$ are:

$$\frac{\partial L}{\partial w_0} = \left(x_1^{0}\cdot w_0 + x_1^{1}\cdot w_1 + \ldots + x_1^{12}\cdot w_{12} + b - y_1\right)\cdot x_1^{0} = \left(z_1 - y_1\right)\cdot x_1^{0}$$

$$\frac{\partial L}{\partial b} = \left(x_1^{0}\cdot w_0 + x_1^{1}\cdot w_1 + \ldots + x_1^{12}\cdot w_{12} + b - y_1\right)\cdot 1 = \left(z_1 - y_1\right)$$

We can inspect the data and shape of each variable with a small piece of code.
x1 = x[0]
y1 = y[0]
z1 = net.forward(x1)
print('x1 {}, shape {}'.format(x1, x1.shape))
print('y1 {}, shape {}'.format(y1, y1.shape))
print('z1 {}, shape {}'.format(z1, z1.shape))
- x1 [-0.02146321 0.03767327 -0.28552309 -0.08663366 0.01289726 0.04634817
- 0.00795597 -0.00765794 -0.25172191 -0.11881188 -0.29002528 0.0519112
- -0.17590923], shape (13,)
- y1 [-0.00390539], shape (1,)
- z1 [-12.05947643], shape (1,)
Following the formulas above, when there is only one sample we can compute the gradient of a particular $w_j$, for example $w_0$.
gradient_w0 = (z1 - y1) * x1[0]
print('gradient_w0 {}'.format(gradient_w0))
- gradient_w0 [0.25875126]
Similarly, we can compute the gradient of $w_1$.
gradient_w1 = (z1 - y1) * x1[1]
print('gradient_w1 {}'.format(gradient_w1))
- gradient_w1 [-0.45417275]
And likewise the gradient of $w_2$.
gradient_w2 = (z1 - y1) * x1[2]
print('gradient_w2 {}'.format(gradient_w2))
- gradient_w2 [3.44214394]
A perceptive reader may have realized that a for loop is all it takes to compute the gradients of every weight from $w_0$ to $w_{12}$; you can implement this yourself, and a minimal sketch follows.
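A minimal sketch of the loop left to the reader, reusing the single-sample quantities z1, y1 and x1 from the cells above; the list name gradient_w_list is just an illustrative choice.
# Gradient of each of the 13 weights for sample 1, computed one at a time.
gradient_w_list = []
for j in range(13):
    gradient_wj = (z1 - y1) * x1[j]
    gradient_w_list.append(gradient_wj)
print(gradient_w_list[0], gradient_w_list[12])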
Computing Gradients with NumPy
Thanks to NumPy's broadcasting mechanism (vectors and matrices can be operated on as if they were single scalars), the gradient can be computed much faster. In the gradient code we can simply write (z1 - y1) * x1, which yields a 13-dimensional vector whose components are the gradients along each dimension.
gradient_w = (z1 - y1) * x1
print('gradient_w_by_sample1 {}, gradient.shape {}'.format(gradient_w, gradient_w.shape))
- gradient_w_by_sample1 [ 0.25875126 -0.45417275 3.44214394 1.04441828 -0.15548386 -0.55875363
- -0.09591377 0.09232085 3.03465138 1.43234507 3.49642036 -0.62581917
- 2.12068622], gradient.shape (13,)
The input data contains multiple samples, and every sample contributes to the gradient. The code above computed the gradient contribution of sample 1 alone; the same computation gives the contributions of sample 2 and sample 3.
x2 = x[1]
y2 = y[1]
z2 = net.forward(x2)
gradient_w = (z2 - y2) * x2
print('gradient_w_by_sample2 {}, gradient.shape {}'.format(gradient_w, gradient_w.shape))
- gradient_w_by_sample2 [ 0.7329239 4.91417754 3.33394253 2.9912385 4.45673435 -0.58146277
- -5.14623287 -2.4894594 7.19011988 7.99471607 0.83100061 -1.79236081
- 2.11028056], gradient.shape (13,)
x3 = x[2]
y3 = y[2]
z3 = net.forward(x3)
gradient_w = (z3 - y3) * x3
print('gradient_w_by_sample3 {}, gradient.shape {}'.format(gradient_w, gradient_w.shape))
- gradient_w_by_sample3 [ 0.25138584 1.68549775 1.14349809 1.02595515 1.5286008 -1.93302947
- 0.4058236 -0.85385157 2.46611579 2.74208162 0.28502219 -0.46695229
- 2.39363651], gradient.shape (13,)
Some readers may again think of using a for loop to compute each sample's contribution to the gradient and then averaging. But there is no need: we can still rely on NumPy's matrix operations to simplify the computation, as in the following 3-sample case.
# Note: this takes out the first 3 samples at once, not the 3rd sample
x3samples = x[0:3]
y3samples = y[0:3]
z3samples = net.forward(x3samples)
print('x {}, shape {}'.format(x3samples, x3samples.shape))
print('y {}, shape {}'.format(y3samples, y3samples.shape))
print('z {}, shape {}'.format(z3samples, z3samples.shape))
- x [[-0.02146321 0.03767327 -0.28552309 -0.08663366 0.01289726 0.04634817
- 0.00795597 -0.00765794 -0.25172191 -0.11881188 -0.29002528 0.0519112
- -0.17590923]
- [-0.02122729 -0.14232673 -0.09655922 -0.08663366 -0.12907805 0.0168406
- 0.14904763 0.0721009 -0.20824365 -0.23154675 -0.02406783 0.0519112
- -0.06111894]
- [-0.02122751 -0.14232673 -0.09655922 -0.08663366 -0.12907805 0.1632288
- -0.03426854 0.0721009 -0.20824365 -0.23154675 -0.02406783 0.03943037
- -0.20212336]], shape (3, 13)
- y [[-0.00390539]
- [-0.05723872]
- [ 0.23387239]], shape (3, 1)
- z [[-12.05947643]
- [-34.58467747]
- [-11.60858134]], shape (3, 1)
The first dimension of x3samples, y3samples and z3samples above is 3, meaning there are 3 samples. Next we compute the gradient contributions of these 3 samples.
gradient_w = (z3samples - y3samples) * x3samples
print('gradient_w {}, gradient.shape {}'.format(gradient_w, gradient_w.shape))
- gradient_w [[ 0.25875126 -0.45417275 3.44214394 1.04441828 -0.15548386 -0.55875363
- -0.09591377 0.09232085 3.03465138 1.43234507 3.49642036 -0.62581917
- 2.12068622]
- [ 0.7329239 4.91417754 3.33394253 2.9912385 4.45673435 -0.58146277
- -5.14623287 -2.4894594 7.19011988 7.99471607 0.83100061 -1.79236081
- 2.11028056]
- [ 0.25138584 1.68549775 1.14349809 1.02595515 1.5286008 -1.93302947
- 0.4058236 -0.85385157 2.46611579 2.74208162 0.28502219 -0.46695229
- 2.39363651]], gradient.shape (3, 13)
Here the computed gradient gradient_w has shape $3 \times 13$: its first row matches gradient_w_by_sample1 computed from the first sample above, its second row matches gradient_w_by_sample2 from the second sample, and its third row matches gradient_w_by_sample3 from the third sample. Using matrix operations makes it convenient to compute each of the 3 samples' contributions to the gradient at once.
For the case of N samples, then, we can compute the gradient contributions of all samples directly in the same way; this is the convenience brought by NumPy's broadcasting. To summarize how broadcasting is used here:
- On one hand, it expands along the parameter dimension, replacing a for loop over the gradients of all parameters from w0 to w12 for a single sample.
- On the other hand, it expands along the sample dimension, replacing a for loop over samples 0 through 403 when computing their gradient contributions.
z = net.forward(x)
gradient_w = (z - y) * x
print('gradient_w shape {}'.format(gradient_w.shape))
print(gradient_w)
- gradient_w shape (404, 13)
- [[ 0.25875126 -0.45417275 3.44214394 ... 3.49642036 -0.62581917
- 2.12068622]
- [ 0.7329239 4.91417754 3.33394253 ... 0.83100061 -1.79236081
- 2.11028056]
- [ 0.25138584 1.68549775 1.14349809 ... 0.28502219 -0.46695229
- 2.39363651]
- ...
- [ 14.70025543 -15.10890735 36.23258734 ... 24.54882966 5.51071122
- 26.26098922]
- [ 9.29832217 -15.33146159 36.76629344 ... 24.91043398 -1.27564923
- 26.61808955]
- [ 19.55115919 -10.8177237 25.94192351 ... 17.5765494 3.94557661
- 17.64891012]]
Each row of gradient_w above is one sample's contribution to the gradient. According to the gradient formula, the total gradient is the average of the per-sample contributions.
We can also use NumPy's mean function to do this:
# axis = 0 means: sum across the rows and then divide by the total number of rows
gradient_w = np.mean(gradient_w, axis=0)
print('gradient_w ', gradient_w.shape)
print('w ', net.w.shape)
print(gradient_w)
print(net.w)
- gradient_w (13,)
- w (13, 1)
- [ 1.59697064 -0.92928123 4.72726926 1.65712204 4.96176389 1.18068454
- 4.55846519 -3.37770889 9.57465893 10.29870662 1.3900257 -0.30152215
- 1.09276043]
- [[ 1.76405235e+00]
- [ 4.00157208e-01]
- [ 9.78737984e-01]
- [ 2.24089320e+00]
- [ 1.86755799e+00]
- [ 1.59000000e+02]
- [ 9.50088418e-01]
- [-1.51357208e-01]
- [-1.03218852e-01]
- [ 1.59000000e+02]
- [ 1.44043571e-01]
- [ 1.45427351e+00]
- [ 7.61037725e-01]]
We have conveniently computed the gradient with NumPy's matrix operations, but this introduces a problem: gradient_w has shape (13,), whereas w has shape (13, 1). The reason is that np.mean eliminated dimension 0. To make addition, subtraction, multiplication and division convenient, gradient_w and w must keep consistent shapes. We therefore reshape gradient_w to (13, 1) as well, with the following code:
gradient_w = gradient_w[:, np.newaxis]
print('gradient_w shape', gradient_w.shape)
- gradient_w shape (13, 1)
Putting the discussion above together, the code that computes the gradient is as follows.
z = net.forward(x)
gradient_w = (z - y) * x
gradient_w = np.mean(gradient_w, axis=0)
gradient_w = gradient_w[:, np.newaxis]
gradient_w
- array([[ 1.59697064],
- [-0.92928123],
- [ 4.72726926],
- [ 1.65712204],
- [ 4.96176389],
- [ 1.18068454],
- [ 4.55846519],
- [-3.37770889],
- [ 9.57465893],
- [10.29870662],
- [ 1.3900257 ],
- [-0.30152215],
- [ 1.09276043]])
The code above completes the gradient computation for $w$ very concisely. The code that computes the gradient of $b$ works on the same principle.
gradient_b = (z - y)
gradient_b = np.mean(gradient_b)
# b is a single number here, so np.mean directly gives a scalar
gradient_b
- -1.0918438870293816e-13
We now wrap the above computation of the gradients of $w$ and $b$ into a gradient function of the Network class, as shown below.
class Network(object):
def __init__(self, num_of_weights):
        # Randomly initialize w
        # A fixed random seed keeps the results consistent across runs
np.random.seed(0)
self.w = np.random.randn(num_of_weights, 1)
self.b = 0.
def forward(self, x):
z = np.dot(x, self.w) + self.b
return z
def loss(self, z, y):
error = z - y
num_samples = error.shape[0]
cost = error * error
cost = np.sum(cost) / num_samples
return cost
def gradient(self, x, y):
z = self.forward(x)
gradient_w = (z-y)*x
gradient_w = np.mean(gradient_w, axis=0)
gradient_w = gradient_w[:, np.newaxis]
gradient_b = (z - y)
gradient_b = np.mean(gradient_b)
return gradient_w, gradient_b
# Call the gradient function defined above to compute the gradient
# Initialize the network
net = Network(13)
# Set [w5, w9] = [-100., -100.]
net.w[5] = -100.0
net.w[9] = -100.0
z = net.forward(x)
loss = net.loss(z, y)
gradient_w, gradient_b = net.gradient(x, y)
gradient_w5 = gradient_w[5][0]
gradient_w9 = gradient_w[9][0]
print('point {}, loss {}'.format([net.w[5][0], net.w[9][0]], loss))
print('gradient {}'.format([gradient_w5, gradient_w9]))
- point [-100.0, -100.0], loss 686.3005008179159
- gradient [-0.850073323995813, -6.138412364807849]
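As a sanity check we add here (not part of the original flow), the analytic gradient_w5 can be compared against a numerical finite-difference estimate obtained by nudging w5 and re-evaluating the loss. Note that net.loss() does not include the factor 1/2 used in the gradient derivation, so the numerical estimate is halved before comparison.
# Finite-difference check of the analytic gradient for w5.
eps = 1e-4
w5_backup = net.w[5].copy()
net.w[5] = w5_backup + eps
loss_plus = net.loss(net.forward(x), y)
net.w[5] = w5_backup - eps
loss_minus = net.loss(net.forward(x), y)
net.w[5] = w5_backup  # restore the original value
# Halve the estimate because net.loss() omits the 1/2 factor of the derivation.
numeric_grad_w5 = (loss_plus - loss_minus) / (2 * eps) / 2.0
print('analytic {:.6f}, numeric {:.6f}'.format(gradient_w5, numeric_grad_w5))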
Finding a Point Where the Loss Is Smaller
Next we study how to use the gradient to update the parameters. First we move a small step in the direction opposite to the gradient to reach the next point P1, and observe how the loss function changes.
# In the [w5, w9] plane, move against the gradient to the next point P1
# Define the step size eta
eta = 0.1
# Update the parameters w5 and w9
net.w[5] = net.w[5] - eta * gradient_w5
net.w[9] = net.w[9] - eta * gradient_w9
# Recompute z and loss
z = net.forward(x)
loss = net.loss(z, y)
gradient_w, gradient_b = net.gradient(x, y)
gradient_w5 = gradient_w[5][0]
gradient_w9 = gradient_w[9][0]
print('point {}, loss {}'.format([net.w[5][0], net.w[9][0]], loss))
print('gradient {}'.format([gradient_w5, gradient_w9]))
- point [-99.91499266760042, -99.38615876351922], loss 678.6472185028845
- gradient [-0.8556356178645292, -6.0932268634065805]
Running the code above shows that after one small step against the gradient, the loss at the next point has indeed decreased. If you are interested, try running the code block above repeatedly and observe whether the loss keeps shrinking.
In the code above, each update uses the statement: net.w[5] = net.w[5] - eta * gradient_w5
- The subtraction: the parameter needs to move in the direction opposite to the gradient.
- eta: controls how far the parameter moves along the negative gradient direction in each update, i.e. the step size, also known as the learning rate.
You may wonder why we normalized the input features earlier so that they share the same scale. It is so that a single, uniform step size works well.
As shown in Figure 8, when the input features are normalized, the Loss produced by the different parameters forms a fairly regular surface and the learning rate can be set to one uniform value; when the features are not normalized, the parameters of differently scaled features need different step sizes (larger-scale parameters need large steps, smaller-scale parameters need small steps), so no single learning rate fits all of them.
Figure 8: Unnormalized features lead to different ideal step sizes for different feature dimensions
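To get a feel for how the step size eta affects training, the sketch below reuses the current Network class (the one defining gradient above) and runs the same w5/w9-only update loop with three different eta values; the particular values 0.001, 0.01 and 0.1 and the 200-iteration budget are arbitrary choices for illustration, and on this sub-problem the larger values should drive the loss down faster.
# Compare a few learning rates on the (w5, w9) sub-problem.
for eta in [0.001, 0.01, 0.1]:
    net = Network(13)
    net.w[5], net.w[9] = -100.0, -100.0
    for i in range(200):
        gradient_w, gradient_b = net.gradient(x, y)
        net.w[5] = net.w[5] - eta * gradient_w[5][0]
        net.w[9] = net.w[9] - eta * gradient_w[9][0]
    loss = net.loss(net.forward(x), y)
    print('eta {}: loss after 200 iterations = {:.4f}'.format(eta, loss))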
Wrapping the Code into a train Function
We wrap the iterative computation above into train and update functions, as shown below.
class Network(object):
    def __init__(self, num_of_weights):
        # Randomly initialize w
        # A fixed random seed keeps the results consistent across runs
        np.random.seed(0)
        self.w = np.random.randn(num_of_weights, 1)
        self.w[5] = -100.
        self.w[9] = -100.
        self.b = 0.
    def forward(self, x):
        z = np.dot(x, self.w) + self.b
        return z
    def loss(self, z, y):
        error = z - y
        num_samples = error.shape[0]
        cost = error * error
        cost = np.sum(cost) / num_samples
        return cost
    def gradient(self, x, y):
        z = self.forward(x)
        gradient_w = (z-y)*x
        gradient_w = np.mean(gradient_w, axis=0)
        gradient_w = gradient_w[:, np.newaxis]
        gradient_b = (z - y)
        gradient_b = np.mean(gradient_b)
        return gradient_w, gradient_b
    def update(self, gradient_w5, gradient_w9, eta=0.01):
        # In this version only w5 and w9 are updated
        self.w[5] = self.w[5] - eta * gradient_w5
        self.w[9] = self.w[9] - eta * gradient_w9
    def train(self, x, y, iterations=100, eta=0.01):
        points = []
        losses = []
        for i in range(iterations):
            points.append([self.w[5][0], self.w[9][0]])
            z = self.forward(x)
            L = self.loss(z, y)
            gradient_w, gradient_b = self.gradient(x, y)
            gradient_w5 = gradient_w[5][0]
            gradient_w9 = gradient_w[9][0]
            self.update(gradient_w5, gradient_w9, eta)
            losses.append(L)
            if i % 50 == 0:
                print('iter {}, point {}, loss {}'.format(i, [self.w[5][0], self.w[9][0]], L))
        return points, losses
# Load the data
train_data, test_data = load_data()
x = train_data[:, :-1]
y = train_data[:, -1:]
# Create the network
net = Network(13)
num_iterations=2000
# Start training
points, losses = net.train(x, y, iterations=num_iterations, eta=0.01)
# Plot how the loss changes during training
plot_x = np.arange(num_iterations)
plot_y = np.array(losses)
plt.plot(plot_x, plot_y)
plt.show()
- iter 0, point [-99.99144364382136, -99.93861587635192], loss 686.3005008179159
- iter 50, point [-99.56362583488914, -96.92631128470325], loss 649.221346830939
- iter 100, point [-99.13580802595692, -94.02279509580971], loss 614.6970095624063
- iter 150, point [-98.7079902170247, -91.22404911807594], loss 582.543755023494
- iter 200, point [-98.28017240809248, -88.52620357520894], loss 552.5911329872217
- iter 250, point [-97.85235459916026, -85.9255316243737], loss 524.6810152322887
- iter 300, point [-97.42453679022805, -83.41844407682491], loss 498.6667034691001
- iter 350, point [-96.99671898129583, -81.00148431353688], loss 474.4121018974464
- iter 400, point [-96.56890117236361, -78.67132338862874], loss 451.7909497114133
- iter 450, point [-96.14108336343139, -76.42475531364933], loss 430.68610920670284
- iter 500, point [-95.71326555449917, -74.25869251604028], loss 410.988905460488
- iter 550, point [-95.28544774556696, -72.17016146534513], loss 392.5985138460824
- iter 600, point [-94.85762993663474, -70.15629846096763], loss 375.4213919156372
- iter 650, point [-94.42981212770252, -68.21434557551346], loss 359.3707524354014
- iter 700, point [-94.0019943187703, -66.34164674796719], loss 344.36607459115214
- iter 750, point [-93.57417650983808, -64.53564402117185], loss 330.33265059761464
- iter 800, point [-93.14635870090586, -62.793873918279786], loss 317.2011651461846
- iter 850, point [-92.71854089197365, -61.11396395304264], loss 304.907305311265
- iter 900, point [-92.29072308304143, -59.49362926899678], loss 293.3913987080144
- iter 950, point [-91.86290527410921, -57.930669402782904], loss 282.5980778542974
- iter 1000, point [-91.43508746517699, -56.4229651670156], loss 272.47596883802515
- iter 1050, point [-91.00726965624477, -54.968475648286564], loss 262.9774025287022
- iter 1100, point [-90.57945184731255, -53.56523531604897], loss 254.05814669965383
- iter 1150, point [-90.15163403838034, -52.21135123828792], loss 245.67715754581488
- iter 1200, point [-89.72381622944812, -50.90500040003218], loss 237.796349191773
- iter 1250, point [-89.2959984205159, -49.6444271209092], loss 230.3803798866218
- iter 1300, point [-88.86818061158368, -48.42794056808474], loss 223.3964536766492
- iter 1350, point [-88.44036280265146, -47.2539123610643], loss 216.81413643451378
- iter 1400, point [-88.01254499371925, -46.12077426496303], loss 210.60518520483126
- iter 1450, point [-87.58472718478703, -45.027015968976976], loss 204.74338990147896
- iter 1500, point [-87.15690937585481, -43.9711829469081], loss 199.20442646183588
- iter 1550, point [-86.72909156692259, -42.95187439671279], loss 193.96572062803054
- iter 1600, point [-86.30127375799037, -41.96774125615467], loss 189.00632158541163
- iter 1650, point [-85.87345594905815, -41.017484291751295], loss 184.3067847442463
- iter 1700, point [-85.44563814012594, -40.0998522583068], loss 179.84906300239203
- iter 1750, point [-85.01782033119372, -39.21364012642417], loss 175.61640587468244
- iter 1800, point [-84.5900025222615, -38.35768737548557], loss 171.59326591927967
- iter 1850, point [-84.16218471332928, -37.530876349682856], loss 167.76521193253296
- iter 1900, point [-83.73436690439706, -36.73213067476985], loss 164.11884842217898
- iter 1950, point [-83.30654909546485, -35.96041373329276], loss 160.64174090423475
- <Figure size 432x288 with 1 Axes>
Extending Training to All Parameters
To give readers an intuitive feel, the gradient descent demonstrated above involved only the two parameters $w_5$ and $w_9$. The complete house price prediction model, however, must solve for all the parameters $w$ and $b$, which requires modifying the update and train functions of Network. Because we no longer restrict which parameters take part in the computation (all of them do), the modified code is actually simpler. The logic is to repeat four steps until the parameters reach the optimum: run the forward pass to compute the output, compute the Loss from the output and the ground truth, compute the gradient from the Loss and the input, and update the parameter values with the gradient. The code is shown below.
class Network(object):
def __init__(self, num_of_weights):
        # Randomly initialize w
        # A fixed random seed keeps the results consistent across runs
np.random.seed(0)
self.w = np.random.randn(num_of_weights, 1)
self.b = 0.
def forward(self, x):
z = np.dot(x, self.w) + self.b
return z
def loss(self, z, y):
error = z - y
num_samples = error.shape[0]
cost = error * error
cost = np.sum(cost) / num_samples
return cost
def gradient(self, x, y):
z = self.forward(x)
gradient_w = (z-y)*x
gradient_w = np.mean(gradient_w, axis=0)
gradient_w = gradient_w[:, np.newaxis]
gradient_b = (z - y)
gradient_b = np.mean(gradient_b)
return gradient_w, gradient_b
def update(self, gradient_w, gradient_b, eta = 0.01):
self.w = self.w - eta * gradient_w
self.b = self.b - eta * gradient_b
def train(self, x, y, iterations=100, eta=0.01):
losses = []
for i in range(iterations):
z = self.forward(x)
L = self.loss(z, y)
gradient_w, gradient_b = self.gradient(x, y)
self.update(gradient_w, gradient_b, eta)
losses.append(L)
if (i+1) % 10 == 0:
print('iter {}, loss {}'.format(i, L))
return losses
# Load the data
train_data, test_data = load_data()
x = train_data[:, :-1]
y = train_data[:, -1:]
# Create the network
net = Network(13)
num_iterations=1000
# Start training
losses = net.train(x,y, iterations=num_iterations, eta=0.01)
# Plot how the loss changes during training
plot_x = np.arange(num_iterations)
plot_y = np.array(losses)
plt.plot(plot_x, plot_y)
plt.show()
- iter 9, loss 1.8984947314576224
- iter 19, loss 1.8031783384598725
- iter 29, loss 1.7135517565541092
- iter 39, loss 1.6292649416831264
- iter 49, loss 1.5499895293373231
- iter 59, loss 1.4754174896452612
- iter 69, loss 1.4052598659324693
- iter 79, loss 1.3392455915676864
- iter 89, loss 1.2771203802372915
- iter 99, loss 1.218645685090292
- iter 109, loss 1.1635977224791534
- iter 119, loss 1.111766556287068
- iter 129, loss 1.0629552390811503
- iter 139, loss 1.0169790065644477
- iter 149, loss 0.9736645220185994
- iter 159, loss 0.9328491676343147
- iter 169, loss 0.8943803798194307
- iter 179, loss 0.8581150257549611
- iter 189, loss 0.8239188186389669
- iter 199, loss 0.7916657692169988
- iter 209, loss 0.761237671346902
- iter 219, loss 0.7325236194855752
- iter 229, loss 0.7054195561163928
- iter 239, loss 0.6798278472589763
- iter 249, loss 0.6556568843183528
- iter 259, loss 0.6328207106387195
- iter 269, loss 0.6112386712285092
- iter 279, loss 0.59083508421862
- iter 289, loss 0.5715389327049418
- iter 299, loss 0.5532835757100347
- iter 309, loss 0.5360064770773406
- iter 319, loss 0.5196489511849665
- iter 329, loss 0.5041559244351539
- iter 339, loss 0.48947571154034963
- iter 349, loss 0.47555980568755696
- iter 359, loss 0.46236268171965056
- iter 369, loss 0.44984161152579916
- iter 379, loss 0.43795649088328303
- iter 389, loss 0.4266696770400226
- iter 399, loss 0.41594583637124666
- iter 409, loss 0.4057518014851036
- iter 419, loss 0.3960564371908221
- iter 429, loss 0.38683051477942226
- iter 439, loss 0.3780465941011246
- iter 449, loss 0.3696789129556087
- iter 459, loss 0.3617032833413179
- iter 469, loss 0.3540969941381648
- iter 479, loss 0.3468387198244131
- iter 489, loss 0.3399084348532937
- iter 499, loss 0.33328733333814486
- iter 509, loss 0.3269577537166779
- iter 519, loss 0.32090310808539985
- iter 529, loss 0.3151078159144129
- iter 539, loss 0.30955724187078903
- iter 549, loss 0.3042376374955925
- iter 559, loss 0.2991360864954391
- iter 569, loss 0.2942404534243286
- iter 579, loss 0.2895393355454012
- iter 589, loss 0.28502201767532415
- iter 599, loss 0.28067842982626157
- iter 609, loss 0.27649910747186535
- iter 619, loss 0.2724751542744919
- iter 629, loss 0.2685982071209627
- iter 639, loss 0.26486040332365085
- iter 649, loss 0.2612543498525749
- iter 659, loss 0.2577730944725093
- iter 669, loss 0.2544100986669443
- iter 679, loss 0.2511592122380609
- iter 689, loss 0.2480146494787638
- iter 699, loss 0.24497096681926708
- iter 709, loss 0.2420230418567802
- iter 719, loss 0.23916605368251415
- iter 729, loss 0.23639546442555454
- iter 739, loss 0.23370700193813704
- iter 749, loss 0.2310966435515475
- iter 759, loss 0.2285606008362593
- iter 769, loss 0.22609530530403904
- iter 779, loss 0.22369739499361888
- iter 789, loss 0.2213637018851542
- iter 799, loss 0.21909124009208833
- iter 809, loss 0.21687719478222933
- iter 819, loss 0.21471891178284025
- iter 829, loss 0.21261388782734392
- iter 839, loss 0.2105597614038757
- iter 849, loss 0.20855430416838638
- iter 859, loss 0.20659541288730932
- iter 869, loss 0.20468110187697833
- iter 879, loss 0.2028094959090178
- iter 889, loss 0.20097882355283644
- iter 899, loss 0.19918741092814596
- iter 909, loss 0.19743367584210875
- iter 919, loss 0.1957161222872899
- iter 929, loss 0.19403333527807176
- iter 939, loss 0.19238397600456975
- iter 949, loss 0.19076677728439412
- iter 959, loss 0.1891805392938162
- iter 969, loss 0.18762412556104593
- iter 979, loss 0.18609645920539716
- iter 989, loss 0.18459651940712488
- iter 999, loss 0.18312333809366155
- <Figure size 432x288 with 1 Axes>
Stochastic Gradient Descent (SGD)
In the program above, every computation of the loss and the gradient is based on the full dataset. For the Boston house price prediction task the dataset is small, only 404 samples. In real problems, however, datasets are often very large, and using the full data for every computation is very inefficient; colloquially, it is "using a sledgehammer to crack a nut". Since the parameters are updated only a small step along the negative gradient each time, the direction does not need to be that precise. A reasonable solution is to randomly draw a small portion of the data each time to represent the whole, and compute the gradient and loss on that portion to update the parameters. This method is called Stochastic Gradient Descent (SGD). Its core concepts are listed below, followed by a minimal sketch of the random-sampling idea:
- mini-batch: the batch of data drawn at each iteration is called a mini-batch.
- batch_size: the number of samples contained in one mini-batch is called the batch_size.
- epoch: as the program iterates, samples are drawn mini-batch by mini-batch; once the whole dataset has been traversed, one round of training, called an epoch, is complete. When launching training, the number of epochs num_epochs and the batch_size can be passed in as arguments.
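A minimal sketch of "draw a random portion of the data" using np.random.choice; the variable names x_batch and y_batch are illustrative, and the rest of this section instead achieves the same effect by shuffling the whole dataset once per epoch and slicing it into consecutive mini-batches.
import numpy as np
# Draw one random mini-batch of batch_size samples from train_data.
batch_size = 10
indices = np.random.choice(len(train_data), size=batch_size, replace=False)
mini_batch = train_data[indices]
x_batch = mini_batch[:, :-1]
y_batch = mini_batch[:, -1:]
print(x_batch.shape, y_batch.shape)   # expected: (10, 13) (10, 1)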
Below we walk through the concrete implementation alongside the program; it involves modifying two parts of the code: data processing and the training loop.
Modifying the Data Processing Code
The data processing needs to implement two functions: splitting the data into batches, and shuffling the samples (to achieve the effect of random sampling).
# Load the data
train_data, test_data = load_data()
train_data.shape
- (404, 14)
train_data contains 404 samples in total. With batch_size=10, we take samples 0-9 as the first mini-batch and name it train_data1.
train_data1 = train_data[0:10]
train_data1.shape
- (10, 14)
Use the data in train_data1 (samples 0-9) to compute the gradient and update the network parameters.
net = Network(13)
x = train_data1[:, :-1]
y = train_data1[:, -1:]
loss = net.train(x, y, iterations=1, eta=0.01)
loss
- [0.9001866101467375]
Then take samples 10-19 as the second mini-batch, compute the gradient and update the network parameters.
train_data2 = train_data[10:20]
x = train_data2[:, :-1]
y = train_data2[:, -1:]
loss = net.train(x, y, iterations=1, eta=0.01)
loss
- [0.8903272433979657]
Continuing in this way, we keep taking out new mini-batches and gradually update the network parameters.
Next, we split train_data into multiple mini_batches of size batch_size, as the code below shows: train_data is split into $\frac{404}{10} + 1 = 41$ mini_batches, of which the first 40 contain 10 samples each and the last one contains only 4 samples.
batch_size = 10
n = len(train_data)
mini_batches = [train_data[k:k+batch_size] for k in range(0, n, batch_size)]
print('total number of mini_batches is ', len(mini_batches))
print('first mini_batch shape ', mini_batches[0].shape)
print('last mini_batch shape ', mini_batches[-1].shape)
- total number of mini_batches is 41
- first mini_batch shape (10, 14)
- last mini_batch shape (4, 14)
Note also that here the mini_batches are taken out in order, whereas SGD randomly draws a subset of samples to represent the whole. To achieve the effect of random sampling, we first shuffle the order of the samples in train_data and then extract the mini_batches. Shuffling the sample order requires the np.random.shuffle function, so let us first look at how it is used.
Note:
Extensive experiments show that a model is impressed most by the data it sees last. After the training data is loaded, the closer training is to the end, the larger the influence of the last few batches on the model parameters. To keep such memorization from hurting training, the samples need to be shuffled.
# Create a new array
a = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
print('before shuffle', a)
np.random.shuffle(a)
print('after shuffle', a)
- before shuffle [ 1 2 3 4 5 6 7 8 9 10 11 12]
- after shuffle [ 7 2 11 3 8 6 12 1 4 5 10 9]
Running the code above several times, you will find that the numbers come out in a different order after each shuffle. The example above shuffles a 1-D array; now let us look at the effect of shuffling a 2-D array.
# Create a new array
a = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
a = a.reshape([6, 2])
print('before shuffle\n', a)
np.random.shuffle(a)
print('after shuffle\n', a)
- before shuffle
- [[ 1 2]
- [ 3 4]
- [ 5 6]
- [ 7 8]
- [ 9 10]
- [11 12]]
- after shuffle
- [[ 1 2]
- [ 3 4]
- [ 5 6]
- [ 9 10]
- [11 12]
- [ 7 8]]
The result shows that the array's elements are shuffled along dimension 0, while the order within dimension 1 is preserved: for example, 2 still immediately follows 1 and 8 still immediately follows 7, but the row [3, 4] no longer comes right after [1, 2]. The code below first combines shuffling and batch splitting; afterwards this SGD logic will be integrated into the train function of the Network class to give the final, complete implementation.
# Load the data
train_data, test_data = load_data()
# Shuffle the order of the samples
np.random.shuffle(train_data)
# Split train_data into multiple mini_batches
batch_size = 10
n = len(train_data)
mini_batches = [train_data[k:k+batch_size] for k in range(0, n, batch_size)]
# Create the network
net = Network(13)
# Use the data of each mini_batch in turn
for mini_batch in mini_batches:
x = mini_batch[:, :-1]
y = mini_batch[:, -1:]
loss = net.train(x, y, iterations=1)
Modifying the Training Loop Code
Each randomly drawn mini-batch is fed into the model to train the parameters. The core of the training process is a double loop:
- The outer loop controls how many times the whole sample set is traversed, called an "epoch". The code is:
for epoch_id in range(num_epoches):
- The inner loop iterates over the batches the sample set is split into within one traversal, all of which must be used for training; each pass is called an "iter (iteration)". The code is:
for iter_id, mini_batch in enumerate(mini_batches):
Inside the double loop sits the classic four-step training flow: forward pass -> compute the loss -> compute the gradient -> update the parameters, consistent with what you have learned before:
x = mini_batch[:, :-1]
y = mini_batch[:, -1:]
a = self.forward(x)  # forward pass
loss = self.loss(a, y)  # compute the loss
gradient_w, gradient_b = self.gradient(x, y)  # compute the gradient
self.update(gradient_w, gradient_b, eta)  # update the parameters
Integrating the two modified parts into the train function of the Network class, the final implementation is as follows.
import numpy as np
class Network(object):
def __init__(self, num_of_weights):
        # Randomly initialize w
        # To keep results consistent across runs, a fixed random seed could be set here
#np.random.seed(0)
self.w = np.random.randn(num_of_weights, 1)
self.b = 0.
def forward(self, x):
z = np.dot(x, self.w) + self.b
return z
def loss(self, z, y):
error = z - y
num_samples = error.shape[0]
cost = error * error
cost = np.sum(cost) / num_samples
return cost
def gradient(self, x, y):
z = self.forward(x)
N = x.shape[0]
gradient_w = 1. / N * np.sum((z-y) * x, axis=0)
gradient_w = gradient_w[:, np.newaxis]
gradient_b = 1. / N * np.sum(z-y)
return gradient_w, gradient_b
def update(self, gradient_w, gradient_b, eta = 0.01):
self.w = self.w - eta * gradient_w
self.b = self.b - eta * gradient_b
def train(self, training_data, num_epoches, batch_size=10, eta=0.01):
n = len(training_data)
losses = []
for epoch_id in range(num_epoches):
            # Before each epoch starts, shuffle the order of the training data,
            # then take out batch_size samples at a time
np.random.shuffle(training_data)
            # Split the training data so that each mini_batch contains batch_size samples
mini_batches = [training_data[k:k+batch_size] for k in range(0, n, batch_size)]
for iter_id, mini_batch in enumerate(mini_batches):
#print(self.w.shape)
#print(self.b)
x = mini_batch[:, :-1]
y = mini_batch[:, -1:]
a = self.forward(x)
loss = self.loss(a, y)
gradient_w, gradient_b = self.gradient(x, y)
self.update(gradient_w, gradient_b, eta)
losses.append(loss)
print('Epoch {:3d} / iter {:3d}, loss = {:.4f}'.
format(epoch_id, iter_id, loss))
return losses
# Load the data
train_data, test_data = load_data()
# Create the network
net = Network(13)
# Start training
losses = net.train(train_data, num_epoches=50, batch_size=100, eta=0.1)
# Plot how the loss changes during training
plot_x = np.arange(len(losses))
plot_y = np.array(losses)
plt.plot(plot_x, plot_y)
plt.show()
- Epoch 0 / iter 0, loss = 0.6273
- Epoch 0 / iter 1, loss = 0.4835
- Epoch 0 / iter 2, loss = 0.5830
- Epoch 0 / iter 3, loss = 0.5466
- Epoch 0 / iter 4, loss = 0.2147
- Epoch 1 / iter 0, loss = 0.6645
- Epoch 1 / iter 1, loss = 0.4875
- Epoch 1 / iter 2, loss = 0.4707
- Epoch 1 / iter 3, loss = 0.4153
- Epoch 1 / iter 4, loss = 0.1402
- Epoch 2 / iter 0, loss = 0.5897
- Epoch 2 / iter 1, loss = 0.4373
- Epoch 2 / iter 2, loss = 0.4631
- Epoch 2 / iter 3, loss = 0.3960
- Epoch 2 / iter 4, loss = 0.2340
- Epoch 3 / iter 0, loss = 0.4139
- Epoch 3 / iter 1, loss = 0.5635
- Epoch 3 / iter 2, loss = 0.3807
- Epoch 3 / iter 3, loss = 0.3975
- Epoch 3 / iter 4, loss = 0.1207
- Epoch 4 / iter 0, loss = 0.3786
- Epoch 4 / iter 1, loss = 0.4474
- Epoch 4 / iter 2, loss = 0.4019
- Epoch 4 / iter 3, loss = 0.4352
- Epoch 4 / iter 4, loss = 0.0435
- Epoch 5 / iter 0, loss = 0.4387
- Epoch 5 / iter 1, loss = 0.3886
- Epoch 5 / iter 2, loss = 0.3182
- Epoch 5 / iter 3, loss = 0.4189
- Epoch 5 / iter 4, loss = 0.1741
- Epoch 6 / iter 0, loss = 0.3191
- Epoch 6 / iter 1, loss = 0.3601
- Epoch 6 / iter 2, loss = 0.4199
- Epoch 6 / iter 3, loss = 0.3289
- Epoch 6 / iter 4, loss = 1.2691
- Epoch 7 / iter 0, loss = 0.3202
- Epoch 7 / iter 1, loss = 0.2855
- Epoch 7 / iter 2, loss = 0.4129
- Epoch 7 / iter 3, loss = 0.3331
- Epoch 7 / iter 4, loss = 0.2218
- Epoch 8 / iter 0, loss = 0.2368
- Epoch 8 / iter 1, loss = 0.3457
- Epoch 8 / iter 2, loss = 0.3339
- Epoch 8 / iter 3, loss = 0.3812
- Epoch 8 / iter 4, loss = 0.0534
- Epoch 9 / iter 0, loss = 0.3567
- Epoch 9 / iter 1, loss = 0.4033
- Epoch 9 / iter 2, loss = 0.1926
- Epoch 9 / iter 3, loss = 0.2803
- Epoch 9 / iter 4, loss = 0.1557
- Epoch 10 / iter 0, loss = 0.3435
- Epoch 10 / iter 1, loss = 0.2790
- Epoch 10 / iter 2, loss = 0.3456
- Epoch 10 / iter 3, loss = 0.2076
- Epoch 10 / iter 4, loss = 0.0935
- Epoch 11 / iter 0, loss = 0.3024
- Epoch 11 / iter 1, loss = 0.2517
- Epoch 11 / iter 2, loss = 0.2797
- Epoch 11 / iter 3, loss = 0.2989
- Epoch 11 / iter 4, loss = 0.0301
- Epoch 12 / iter 0, loss = 0.2507
- Epoch 12 / iter 1, loss = 0.2563
- Epoch 12 / iter 2, loss = 0.2971
- Epoch 12 / iter 3, loss = 0.2833
- Epoch 12 / iter 4, loss = 0.0597
- Epoch 13 / iter 0, loss = 0.2827
- Epoch 13 / iter 1, loss = 0.2094
- Epoch 13 / iter 2, loss = 0.2417
- Epoch 13 / iter 3, loss = 0.2985
- Epoch 13 / iter 4, loss = 0.4036
- Epoch 14 / iter 0, loss = 0.3085
- Epoch 14 / iter 1, loss = 0.2015
- Epoch 14 / iter 2, loss = 0.1830
- Epoch 14 / iter 3, loss = 0.2978
- Epoch 14 / iter 4, loss = 0.0630
- Epoch 15 / iter 0, loss = 0.2342
- Epoch 15 / iter 1, loss = 0.2780
- Epoch 15 / iter 2, loss = 0.2571
- Epoch 15 / iter 3, loss = 0.1838
- Epoch 15 / iter 4, loss = 0.0627
- Epoch 16 / iter 0, loss = 0.1896
- Epoch 16 / iter 1, loss = 0.1966
- Epoch 16 / iter 2, loss = 0.2018
- Epoch 16 / iter 3, loss = 0.3257
- Epoch 16 / iter 4, loss = 0.1268
- Epoch 17 / iter 0, loss = 0.1990
- Epoch 17 / iter 1, loss = 0.2031
- Epoch 17 / iter 2, loss = 0.2662
- Epoch 17 / iter 3, loss = 0.2128
- Epoch 17 / iter 4, loss = 0.0133
- Epoch 18 / iter 0, loss = 0.1780
- Epoch 18 / iter 1, loss = 0.1575
- Epoch 18 / iter 2, loss = 0.2547
- Epoch 18 / iter 3, loss = 0.2544
- Epoch 18 / iter 4, loss = 0.2007
- Epoch 19 / iter 0, loss = 0.1657
- Epoch 19 / iter 1, loss = 0.2000
- Epoch 19 / iter 2, loss = 0.2045
- Epoch 19 / iter 3, loss = 0.2524
- Epoch 19 / iter 4, loss = 0.0632
- Epoch 20 / iter 0, loss = 0.1629
- Epoch 20 / iter 1, loss = 0.1895
- Epoch 20 / iter 2, loss = 0.2523
- Epoch 20 / iter 3, loss = 0.1896
- Epoch 20 / iter 4, loss = 0.0918
- Epoch 21 / iter 0, loss = 0.1583
- Epoch 21 / iter 1, loss = 0.2322
- Epoch 21 / iter 2, loss = 0.1567
- Epoch 21 / iter 3, loss = 0.2089
- Epoch 21 / iter 4, loss = 0.2035
- Epoch 22 / iter 0, loss = 0.2273
- Epoch 22 / iter 1, loss = 0.1427
- Epoch 22 / iter 2, loss = 0.1712
- Epoch 22 / iter 3, loss = 0.1826
- Epoch 22 / iter 4, loss = 0.2878
- Epoch 23 / iter 0, loss = 0.1685
- Epoch 23 / iter 1, loss = 0.1622
- Epoch 23 / iter 2, loss = 0.1499
- Epoch 23 / iter 3, loss = 0.2329
- Epoch 23 / iter 4, loss = 0.1486
- Epoch 24 / iter 0, loss = 0.1617
- Epoch 24 / iter 1, loss = 0.2083
- Epoch 24 / iter 2, loss = 0.1442
- Epoch 24 / iter 3, loss = 0.1740
- Epoch 24 / iter 4, loss = 0.1641
- Epoch 25 / iter 0, loss = 0.1159
- Epoch 25 / iter 1, loss = 0.2064
- Epoch 25 / iter 2, loss = 0.1690
- Epoch 25 / iter 3, loss = 0.1778
- Epoch 25 / iter 4, loss = 0.0159
- Epoch 26 / iter 0, loss = 0.1730
- Epoch 26 / iter 1, loss = 0.1861
- Epoch 26 / iter 2, loss = 0.1387
- Epoch 26 / iter 3, loss = 0.1486
- Epoch 26 / iter 4, loss = 0.1090
- Epoch 27 / iter 0, loss = 0.1393
- Epoch 27 / iter 1, loss = 0.1775
- Epoch 27 / iter 2, loss = 0.1564
- Epoch 27 / iter 3, loss = 0.1245
- Epoch 27 / iter 4, loss = 0.7611
- Epoch 28 / iter 0, loss = 0.1470
- Epoch 28 / iter 1, loss = 0.1211
- Epoch 28 / iter 2, loss = 0.1285
- Epoch 28 / iter 3, loss = 0.1854
- Epoch 28 / iter 4, loss = 0.5240
- Epoch 29 / iter 0, loss = 0.1740
- Epoch 29 / iter 1, loss = 0.0898
- Epoch 29 / iter 2, loss = 0.1392
- Epoch 29 / iter 3, loss = 0.1842
- Epoch 29 / iter 4, loss = 0.0251
- Epoch 30 / iter 0, loss = 0.0978
- Epoch 30 / iter 1, loss = 0.1529
- Epoch 30 / iter 2, loss = 0.1640
- Epoch 30 / iter 3, loss = 0.1503
- Epoch 30 / iter 4, loss = 0.0975
- Epoch 31 / iter 0, loss = 0.1399
- Epoch 31 / iter 1, loss = 0.1595
- Epoch 31 / iter 2, loss = 0.1209
- Epoch 31 / iter 3, loss = 0.1203
- Epoch 31 / iter 4, loss = 0.2008
- Epoch 32 / iter 0, loss = 0.1501
- Epoch 32 / iter 1, loss = 0.1310
- Epoch 32 / iter 2, loss = 0.1065
- Epoch 32 / iter 3, loss = 0.1489
- Epoch 32 / iter 4, loss = 0.0818
- Epoch 33 / iter 0, loss = 0.1401
- Epoch 33 / iter 1, loss = 0.1367
- Epoch 33 / iter 2, loss = 0.0970
- Epoch 33 / iter 3, loss = 0.1481
- Epoch 33 / iter 4, loss = 0.0711
- Epoch 34 / iter 0, loss = 0.1157
- Epoch 34 / iter 1, loss = 0.1050
- Epoch 34 / iter 2, loss = 0.1378
- Epoch 34 / iter 3, loss = 0.1505
- Epoch 34 / iter 4, loss = 0.0429
- Epoch 35 / iter 0, loss = 0.1096
- Epoch 35 / iter 1, loss = 0.1279
- Epoch 35 / iter 2, loss = 0.1715
- Epoch 35 / iter 3, loss = 0.0888
- Epoch 35 / iter 4, loss = 0.0473
- Epoch 36 / iter 0, loss = 0.1350
- Epoch 36 / iter 1, loss = 0.0781
- Epoch 36 / iter 2, loss = 0.1458
- Epoch 36 / iter 3, loss = 0.1288
- Epoch 36 / iter 4, loss = 0.0421
- Epoch 37 / iter 0, loss = 0.1083
- Epoch 37 / iter 1, loss = 0.0972
- Epoch 37 / iter 2, loss = 0.1513
- Epoch 37 / iter 3, loss = 0.1236
- Epoch 37 / iter 4, loss = 0.0366
- Epoch 38 / iter 0, loss = 0.1204
- Epoch 38 / iter 1, loss = 0.1341
- Epoch 38 / iter 2, loss = 0.1109
- Epoch 38 / iter 3, loss = 0.0905
- Epoch 38 / iter 4, loss = 0.3906
- Epoch 39 / iter 0, loss = 0.0923
- Epoch 39 / iter 1, loss = 0.1094
- Epoch 39 / iter 2, loss = 0.1295
- Epoch 39 / iter 3, loss = 0.1239
- Epoch 39 / iter 4, loss = 0.0684
- Epoch 40 / iter 0, loss = 0.1188
- Epoch 40 / iter 1, loss = 0.0984
- Epoch 40 / iter 2, loss = 0.1067
- Epoch 40 / iter 3, loss = 0.1057
- Epoch 40 / iter 4, loss = 0.4602
- Epoch 41 / iter 0, loss = 0.1478
- Epoch 41 / iter 1, loss = 0.0980
- Epoch 41 / iter 2, loss = 0.0921
- Epoch 41 / iter 3, loss = 0.1020
- Epoch 41 / iter 4, loss = 0.0430
- Epoch 42 / iter 0, loss = 0.0991
- Epoch 42 / iter 1, loss = 0.0994
- Epoch 42 / iter 2, loss = 0.1270
- Epoch 42 / iter 3, loss = 0.0988
- Epoch 42 / iter 4, loss = 0.1176
- Epoch 43 / iter 0, loss = 0.1286
- Epoch 43 / iter 1, loss = 0.1013
- Epoch 43 / iter 2, loss = 0.1066
- Epoch 43 / iter 3, loss = 0.0779
- Epoch 43 / iter 4, loss = 0.1481
- Epoch 44 / iter 0, loss = 0.0840
- Epoch 44 / iter 1, loss = 0.0858
- Epoch 44 / iter 2, loss = 0.1388
- Epoch 44 / iter 3, loss = 0.1000
- Epoch 44 / iter 4, loss = 0.0313
- Epoch 45 / iter 0, loss = 0.0896
- Epoch 45 / iter 1, loss = 0.1173
- Epoch 45 / iter 2, loss = 0.0916
- Epoch 45 / iter 3, loss = 0.1043
- Epoch 45 / iter 4, loss = 0.0074
- Epoch 46 / iter 0, loss = 0.1008
- Epoch 46 / iter 1, loss = 0.0915
- Epoch 46 / iter 2, loss = 0.0877
- Epoch 46 / iter 3, loss = 0.1139
- Epoch 46 / iter 4, loss = 0.0292
- Epoch 47 / iter 0, loss = 0.0679
- Epoch 47 / iter 1, loss = 0.0987
- Epoch 47 / iter 2, loss = 0.0929
- Epoch 47 / iter 3, loss = 0.1098
- Epoch 47 / iter 4, loss = 0.4838
- Epoch 48 / iter 0, loss = 0.0693
- Epoch 48 / iter 1, loss = 0.1095
- Epoch 48 / iter 2, loss = 0.1128
- Epoch 48 / iter 3, loss = 0.0890
- Epoch 48 / iter 4, loss = 0.1008
- Epoch 49 / iter 0, loss = 0.0724
- Epoch 49 / iter 1, loss = 0.0804
- Epoch 49 / iter 2, loss = 0.0919
- Epoch 49 / iter 3, loss = 0.1233
- Epoch 49 / iter 4, loss = 0.1849
- <Figure size 432x288 with 1 Axes>
Looking at how the Loss evolves above, stochastic gradient descent speeds up the training process, but because each parameter update and loss computation is based on only a small number of samples, the loss curve oscillates as it goes down.
Note:
Because the house price dataset is quite small, it is hard to feel the performance gain brought by stochastic gradient descent here.
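Since each recorded loss is computed on only a handful of samples, one common way to read the downward trend more easily is to plot a moving average of the losses alongside the raw curve. The sketch below does this with np.convolve over the losses list returned by train; the window size of 10 is an arbitrary choice.
import numpy as np
import matplotlib.pyplot as plt
# Smooth the oscillating mini-batch losses with a simple moving average.
window = 10
smoothed = np.convolve(losses, np.ones(window) / window, mode='valid')
plt.plot(np.arange(len(losses)), losses, label='mini-batch loss')
plt.plot(np.arange(len(smoothed)) + window - 1, smoothed, label='moving average')
plt.xlabel('iteration')
plt.ylabel('loss')
plt.legend()
plt.show()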