Compare Stochastic learning strategies for MLPClassifier

http://scikit-learn.org/stable/auto_examples/neural_networks/plot_mlp_training_curves.html#sphx-glr-auto-examples-neural-networks-plot-mlp-training-curves-py

This example plots how the loss curves evolve under different training strategies (optimizers) for MLPClassifier, including SGD and Adam.

1. Stochastic Gradient Descent (SGD):

Stochastic Gradient Descent (SGD) is a refinement of Gradient Descent (GD). In GD the entire training dataset is fed through and the weights are updated only once per pass, based on the accumulated loss, so convergence is slow. SGD instead draws one training sample at random and updates the weights according to the loss of that single sample.
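As a rough illustration of the difference, here is a minimal NumPy sketch of one full-batch GD step versus one SGD step on a made-up least-squares problem (the data and learning rate are purely illustrative and not part of the scikit-learn example):

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(100, 3)                          # toy training data
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(100)
    w = np.zeros(3)                                # weights to learn
    lr = 0.1                                       # learning rate

    # (batch) GD: one update uses the gradient accumulated over *all* samples
    grad_full = 2 * X.T @ (X @ w - y) / len(X)
    w_gd = w - lr * grad_full

    # SGD: one update uses the gradient of a single randomly drawn sample
    i = rng.randint(len(X))
    grad_one = 2 * X[i] * (X[i] @ w - y[i])
    w_sgd = w - lr * grad_one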

2. Momentum:

Momentum is a technique designed to keep GD-type methods from getting trapped in local minima; it lowers the probability of ending up stuck in a local minimum and borrows the concept of momentum from physics.

Look at the blue point in Figure 1: when a GD-type method falls into a local minimum, the gradient there is 0, so the optimizer treats that point as the minimum. To reduce this effect, a fraction of the previous weight update is added to the current update; as the red arrow shows, this gives the weights a chance to roll over the local minimum.

Ex 3: Compare Stochastic learning strategies for MLPClassifier - Figure 1

Figure 1: Illustration of the momentum concept
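In update-rule form, classic momentum keeps a running velocity and reuses part of the previous step. The sketch below is illustrative only; the names velocity, grad, lr, and mu are not taken from scikit-learn's internals:

    def momentum_step(w, velocity, grad, lr=0.2, mu=0.9):
        # reuse a fraction (mu) of the previous update, then add the new
        # gradient step; the carried-over velocity can push the weights
        # across a shallow local minimum even where the gradient is ~0
        velocity = mu * velocity - lr * grad(w)
        return w + velocity, velocity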

3. Nesterov Momentum:

Nesterov momentum is another variant of momentum, likewise aimed at lowering the chance of getting stuck in a local minimum; the difference between the two methods is shown in the figure below:

Ex 3: Compare Stochastic learning strategies for MLPClassifier - Figure 2

Figure 2: The left panel shows classic momentum: (1) compute the gradient, (2) add the momentum term, (3) update the weights.

The right panel shows Nesterov momentum: (1) take the momentum step first, (2) compute the gradient at that position, (3) update the weights.

Source of Figure 2: http://cs231n.github.io/neural-networks-3/
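In the same notation as the momentum sketch above, the only change in Nesterov momentum is where the gradient is evaluated: at the look-ahead point reached after the momentum step, rather than at the current weights (again an illustrative sketch, not scikit-learn's internal code):

    def nesterov_step(w, velocity, grad, lr=0.2, mu=0.9):
        # take the momentum step first, then compute the gradient at that
        # look-ahead position before applying the update
        velocity = mu * velocity - lr * grad(w + mu * velocity)
        return w + velocity, velocity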

4. Adaptive Moment Estimation (Adam):

Adam is a method that adapts the learning rate on its own: based on the gradients it computes, it adjusts a separate learning rate for each parameter.
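A hedged sketch of one Adam update, in the standard textbook form with the usual default hyperparameters (variable names are illustrative; scikit-learn's implementation may differ in detail):

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g         # running mean of the gradient
        v = beta2 * v + (1 - beta2) * g ** 2    # running mean of its square
        m_hat = m / (1 - beta1 ** t)            # bias correction (t = step count, from 1)
        v_hat = v / (1 - beta2 ** t)
        # each parameter gets its own effective step size lr / sqrt(v_hat)
        return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v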

All of the optimization methods above require setting the learning_rate_init value. The results of this example compare four different datasets: the iris dataset, the digits dataset, and the circles and moons datasets generated with sklearn.datasets.

(1) Import libraries

    print(__doc__)
    import matplotlib.pyplot as plt
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import MinMaxScaler
    from sklearn import datasets
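The MinMaxScaler import matters here because MLP training is sensitive to feature scaling; it rescales each feature to the [0, 1] range before fitting. A quick standalone illustration (the array below is made up):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
    print(MinMaxScaler().fit_transform(X))
    # each column is mapped to [0, 1]:
    # [[0.  0. ]
    #  [0.5 0.5]
    #  [1.  1. ]]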

(2) Set the model parameters

    # different learning rate schedules and momentum parameters
    params = [{'solver': 'sgd', 'learning_rate': 'constant', 'momentum': 0,
               'learning_rate_init': 0.2},
              {'solver': 'sgd', 'learning_rate': 'constant', 'momentum': .9,
               'nesterovs_momentum': False, 'learning_rate_init': 0.2},
              {'solver': 'sgd', 'learning_rate': 'constant', 'momentum': .9,
               'nesterovs_momentum': True, 'learning_rate_init': 0.2},
              {'solver': 'sgd', 'learning_rate': 'invscaling', 'momentum': 0,
               'learning_rate_init': 0.2},
              {'solver': 'sgd', 'learning_rate': 'invscaling', 'momentum': .9,
               'nesterovs_momentum': True, 'learning_rate_init': 0.2},
              {'solver': 'sgd', 'learning_rate': 'invscaling', 'momentum': .9,
               'nesterovs_momentum': False, 'learning_rate_init': 0.2},
              {'solver': 'adam', 'learning_rate_init': 0.01}]

    labels = ["constant learning-rate", "constant with momentum",
              "constant with Nesterov's momentum",
              "inv-scaling learning-rate", "inv-scaling with momentum",
              "inv-scaling with Nesterov's momentum", "adam"]

    plot_args = [{'c': 'red', 'linestyle': '-'},
                 {'c': 'green', 'linestyle': '-'},
                 {'c': 'blue', 'linestyle': '-'},
                 {'c': 'red', 'linestyle': '--'},
                 {'c': 'green', 'linestyle': '--'},
                 {'c': 'blue', 'linestyle': '--'},
                 {'c': 'black', 'linestyle': '-'}]
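Each dict in params is later unpacked into the MLPClassifier constructor with **param, together with the shared settings from the training loop; for example, the first configuration is equivalent to building the estimator like this (assuming max_iter ends up as 400, as it does for every dataset except digits):

    mlp = MLPClassifier(solver='sgd', learning_rate='constant', momentum=0,
                        learning_rate_init=0.2, verbose=0, random_state=0,
                        max_iter=400)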

(3) Plot the loss curves

    def plot_on_dataset(X, y, ax, name):
        # for each dataset, plot learning for each learning strategy
        print("\nlearning on dataset %s" % name)
        ax.set_title(name)
        X = MinMaxScaler().fit_transform(X)
        mlps = []
        if name == "digits":
            # digits is larger but converges fairly quickly
            max_iter = 15
        else:
            max_iter = 400

        for label, param in zip(labels, params):
            print("training: %s" % label)
            mlp = MLPClassifier(verbose=0, random_state=0,
                                max_iter=max_iter, **param)
            mlp.fit(X, y)
            mlps.append(mlp)
            print("Training set score: %f" % mlp.score(X, y))
            print("Training set loss: %f" % mlp.loss_)
        for mlp, label, args in zip(mlps, labels, plot_args):
            ax.plot(mlp.loss_curve_, label=label, **args)

    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    # load / generate some toy datasets
    iris = datasets.load_iris()
    digits = datasets.load_digits()
    data_sets = [(iris.data, iris.target),
                 (digits.data, digits.target),
                 datasets.make_circles(noise=0.2, factor=0.5, random_state=1),
                 datasets.make_moons(noise=0.3, random_state=0)]

    for ax, data, name in zip(axes.ravel(), data_sets, ['iris', 'digits',
                                                        'circles', 'moons']):
        plot_on_dataset(*data, ax=ax, name=name)

    fig.legend(ax.get_lines(), labels=labels, ncol=3, loc="upper center")
    plt.show()

Ex 3: Compare Stochastic learning strategies for MLPClassifier - Figure 3

Figure 3: Comparison of how the loss curves decrease for the different learning strategies on the four datasets
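The plotted curves come from each model's loss_curve_ attribute; if you want to inspect a single strategy numerically, the fitted estimator also exposes the number of iterations actually run. A small self-contained example on the iris data (the hyperparameters here are just one of the configurations above):

    from sklearn import datasets
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import MinMaxScaler

    X, y = datasets.load_iris(return_X_y=True)
    mlp = MLPClassifier(solver='sgd', learning_rate='constant', momentum=0.9,
                        learning_rate_init=0.2, max_iter=400, random_state=0)
    mlp.fit(MinMaxScaler().fit_transform(X), y)
    print("iterations run:", mlp.n_iter_)                # epochs actually performed
    print("final training loss:", mlp.loss_curve_[-1])   # last point on the curve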

(4) Complete code

    print(__doc__)
    import matplotlib.pyplot as plt
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import MinMaxScaler
    from sklearn import datasets

    # different learning rate schedules and momentum parameters
    params = [{'solver': 'sgd', 'learning_rate': 'constant', 'momentum': 0,
               'learning_rate_init': 0.2},
              {'solver': 'sgd', 'learning_rate': 'constant', 'momentum': .9,
               'nesterovs_momentum': False, 'learning_rate_init': 0.2},
              {'solver': 'sgd', 'learning_rate': 'constant', 'momentum': .9,
               'nesterovs_momentum': True, 'learning_rate_init': 0.2},
              {'solver': 'sgd', 'learning_rate': 'invscaling', 'momentum': 0,
               'learning_rate_init': 0.2},
              {'solver': 'sgd', 'learning_rate': 'invscaling', 'momentum': .9,
               'nesterovs_momentum': True, 'learning_rate_init': 0.2},
              {'solver': 'sgd', 'learning_rate': 'invscaling', 'momentum': .9,
               'nesterovs_momentum': False, 'learning_rate_init': 0.2},
              {'solver': 'adam', 'learning_rate_init': 0.01}]

    labels = ["constant learning-rate", "constant with momentum",
              "constant with Nesterov's momentum",
              "inv-scaling learning-rate", "inv-scaling with momentum",
              "inv-scaling with Nesterov's momentum", "adam"]

    plot_args = [{'c': 'red', 'linestyle': '-'},
                 {'c': 'green', 'linestyle': '-'},
                 {'c': 'blue', 'linestyle': '-'},
                 {'c': 'red', 'linestyle': '--'},
                 {'c': 'green', 'linestyle': '--'},
                 {'c': 'blue', 'linestyle': '--'},
                 {'c': 'black', 'linestyle': '-'}]

    def plot_on_dataset(X, y, ax, name):
        # for each dataset, plot learning for each learning strategy
        print("\nlearning on dataset %s" % name)
        ax.set_title(name)
        X = MinMaxScaler().fit_transform(X)
        mlps = []
        if name == "digits":
            # digits is larger but converges fairly quickly
            max_iter = 15
        else:
            max_iter = 400

        for label, param in zip(labels, params):
            print("training: %s" % label)
            mlp = MLPClassifier(verbose=0, random_state=0,
                                max_iter=max_iter, **param)
            mlp.fit(X, y)
            mlps.append(mlp)
            print("Training set score: %f" % mlp.score(X, y))
            print("Training set loss: %f" % mlp.loss_)
        for mlp, label, args in zip(mlps, labels, plot_args):
            ax.plot(mlp.loss_curve_, label=label, **args)

    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    # load / generate some toy datasets
    iris = datasets.load_iris()
    digits = datasets.load_digits()
    data_sets = [(iris.data, iris.target),
                 (digits.data, digits.target),
                 datasets.make_circles(noise=0.2, factor=0.5, random_state=1),
                 datasets.make_moons(noise=0.3, random_state=0)]

    for ax, data, name in zip(axes.ravel(), data_sets, ['iris', 'digits',
                                                        'circles', 'moons']):
        plot_on_dataset(*data, ax=ax, name=name)

    fig.legend(ax.get_lines(), labels=labels, ncol=3, loc="upper center")
    plt.show()