5-7 Optimizers

There is a group of magic cooks in machine learning. Their daily routine looks like this:

They grab some raw ingredients (data), put them into a pot (model), light a fire (optimization algorithm), and wait until the dish is ready.

However, anyone with cooking experience knows that controlling the heat is the key. Even with the same ingredients and the same recipe, different heat levels lead to totally different results: medium well, burnt, or still raw.

This theory of cooking also applies to machine learning. The choice of optimization algorithm determines the final performance of the model. An unsatisfying result is not necessarily caused by a problem with the features or the model design; it might instead be attributed to the choice of optimization algorithm.

The evolution of optimization algorithms for deep learning is: SGD -> SGDM -> NAG -> Adagrad -> Adadelta (RMSprop) -> Adam -> Nadam.

You may refer to the following article for more details: “Understand the differences in optimization algorithms with just one framework: SGD/AdaGrad/Adam”.

For beginners, choosing Adam as the optimizer and using its default parameters works well in most cases.
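
As a minimal sketch (the tiny model below is just a placeholder for illustration), compiling a Keras model with default Adam takes a single line:

import tensorflow as tf

# A toy model only to have something to compile; the architecture is arbitrary
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

# Adam with its default hyperparameters (learning_rate=0.001, beta_1=0.9, beta_2=0.999)
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")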

Researchers chasing better metrics for publication could use Adam as the initial optimizer and switch to SGD later on to fine-tune the parameters for better performance.
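
A rough sketch of this two-stage recipe is shown below; the model, data, and epoch counts are placeholders for illustration, not tuned values:

import numpy as np
import tensorflow as tf

# Placeholder model and random data purely for illustration
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
x_train = np.random.rand(256, 8).astype("float32")
y_train = np.random.rand(256, 1).astype("float32")

# Stage 1: converge quickly with Adam (default parameters)
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
model.fit(x_train, y_train, epochs=5, verbose=0)

# Stage 2: re-compile with a small-learning-rate SGD and fine-tune;
# re-compiling keeps the learned weights and only resets the optimizer state
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9), loss="mse")
model.fit(x_train, y_train, epochs=5, verbose=0)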

There are also some cutting-edge optimization algorithms that claim better performance, e.g. LazyAdam, Lookahead, RAdam, and Ranger.
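
These are not part of tf.keras itself; one common source is the tensorflow_addons package (assumed to be installed here), for example:

# Assumes tensorflow_addons is installed: pip install tensorflow-addons
import tensorflow_addons as tfa

lazy_adam = tfa.optimizers.LazyAdam(learning_rate=0.001)    # Adam variant with lazy sparse updates
radam = tfa.optimizers.RectifiedAdam(learning_rate=0.001)   # RAdam
ranger = tfa.optimizers.Lookahead(radam, sync_period=6, slow_step_size=0.5)  # Lookahead wrapping RAdam, i.e. Ranger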

1. How To Use the Optimizer

The optimizer accepts variables and their corresponding gradients through the apply_gradients method and updates the given variables iteratively. Alternatively, the minimize method can be used to optimize the target function iteratively.

Another common approach is to pass the optimizer to a keras Model and call the model.fit method to minimize the loss function.

A variable named optimizer.iterations is created when the optimizer is initialized, recording the number of iterations. Therefore, for the same reason as with tf.Variable, the optimizer should be created outside of the function decorated by @tf.function.

import tensorflow as tf
import numpy as np

# Print a timestamp, prefixed by a divider line
@tf.function
def printbar():
    ts = tf.timestamp()
    today_ts = ts % (24*60*60)

    hour = tf.cast(today_ts // 3600 + 8, tf.int32) % tf.constant(24)
    minute = tf.cast((today_ts % 3600) // 60, tf.int32)
    second = tf.cast(tf.floor(today_ts % 60), tf.int32)

    def timeformat(m):
        # Pad single-digit values with a leading zero
        if tf.strings.length(tf.strings.format("{}", m)) == 1:
            return tf.strings.format("0{}", m)
        else:
            return tf.strings.format("{}", m)

    timestring = tf.strings.join([timeformat(hour), timeformat(minute),
                                  timeformat(second)], separator=":")
    tf.print("==========" * 8, end="")
    tf.print(timestring)

# The minimal value of f(x) = a*x**2 + b*x + c
# Here we use optimizer.apply_gradients

x = tf.Variable(0.0, name="x", dtype=tf.float32)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

@tf.function
def minimizef():
    a = tf.constant(1.0)
    b = tf.constant(-2.0)
    c = tf.constant(1.0)

    while tf.constant(True):
        with tf.GradientTape() as tape:
            y = a*tf.pow(x, 2) + b*x + c
        dy_dx = tape.gradient(y, x)
        optimizer.apply_gradients(grads_and_vars=[(dy_dx, x)])

        # Condition of terminating the iteration
        if tf.abs(dy_dx) < tf.constant(0.00001):
            break

        if tf.math.mod(optimizer.iterations, 100) == 0:
            printbar()
            tf.print("step = ", optimizer.iterations)
            tf.print("x = ", x)
            tf.print("")

    y = a*tf.pow(x, 2) + b*x + c
    return y

tf.print("y =", minimizef())
tf.print("x =", x)

# Minimal value of f(x) = a*x**2 + b*x + c
# Here we use optimizer.minimize

x = tf.Variable(0.0, name="x", dtype=tf.float32)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

def f():
    a = tf.constant(1.0)
    b = tf.constant(-2.0)
    c = tf.constant(1.0)
    y = a*tf.pow(x, 2) + b*x + c
    return y

@tf.function
def train(epoch=1000):
    for _ in tf.range(epoch):
        optimizer.minimize(f, [x])
    tf.print("epoch = ", optimizer.iterations)
    return f()

train(1000)
tf.print("y = ", f())
tf.print("x = ", x)

# Minimal value of f(x) = a*x**2 + b*x + c
# Here we use model.fit

tf.keras.backend.clear_session()

class FakeModel(tf.keras.models.Model):
    def __init__(self, a, b, c):
        super(FakeModel, self).__init__()
        self.a = a
        self.b = b
        self.c = c

    def build(self):
        self.x = tf.Variable(0.0, name="x")
        self.built = True

    def call(self, features):
        loss = self.a*(self.x)**2 + self.b*(self.x) + self.c
        return tf.ones_like(features)*loss

def myloss(y_true, y_pred):
    return tf.reduce_mean(y_pred)

model = FakeModel(tf.constant(1.0), tf.constant(-2.0), tf.constant(1.0))

model.build()
model.summary()

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss=myloss)
history = model.fit(tf.zeros((100, 2)),
                    tf.ones(100), batch_size=1, epochs=10)  # Iterate for 1000 times

tf.print("x=", model.x)
tf.print("loss=", model(tf.constant(0.0)))

2. Pre-defined Optimizers

The evolution of optimization algorithms for deep learning is: SGD -> SGDM -> NAG -> Adagrad -> Adadelta (RMSprop) -> Adam -> Nadam.

There are corresponding classes in the keras.optimizers sub-module that implement each of these optimizers; an instantiation sketch follows the list below.

  • SGD: with the default parameters it is pure SGD. With a non-zero momentum parameter it becomes SGDM, which takes the first-order momentum into account. With nesterov = True it becomes NAG (Nesterov Accelerated Gradient), which evaluates the gradient one step ahead.

  • Adagrad: considers the second-order momentum and has a self-adaptive learning rate; the drawback is that the monotonically decreasing learning rate may become too slow in later stages, or stop the learning too early.

  • RMSprop: considers the second-order momentum and has a self-adaptive learning rate; it improves on Adagrad through exponential smoothing, which only considers the second-order momentum within a given window.

  • Adadelta: considers the second-order momentum; similar to RMSprop but more sophisticated, with an improved self-adaptive learning rate.

  • Adam: considers both the first-order and the second-order momentum; it improves on RMSprop by adding first-order momentum.

  • Nadam: improves on Adam by adding Nesterov acceleration.
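
As a quick reference, these pre-defined optimizers can be instantiated as follows (the learning rates shown are just the usual defaults, not tuned values):

import tensorflow as tf

sgd = tf.keras.optimizers.SGD(learning_rate=0.01)                                # pure SGD
sgdm = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)                 # SGDM (first-order momentum)
nag = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)   # NAG
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.001)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
adadelta = tf.keras.optimizers.Adadelta(learning_rate=0.001)
adam = tf.keras.optimizers.Adam(learning_rate=0.001)
nadam = tf.keras.optimizers.Nadam(learning_rate=0.001)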

Please leave comments in the WeChat official account “Python与算法之美” (Elegance of Python and Algorithms) if you want to discuss the content with the author. The author will try his best to reply, given the limited time available.

You are also welcome to join the group chat with the other readers by replying 加群 (join group) in the WeChat official account.
