Chapter 3: Multivariate Linear Regression
Feature scaling is the idea of making sure that the features are on a similar scale. This allows gradient descent to converge much faster, because the contours of the cost function become more round rather than elongated, so the algorithm takes a more direct path to the minimum. The goal is to rescale each feature so that its values fall within a comparable, narrow range rather than a skewed one; for instance, scale all features to roughly the range 0 to 1.
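As a minimal sketch of one common approach, mean normalization, here is how scaling might look in Python with NumPy (the notes do not prescribe a specific library, and the house-size data below is purely illustrative):

```python
import numpy as np

# Mean normalization: shift each feature by its mean and divide by its
# range (max - min), so every column ends up roughly within [-0.5, 0.5].
def scale_features(X):
    mu = X.mean(axis=0)                            # per-feature mean
    feature_range = X.max(axis=0) - X.min(axis=0)  # per-feature range
    return (X - mu) / feature_range

# Hypothetical example: house size (sq. ft.) and number of bedrooms
# start on very different scales.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
print(scale_features(X))  # both columns now span a comparable, narrow range
```

After scaling, no single feature dominates the gradient, which is what rounds out the cost-function contours.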
For a sufficiently small learning rate, the cost function should decrease on every iteration. But if the learning rate is too small, gradient descent can be very slow to converge. Similarly, if the learning rate is too big, gradient descent can overshoot the minimum and will more than likely fail to converge.
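The sketch below illustrates this trade-off on a tiny made-up dataset: the learning rates tried (0.01, 0.1, 1.5) are assumptions chosen only to show slow convergence, healthy convergence, and divergence.

```python
import numpy as np

# Batch gradient descent on the linear-regression cost
# J(theta) = (1 / 2m) * sum((X @ theta - y)^2).
def gradient_descent(X, y, alpha, num_iters=50):
    m, n = X.shape
    theta = np.zeros(n)
    costs = []
    for _ in range(num_iters):
        error = X @ theta - y
        theta -= (alpha / m) * (X.T @ error)  # simultaneous update of every theta_j
        costs.append((error @ error) / (2 * m))
    return theta, costs

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column is the bias term
y = np.array([1.0, 2.0, 3.0])
for alpha in (0.01, 0.1, 1.5):  # too small, reasonable, too large
    _, costs = gradient_descent(X, y, alpha)
    print(f"alpha={alpha}: final cost {costs[-1]:.4g}")
```

Plotting the cost against the iteration number for each learning rate makes the three behaviors easy to diagnose in practice.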
Feature scaling becomes even more important when fitting a quadratic or cubic function through the points, since the polynomial terms x, x², and x³ take on values of very different magnitudes.
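For instance, if x ranges from 1 to 100, then x² ranges up to 10,000 and x³ up to 1,000,000. A brief hypothetical illustration, here using z-score scaling rather than the 0-to-1 range mentioned above:

```python
import numpy as np

# Polynomial terms of a single raw feature occupy wildly different
# ranges, so each column is standardized before running gradient descent.
x = np.linspace(1.0, 100.0, 50)            # raw feature, e.g. house size
X_poly = np.column_stack([x, x**2, x**3])  # ranges ~1e2, ~1e4, ~1e6

mu = X_poly.mean(axis=0)
sigma = X_poly.std(axis=0)
X_scaled = (X_poly - mu) / sigma           # z-score scaling per column
print(X_scaled.min(axis=0), X_scaled.max(axis=0))  # comparable ranges now
```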