Chapter 4: Computing parameters analytically
Another option to minimize the cost function is to use normal equation, which uses straightforward derivatives to calculate for the values of parameters.
No feature scaling needed when using normal equation.
Gradient Descent vs Normal Equation: Normal equation doesn’t need to iterate and there is no need to choose a learning rate. Gradient descent works well when the number of features is large but normal equation is very slow when number of features is very large. So we can choose what algorithm to use based on the number of features in the training data set.
- Feature scaling allows the gradient descent to make less iterations as the shape is less skewed.