In both linear and logistic regression, the choice of the polynomial degree for the hypothesis function is critical. Too low a degree can underfit the data, while a very high degree can overfit it, as shown below.
In the figure on the left, the data is underfit because we try to fit it with a first-order polynomial, which is a straight line. This is a case of high ‘bias’.
In the rightmost figure, a much higher-degree polynomial is used. The curve passes through all the data points; however, it is not effective at predicting other values. This is a case of overfitting, or high variance.
The middle figure is just right, as it captures the trend in the data points without bending to fit each one.
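The three cases described above can be reproduced with a small experiment. This is only a sketch under assumptions: the data (a quadratic trend with noise), the degrees 1, 2 and 9, and the helper name `mse` are all made up for illustration and are not taken from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: a quadratic trend plus a little noise.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = x_train**2 + rng.normal(scale=0.1, size=x_train.size)

# Held-out points from the same noise-free trend, used to judge generalization.
x_test = np.linspace(-0.95, 0.95, 50)
y_test = x_test**2

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_err, test_err = {}, {}
for degree in (1, 2, 9):
    # Least-squares polynomial fit of the chosen degree.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err[degree] = mse(coeffs, x_train, y_train)
    test_err[degree] = mse(coeffs, x_test, y_test)
    print(f"degree {degree}: train MSE {train_err[degree]:.5f}, "
          f"test MSE {test_err[degree]:.5f}")
```

The training error can only go down as the degree goes up, and with 10 points a degree-9 polynomial passes through every one of them. The held-out error is the number to watch: a curve that fits the training points exactly can still predict new values poorly.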
A similar problem exists with logistic regression, as shown below.
There are two ways to handle overfitting:
a) Reducing the number of features selected
b) Using regularization
In regularization, the magnitude of the parameters θ is decreased to reduce the effect of overfitting.
Hence, if we choose a hypothesis function
$$h_\theta(x) = \theta_0 + \theta_1 x_1^2 + \theta_2 x_2^2 + \theta_3 x_3^3 + \theta_4 x_4^4$$
The cost function for this without regularization, as mentioned in earlier posts, is
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
where the goal is to minimize the above function to get the least error.
The cost function with regularization becomes
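$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$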
As can be seen, regularization adds a penalty term $\lambda \sum_j \theta_j^2$ to the cost function that needs to be minimized, so large parameter values now increase the cost.
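A minimal sketch of this cost function in NumPy may help make the penalty concrete. It rests on two assumptions: the penalty is scaled by 1/2m along with the squared-error term (matching the λ/2m scaling in the logistic regression cost later in this post), and θ0 is left out of the penalty, which is the convention in Prof. Ng's course. The function name `regularized_cost` and the tiny dataset are hypothetical.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus an L2 penalty on theta[1:].

    X is the design matrix with a leading column of ones,
    so theta[0] is the intercept (assumed not to be penalized).
    """
    m = y.size
    residuals = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)
    return float((np.sum(residuals**2) + penalty) / (2 * m))

# Hypothetical data that theta = [0, 1] fits exactly.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.0])

cost_plain = regularized_cost(theta, X, y, lam=0.0)
cost_penalized = regularized_cost(theta, X, y, lam=3.0)
print(cost_plain, cost_penalized)
```

Even though this θ fits the data perfectly, the cost is no longer zero once λ is positive: the penalty contributes λ·θ1²/(2m) on its own.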
Hence, with the regularization term the problem of overfitting can be reduced.
However, the trick is to determine the value of λ. If λ is too big, the parameters are shrunk so much that the result is underfitting, or high bias. If λ is too small, the penalty has little effect and the overfitting remains.
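The effect of λ on the parameters can be seen with a small experiment. This sketch swaps gradient descent for the regularized normal equation, θ = (XᵀX + λL)⁻¹Xᵀy, where L is the identity matrix with its top-left entry set to 0 so that θ0 is not penalized. The data, the degree-6 polynomial features, the λ values and the name `fit_regularized` are assumptions made for illustration.

```python
import numpy as np

def fit_regularized(X, y, lam):
    """Solve (X^T X + lam * L) theta = X^T y, with L = identity except L[0, 0] = 0."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0  # leave the intercept theta_0 unpenalized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 12)
y = x**2 + rng.normal(scale=0.1, size=x.size)

# Polynomial features 1, x, x^2, ..., x^6 (more flexible than the data needs).
X = np.vander(x, 7, increasing=True)

norms, train_mse = [], []
for lam in (0.0, 0.1, 10.0, 1000.0):
    theta = fit_regularized(X, y, lam)
    norms.append(float(np.linalg.norm(theta[1:])))
    train_mse.append(float(np.mean((X @ theta - y) ** 2)))
    print(f"lambda {lam:>7}: |theta[1:]| = {norms[-1]:.4f}, "
          f"train MSE = {train_mse[-1]:.4f}")
```

As λ grows, the size of the penalized parameters shrinks and the training error rises. At the largest λ the fit is close to a flat line, which is the underfitting case described above.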
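Similarly, the regularized cost function for logistic regression is as shown below
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log\left(h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$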
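The logistic regression cost can be sketched the same way. The assumptions here are that hθ(x) is the sigmoid of Xθ, that X has a leading column of ones, and that θ0 is excluded from the penalty as in Prof. Ng's course. The names `sigmoid` and `logistic_cost` and the small dataset are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Cross-entropy cost plus (lam / 2m) * sum of theta_j^2 for j >= 1."""
    m = y.size
    h = sigmoid(X @ theta)
    cross_entropy = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return float(cross_entropy + penalty)

# Hypothetical data: one feature plus the intercept column, labels 0 or 1.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# With theta = 0 every prediction is 0.5, so the cost is log(2) for any lambda.
cost_at_zero = logistic_cost(np.zeros(2), X, y, lam=5.0)

# The penalty alone accounts for the gap between these two values.
theta = np.array([0.0, 2.0])
gap = logistic_cost(theta, X, y, lam=1.0) - logistic_cost(theta, X, y, lam=0.0)
print(cost_at_zero, gap)
```

With m = 4, λ = 1 and θ1 = 2, the gap is λ·θ1²/(2m) = 0.5, which is the penalty term on its own.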
Some tips suggested by Prof. Andrew Ng for determining the parameters and features for regression:
a) Get as many training examples as possible. It is worth spending more effort on collecting more examples
b) Add additional features
c) Observe how the learning algorithm changes with different values of λ
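One way to act on tip (c) is to fit the model for several values of λ and compare the error on the training examples with the error on examples held back for validation. The sketch below is only an illustration under assumptions: the data, the 25/15 split, the list of λ values and the helper names are made up, and the regularized normal equation stands in for whatever optimizer is actually used.

```python
import numpy as np

def fit_regularized(X, y, lam):
    """Regularized normal equation; the intercept theta_0 is not penalized."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def mse(theta, X, y):
    return float(np.mean((X @ theta - y) ** 2))

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=40)
y = x**2 + rng.normal(scale=0.1, size=x.size)
X = np.vander(x, 9, increasing=True)  # features 1, x, ..., x^8

# Hypothetical split: first 25 examples for training, the rest for validation.
X_train, y_train = X[:25], y[:25]
X_val, y_val = X[25:], y[25:]

lams = [0.0, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
results = []
for lam in lams:
    theta = fit_regularized(X_train, y_train, lam)
    results.append((lam, mse(theta, X_train, y_train), mse(theta, X_val, y_val)))
    print(f"lambda {lam:>7}: train MSE {results[-1][1]:.4f}, "
          f"validation MSE {results[-1][2]:.4f}")

# Pick the lambda whose model does best on the held-back examples.
best_lam = min(results, key=lambda r: r[2])[0]
print("lambda with lowest validation error:", best_lam)
```

The training error always rises with λ, so it cannot be used to choose λ by itself. The validation error is the one that shows where the balance between overfitting and underfitting lies.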
This post is continued in my next post – Simplifying ML: Impact of degree of polynomial on bias, variance and other insights
Note: This post, in line with my previous posts on Machine Learning, is based on the Coursera course on Machine Learning by Professor Andrew Ng