Overfitting and Regularization
NYU K12 STEM Education: Machine Learning (Day 3)
Polynomial Fitting
- We have been using straight lines to fit our data, but this does not work well every time
- Some data have more complex relationships that cannot be fitted well with a straight line
Polynomial: a weighted sum of powers of a variable
Examples: \(y = x^2 + 2, y = 5x^3 − 3x^2 + 4\)
- Polynomials of \(x\): \(\hat{y} = w_0 + w_1x +w_2x^2 + w_3x^3 + \cdots + w_Mx^M\)
- \(M\) is called the order of the polynomial.
- The process of fitting a polynomial is similar to linearly fitting multivariate data.
- Rewrite in matrix-vector form \[ \begin{bmatrix} \hat{y_1} \\ \hat{y_2} \\ \vdots \\ \hat{y_N} \end{bmatrix} = \begin{bmatrix}1 & x_{1} & x_{1}^2 & \cdots & x_{1}^M \\ 1 & x_{2} & x_{2}^2 & \cdots & x_{2}^M \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N} & x_{N}^2 & \cdots & x_{N}^M\end{bmatrix} \begin{bmatrix}w_0 \\ w_1 \\ \vdots \\ w_M \end{bmatrix} \]
- This can still be written as: \(\hat{Y} = Xw\)
- Loss: \[ J(w) = \frac{1}{N} ||Y - Xw ||^2 \]
- The \(i^{th}\) row of the design matrix \(X\) is simply a transformed feature: \[ \phi(x_i) = (1, x_i, x_i^2, \cdots, x_i^M) \]
- Original Design Matrix: \[ \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix} \]
- Design matrix after feature transformation: \[ \begin{bmatrix}1 & x_{1} & x_{1}^2 & \cdots & x_{1}^M \\ 1 & x_{2} & x_{2}^2 & \cdots & x_{2}^M \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N} & x_{N}^2 & \cdots & x_{N}^M\end{bmatrix} \]
- For polynomial fitting, we simply added columns of features that are powers of the original feature (see the sketch below).
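A minimal NumPy sketch of this procedure (the toy data and the order \(M = 3\) are assumptions for illustration; `np.vander` builds the transformed design matrix and `np.linalg.lstsq` solves the least-squares problem):

```python
import numpy as np

# Toy data for illustration: a noisy quadratic relationship (assumed, not from the slides)
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(x.shape)

M = 3  # order of the polynomial

# Design matrix after the feature transformation: column j holds x^j for j = 0..M
X = np.vander(x, M + 1, increasing=True)

# Solve the least-squares problem min_w ||y - Xw||^2
w, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ w
print("weights:", w)
print("training MSE:", np.mean((y - y_hat) ** 2))
```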
Overfitting
- We learned how to fit our data using polynomials of different orders
- With a higher model order, we can fit the data with increasing accuracy
- As you increase the model order, at a certain point it becomes possible to find a model that fits your data perfectly (i.e., zero error)
- What could be the problem?
- Which of these models do you think is the best? Why?
- The problem is that we are only fitting our model to the data we are given
- Data usually contains noise
- When a model becomes too complex, it will start to fit the noise in the data
- What happens if we apply our model to predict some data that the model has never seen before? It will not work well.
- This is called over-fitting.
- Split the dataset into a train set and a test set
- The train set is used to train the model
- The test set is not seen by the model during the training process
- Once the model is trained, use the test set to evaluate it
- With the training and test sets shown, which one do you think is the better model now?
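A minimal sketch of such a split in NumPy, assuming toy data, a 25% test fraction, and an order-9 polynomial chosen just for illustration:

```python
import numpy as np

# Toy data for illustration (assumed)
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.shape)

# Randomly hold out 25% of the points as the test set
perm = rng.permutation(len(x))
n_test = len(x) // 4
test_idx, train_idx = perm[:n_test], perm[n_test:]

# Fit a high-order polynomial on the training points only
M = 9
X_train = np.vander(x[train_idx], M + 1, increasing=True)
w, *_ = np.linalg.lstsq(X_train, y[train_idx], rcond=None)

# Evaluate on the test points the model has never seen
X_test = np.vander(x[test_idx], M + 1, increasing=True)
print("train MSE:", np.mean((y[train_idx] - X_train @ w) ** 2))
print("test  MSE:", np.mean((y[test_idx] - X_test @ w) ** 2))
```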
Train and Test Error
- Plot of train error and test error for different model orders
- Initially, both train and test error go down as the model order increases
- But at a certain point, the test error starts to increase because of overfitting
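A sketch that could produce these curves, sweeping the model order on an assumed toy dataset and printing the train and test MSE at each order:

```python
import numpy as np

# Toy data and a fixed train/test split (assumed for illustration)
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.shape)
perm = rng.permutation(len(x))
test_idx, train_idx = perm[:10], perm[10:]

# Sweep the model order and record the train/test error for each
for M in range(1, 12):
    X_tr = np.vander(x[train_idx], M + 1, increasing=True)
    X_te = np.vander(x[test_idx], M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X_tr, y[train_idx], rcond=None)
    train_mse = np.mean((y[train_idx] - X_tr @ w) ** 2)
    test_mse = np.mean((y[test_idx] - X_te @ w) ** 2)
    print(f"order {M:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Typically the train MSE keeps falling as the order grows, while the test MSE eventually turns back up.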
Regularization
- Regularization: methods to prevent overfitting
- Is there another way?
- Solution: We can change our cost function.
Weight Based Regularization
- Looking back at the polynomial overfitting example
- Notice that the magnitude of the weights grows as the model overfits
New Cost Function: \[ J(w) = \frac{1}{N} ||Y - Xw ||^2 + \lambda ||w||^2 \]
- Penalize complexity by simultaneously minimizing the size of the weights.
- We call \(\lambda\) a hyper-parameter
- \(\lambda\) determines the relative importance of the data-fit term and the weight-penalty term
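Setting the gradient of this new cost to zero gives a closed-form minimizer, \(w = (X^\top X + N\lambda I)^{-1} X^\top Y\), for the \(\frac{1}{N}\)-scaled loss above. A minimal NumPy sketch, with toy data and \(\lambda\) values assumed just for illustration:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/N)||y - Xw||^2 + lam * ||w||^2 in closed form."""
    N, D = X.shape
    # Setting the gradient to zero gives (X^T X + N*lam*I) w = X^T y
    return np.linalg.solve(X.T @ X + N * lam * np.eye(D), X.T @ y)

# Hypothetical usage with a high-order polynomial design matrix
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.shape)
X = np.vander(x, 10, increasing=True)   # order-9 polynomial features

w_plain = ridge_fit(X, y, lam=0.0)
w_reg = ridge_fit(X, y, lam=1e-2)
print("||w|| without regularization:", np.linalg.norm(w_plain))
print("||w|| with regularization:   ", np.linalg.norm(w_reg))
```

Here the bias column is penalized along with every other weight, matching the \(\|w\|^2\) term in the cost; in practice the bias is often left unpenalized.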
Tuning Hyper-parameters
- Motivation: a hyper-parameter should never be chosen based on the training data alone
- Hyper-parameter: a parameter of the algorithm that is not a model parameter solved for during optimization. Example: the weight-regularization value \(\lambda\), as opposed to the model weights (\(w\))
- Solution: split dataset into three
- Training set: to compute the model-parameters (\(w\))
- Validation set: to tune hyper-parameters (\(\lambda\))
- Test set: to compute the performance of the algorithm (MSE)
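A minimal sketch of this three-way workflow, assuming a toy dataset, a fixed polynomial order, and a small grid of candidate \(\lambda\) values (all illustrative choices):

```python
import numpy as np

# Assumed toy dataset and a 60/20/20 three-way split (illustrative)
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.shape)
perm = rng.permutation(len(x))
tr, va, te = perm[:30], perm[30:40], perm[40:]

def design(xs, M=9):
    """Polynomial feature transform (order M assumed for illustration)."""
    return np.vander(xs, M + 1, increasing=True)

def ridge_fit(X, y, lam):
    """Closed-form minimizer of (1/N)||y - Xw||^2 + lam*||w||^2."""
    N, D = X.shape
    return np.linalg.solve(X.T @ X + N * lam * np.eye(D), X.T @ y)

def mse(X, y, w):
    return np.mean((y - X @ w) ** 2)

# Model parameters w come from the training set; lambda is chosen on the validation set
best = (np.inf, None, None)
for lam in [0.0, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
    w = ridge_fit(design(x[tr]), y[tr], lam)
    val_err = mse(design(x[va]), y[va], w)
    if val_err < best[0]:
        best = (val_err, lam, w)

val_err, lam, w = best
# The test set is touched only once, to report the final performance
print("chosen lambda:", lam, "validation MSE:", val_err)
print("test MSE:", mse(design(x[te]), y[te], w))
```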
Non-linear Optimization
- We cannot always rely on closed-form solutions
- Computational efficiency: operations like inverting a matrix are expensive for large problems
- For more complex problems such as neural networks, a closed-form solution is not always available
- Need an optimization technique to find an optimal solution
- Machine learning practitioners use gradient-based methods
Gradient Descent Algorithm
Update Rule:
\[ \text{Repeat until convergence:} \qquad w \leftarrow w - \alpha \nabla J(w) \]
- \(\alpha\) is the learning rate, another hyper-parameter
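A minimal sketch of this update applied to the regularized least-squares loss from earlier; the toy data, step size \(\alpha\), and iteration count are assumptions for illustration:

```python
import numpy as np

def gradient_descent(X, y, lam=0.0, alpha=0.1, n_iters=5000):
    """Minimize J(w) = (1/N)||y - Xw||^2 + lam*||w||^2 by repeated gradient steps."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        # Gradient of the regularized loss: (2/N) X^T (Xw - y) + 2*lam*w
        grad = (2.0 / N) * X.T @ (X @ w - y) + 2.0 * lam * w
        w = w - alpha * grad          # the update rule above
    return w

# Hypothetical usage on a small polynomial-fitting problem (data assumed for illustration)
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(x.shape)
X = np.vander(x, 4, increasing=True)  # order-3 polynomial features

w_gd = gradient_descent(X, y, lam=1e-3, alpha=0.1)
print("weights found by gradient descent:", w_gd)
```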
General Loss Function Contours
- Most loss function contours are not perfectly parabolic
- Our goal is to find a solution that is very close to the global minimum through the right choice of hyper-parameters
Understanding Learning Rate
- Small Learning Rate: the steps are tiny, so convergence is very slow
- Correct Learning Rate: the steps shrink steadily and reach the minimum quickly
- Large Learning Rate: the steps overshoot the minimum, and the loss can oscillate or even diverge
Some Learning Rate Animations:
- Animations of gradient descent with small, correct, and large learning rates