Linear Regression
NYU K12 STEM Education: Machine Learning (Day 2)
Statistics Review
In machine learning, a solid understanding of basic statistical concepts is essential for analyzing data and interpreting model results.
Mean
The mean, or average, is the sum of all the values in a dataset divided by the number of values. It provides a measure of the central tendency of the data.
Formula: \[ \text{Mean} (\mu) = \frac{1}{N} \sum_{i=1}^{N} x_i \]
Example:
For the dataset [2, 4, 6, 8, 10]: \[ \mu = \frac{2 + 4 + 6 + 8 + 10}{5} = 6 \]
Variance
Variance measures the spread of the data points around the mean. It indicates how much the data varies from the mean.
Formula: \[ \text{Variance} (\sigma^2) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \]
Example:
For the dataset [2, 4, 6, 8, 10]: \[ \sigma^2 = \frac{(2-6)^2 + (4-6)^2 + (6-6)^2 + (8-6)^2 + (10-6)^2}{5} = 8 \]
Mean and Variance Visualization
Standard Deviation
Standard deviation is the square root of the variance. It provides a measure of the spread of the data points in the same units as the data itself.
Formula: \[ \text{Standard Deviation} (\sigma) = \sqrt{\text{Variance}} \]
Example:
Using the variance calculated above: \[ \sigma = \sqrt{8} \approx 2.83 \]
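These values are easy to verify with NumPy (the library used in the course demos); note that `np.var` and `np.std` divide by \(N\) by default, matching the formulas above:

```python
import numpy as np

data = np.array([2, 4, 6, 8, 10])

print(np.mean(data))   # 6.0
print(np.var(data))    # 8.0        (population variance, divides by N)
print(np.std(data))    # 2.828...   (square root of the variance)
```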
Standard Deviation Visualization
Covariance
Covariance measures the degree to which two variables change together. If the covariance is positive, the variables tend to increase together; if negative, one variable tends to increase when the other decreases.
Formula: \[ \text{Covariance} (\text{Cov}(X, Y)) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y) \]
Example: For the datasets X = [2, 4, 6] and Y = [3, 6, 9]: \[ \text{Cov}(X, Y) = \frac{(2-4)(3-6) + (4-4)(6-6) + (6-4)(9-6)}{3} = \frac{6 + 0 + 6}{3} = 4 \]
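The covariance example can be checked the same way; note that `np.cov` divides by \(N - 1\) by default, so `bias=True` is needed to match the formula above:

```python
import numpy as np

X = np.array([2, 4, 6])
Y = np.array([3, 6, 9])

# Manual computation following the formula (divide by N)
cov_manual = np.mean((X - X.mean()) * (Y - Y.mean()))

# np.cov returns the 2x2 covariance matrix; bias=True divides by N
cov_matrix = np.cov(X, Y, bias=True)

print(cov_manual, cov_matrix[0, 1])   # both 4.0
```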
Linear Regression
In a Nutshell…
- Consider a function \(y = 2x + 1 \)
- Here we introduce a new notation \(f(x) = 2x + 1\)
- This means we have a function \(f(x)\) that takes \(x\) as its variable.
- If we have different \(x\) values we will have different values of \(f(x)\).
Example:
- For \(f(x) = 2x + 1\) and setting \(x = 1\) we have \(f(x) = 3\)
- For \(f(x) = 2x + 1\) and setting \(x = 0\) we have \(f(x) = 1\)
- For \(f(x) = 2x + 1\) and setting \(x = -1.5\) we have \(f(x) = -2\)
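As a quick sketch, the same function and the three evaluations above can be written in plain Python:

```python
def f(x):
    """The function f(x) = 2x + 1."""
    return 2 * x + 1

print(f(1))     # 3
print(f(0))     # 1
print(f(-1.5))  # -2.0
```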
- We believe that datasets are representations of underlying models, which can be expressed as functions of features.
- For example, to build a model that forecasts the weather, we can use the features humidity, current temperature, and wind speed to estimate what the temperature will be tomorrow.
- Here \(f(x)\) represents tomorrow's temperature and \(x\) is a vector containing humidity, current temperature, and wind speed.
- But often we do not have \(f(x)\) available; our task is to figure out what \(f(x)\) is using the data available to us.
- Here \(f(x)\) is called a model.
- In other words, we want to find a model that fits the data.
- It would be easier to have a “framework” of the model ready and find the model parameters using the data, for example:
- \(f(x) = w_1x + w_0\)
- \(f(x) = w_2x^2 + w_1x + w_0\)
- \(f(x) = \frac{1}{e^{-(w_1x+w_0)} + 1}\)
- The numbers \(w_0\), \(w_1\) and \(w_2\) are called model parameters.
- We often write the model as \(f(x; w)\), stacking all the parameters into a vector \(w\).
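One possible way to express these model frameworks in code is to pass the parameter vector \(w\) alongside the input \(x\). A minimal sketch (the function names are ours, not a fixed API):

```python
import numpy as np

# Each "framework" is a function of the input x and a parameter vector w.
def linear(x, w):
    # f(x; w) = w1*x + w0, with w = [w0, w1]
    return w[1] * x + w[0]

def quadratic(x, w):
    # f(x; w) = w2*x^2 + w1*x + w0, with w = [w0, w1, w2]
    return w[2] * x**2 + w[1] * x + w[0]

def logistic(x, w):
    # f(x; w) = 1 / (e^{-(w1*x + w0)} + 1), with w = [w0, w1]
    return 1.0 / (np.exp(-(w[1] * x + w[0])) + 1.0)

print(linear(1.0, [1.0, 2.0]))  # f(x) = 2x + 1 evaluated at x = 1 -> 3.0
```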
Structure of a dataset
- A dataset contains many data points.
- We can represent each piece of data as \((x_i,y_i)\), \(i = 1,2,3,\cdots\)
- \(x_i\) is called the feature and \(y_i\) is called the label.
- The relationship between \(x_i\), \(y_i\), and the model \(f\) is \(f(x_i) \approx y_i\): the prediction should be close to the label, but it does not have to match it exactly.
- For example, if the weather forecast says it will be 21\(^{\circ}\)C (69.8\(^{\circ}\)F) and it turns out to be 22\(^{\circ}\)C (71.6\(^{\circ}\)F), you won't be yelling at the TV.
How would you fit a line?
Can you find a line that passes through (0, 0) and (1, 1)?
- The “framework” of the model is \(f(x) = w_1x + w_0\)
- The data is \((x = 0, f(x) = y = 0)\) and \((x = 1, f(x) = y = 1)\).
- The process of finding a model to fit the data is to find the values of \(w_1\) and \(w_0\).
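This amounts to solving two linear equations for the two unknowns. A minimal NumPy sketch (assuming NumPy, as in the course demos):

```python
import numpy as np

# Each point (x, y) gives one equation: w0 + w1*x = y
#   x = 0:  w0 + 0*w1 = 0
#   x = 1:  w0 + 1*w1 = 1
A = np.array([[1.0, 0.0],
              [1.0, 1.0]])
y = np.array([0.0, 1.0])

w0, w1 = np.linalg.solve(A, y)
print(w0, w1)   # 0.0 1.0  ->  f(x) = 1*x + 0
```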
How would you fit a quadratic curve?
Can you find a quadratic curve that passes through (0, 0), (1, 1) and (−1, 1)?
- The “framework” of the model is \(f(x) = w_2x^2 + w_1x + w_0\)
- The data is \((x = 0, f(x) = y = 0)\), \((x = 1, f(x) = y = 1)\), and \((x = -1, f(x) = y = 1)\)
- The process of finding a model to fit the data is to find the values of \(w_2\), \(w_1\) and \(w_0\).
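Here there are three equations and three unknowns, so the same approach works with a \(3 \times 3\) system; a minimal NumPy sketch:

```python
import numpy as np

# Columns are [1, x, x^2] for each point; the unknowns are [w0, w1, w2]
A = np.array([[1.0,  0.0, 0.0],    # x = 0
              [1.0,  1.0, 1.0],    # x = 1
              [1.0, -1.0, 1.0]])   # x = -1
y = np.array([0.0, 1.0, 1.0])

w0, w1, w2 = np.linalg.solve(A, y)
print(w0, w1, w2)   # 0.0 0.0 1.0  ->  f(x) = x^2
```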
Is Your Model a Good Fit?
- How would you determine whether your model is a good fit?
- Is there a quantitative way to measure this?
- We now introduce a new notation \(f(x_i) = \hat{y_i}\); the hat (\(\hat{\cdot}\)) indicates that \(\hat{y_i}\) is a prediction of \(y_i\).
Error Functions
- An error function quantifies the discrepancy between your model and the data.
- Error functions are non-negative and approach zero as the model fits the data better.
- Common Error Functions:
- Mean Squared Error: \[ \text{MSE} = \frac{1}{N}\sum^{N}_{i=1}||y_i - \hat{y_i}||^2 \]
- Mean Absolute Error: \[ \text{MAE} = \frac{1}{N}\sum^{N}_{i=1}|y_i - \hat{y_i}| \]
- In later units, we will refer to these as cost functions or loss functions.
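As a quick illustration with made-up numbers (the values below are only for demonstration), both error functions are one line each in NumPy:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # labels y_i (made-up values)
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # predictions y_hat_i

mse = np.mean((y_true - y_pred) ** 2)      # Mean Squared Error
mae = np.mean(np.abs(y_true - y_pred))     # Mean Absolute Error

print(mse, mae)   # 0.375 0.5
```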
Linear Regression
- Linear models: For scalar-valued feature \(x\), this is \(f(x) = w_1x + w_0\)
- One of the simplest machine learning models, yet very powerful.
Least Squares Solution
- Model: \(f(x) = w_1x + w_0\)
- Loss: \[ J(w_0, w_1) = \frac{1}{N}\sum^{N}_{i=1}||y_i - f(x_i)||^2 \]
- Optimization: Find \(w_0\), \(w_1\) such that \(J(w_0, w_1)\) is the least possible value (hence the name “least squares”).
Using Pseudo-Inverse
- For \(N\) data points (\(x_i, y_i\)) we have, \[ \begin{split} \hat{y_1} &= w_0 + w_1x_1 \\ \hat{y_2} &= w_0 + w_1x_2 \\ &\vdots \\ \hat{y_N} &= w_0 + w_1x_N \end{split} \]
- In matrix form we have, \[ \begin{bmatrix}\hat{y_1} \\ \hat{y_2} \\ \vdots \\ \hat{y_N} \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix} \begin{bmatrix}w_0 \\ w_1 \end{bmatrix} \]
- We can write this compactly as \(\hat{Y} = Xw\). We call \(X\) the design matrix.
- We can put the desired labels in matrix form as well: \[ Y = \begin{bmatrix}y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \]
- Our goal is to minimize the error between \(Y\) and \(\hat{Y}\), which can be written as \(||Y - \hat{Y}||^2\)
- Minimizing this error gives the least squares solution \(w = (X^T X)^{-1} X^T Y\), where \((X^T X)^{-1} X^T\) is called the pseudo-inverse of \(X\).
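A minimal NumPy sketch of this computation, on made-up data roughly following \(y = 2x + 1\) (the numbers are for illustration only):

```python
import numpy as np

# Made-up 1-D data roughly following y = 2x + 1 with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix: a column of ones (for w0) next to the feature values (for w1)
X = np.column_stack([np.ones_like(x), x])

# Least squares solution w = (X^T X)^{-1} X^T Y via the pseudo-inverse
w = np.linalg.pinv(X) @ y
w0, w1 = w
print(w0, w1)                      # close to 1 and 2

y_hat = X @ w                      # predictions
print(np.mean((y - y_hat) ** 2))   # the MSE we set out to minimize
```

In practice `np.linalg.lstsq(X, y, rcond=None)` computes the same solution in a more numerically robust way.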
Multilinear Regression
- What if we have multivariate data with \(x\) being a vector?
- Example: \[ x_i = \begin{bmatrix}x_{i1} \\ x_{i2} \end{bmatrix} \]
\[ \begin{split} \hat{y_1} &= w_0 + w_1x_{11} + w_2x_{12} \\ \hat{y_2} &= w_0 + w_1x_{21} + w_2x_{22} \\ &\vdots \\ \hat{y_N} &= w_0 + w_1x_{N1} + w_2x_{N2} \end{split} \]
- The model can be written as: \( \hat{y_i} = \begin{bmatrix}1 & x_{i1} & x_{i2}\end{bmatrix} \begin{bmatrix}w_0 \\ w_1 \\ w_2 \end{bmatrix} \)
- In matrix-vector form: \( \begin{bmatrix} \hat{y_1} \\ \hat{y_2} \\ \vdots \\ \hat{y_N} \end{bmatrix} = \begin{bmatrix}1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{N1} & x_{N2}\end{bmatrix} \begin{bmatrix}w_0 \\ w_1 \\ w_2 \end{bmatrix} \)
- The solution remains the same: \(w = (X^T X)^{-1} X^T Y\)
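The same closed-form formula works with the wider design matrix; a minimal sketch on made-up two-feature data:

```python
import numpy as np

# Made-up data: each row of X_raw holds the two features of one sample
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0],
                  [5.0, 2.5]])
y = np.array([5.0, 4.5, 7.5, 11.0, 11.5])

# Design matrix: a leading column of ones for the intercept w0
X = np.column_stack([np.ones(len(X_raw)), X_raw])

# Closed-form least squares solution w = (X^T X)^{-1} X^T Y
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)       # [w0, w1, w2]
print(X @ w)   # predictions y_hat
```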
Demos
- Vectorized Programming
- Plotting Functions
- Ice-breaker Dataset
- Linear Regression
- Multivariable Linear Regression