Linear Classification
NYU K12 STEM Education: Machine Learning (Day 4)
Classification
Given the dataset \((x_i, y_i)\) for \(i = 1, 2, \cdots, N\), find a function \(f(x)\) (the model) that predicts the label \(\hat{y}\) for an input \(x\), even if that input is not in the dataset, i.e. \(\hat{y} = f(x)\)
- Positive : \(y = 1\)
- Negative : \(y = 0\)
Decision Boundary
- Evaluation metric:
\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \]
Need for a new model
- What would happen if we used the linear regression model:
\[ \hat{y} = w_0 + w_1x \]
- \(y\) is \(0\) or \(1\)
- \(\hat{y}\) will take any value between \(-\infty\) and \(\infty\)
- It will be hard to find \(w_0\) and \(w_1\) that make the prediction \(\hat{y}\) match the label \(y\)
Sigmoid Function
- By applying the sigmoid function, we enforce \(0 \le \hat{y} \le 1\) (a numerical sketch follows the formula):
\[ \hat{y} = \text{sigmoid}(w_0 + w_1x) = \frac{1}{1 + e^{-(w_0 + w_1x)}} \]
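A minimal numerical sketch of this mapping, assuming NumPy is available; the weights `w0` and `w1` below are hypothetical values chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, chosen only to illustrate the squashing effect.
w0, w1 = -1.0, 2.0
x = np.array([-2.0, 0.0, 0.5, 3.0])
y_hat = sigmoid(w0 + w1 * x)
print(y_hat)  # every value lies strictly between 0 and 1
```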
Binary Cross Entropy Loss
- Binary Cross Entropy Loss:
\[ \text{Loss} = \frac{1}{N} \sum^{N}_{i=1} \left[ -y_i \log \hat{y_i} - (1 - y_i) \log (1 - \hat{y_i}) \right] \]
- If \(y_i = 0\):
\[ \left[ -y_i \log \hat{y_i} - (1 - y_i) \log (1 - \hat{y_i}) \right] = - \log (1 - \hat{y_i}) \]
- If \(y_i = 1\):
\[ \left[ -y_i \log \hat{y_i} - (1 - y_i) \log (1 - \hat{y_i}) \right] = - \log (\hat{y_i}) \]
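A short sketch of how this loss could be computed, assuming NumPy; the labels and scores below are made-up, and the small `eps` clip is an implementation detail (to avoid `log(0)`), not part of the formula above:

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-12):
    """Binary cross entropy averaged over the N examples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep log() finite
    return np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1, 0])              # labels
y_hat = np.array([0.9, 0.2, 0.6, 0.4])  # sigmoid outputs
print(bce_loss(y, y_hat))
```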
MSE vs Binary Cross Entropy Loss
- The MSE loss of a sigmoid (logistic) model is non-convex and can have many local minima
- The Binary Cross Entropy loss is convex, so it has a single global minimum
Classifier
\[ \hat{y} = \text{sigmoid}(w_0 + w_1x) = \frac{1}{1 + e^{-(w_0 + w_1x)}} \]
- How to deal with uncertainty?
- Thanks to the sigmoid, \(\hat{y} = f (x)\) is between \(0\) and \(1\)
- If \(\hat{y}\) is close to \(0\), the data is probably negative
- If \(\hat{y}\) is close to \(1\), the data is probably positive
- If \(\hat{y}\) is around \(0.5\), we are not sure.
Decision Boundary
- Once we have a classifier that outputs a score \(0 < \hat{y} < 1\), we need a decision rule (a sketch follows this list).
- Let \(0 < t < 1\) be a threshold:
- If \(\hat{y} > t\), the input is classified as positive
- If \(\hat{y} \le t\), the input is classified as negative
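A minimal sketch of this thresholding rule, assuming NumPy; the scores are hypothetical:

```python
import numpy as np

def classify(y_hat, t=0.5):
    """Turn scores in (0, 1) into hard 0/1 class labels using threshold t."""
    return (y_hat > t).astype(int)

scores = np.array([0.10, 0.49, 0.51, 0.95])
print(classify(scores))          # t = 0.5 -> [0 0 1 1]
print(classify(scores, t=0.9))   # stricter threshold -> [0 0 0 1]
```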
Performance metrics for a classifier:
- Accuracy of a classifier: percentage of correct classification
- Why is accuracy alone not a good measure for assessing a model?
- Example: A rare disease occurs in 1 in ten thousand people
- A test that classifies everyone as free of the disease achieves 99.99% accuracy when tested on people drawn randomly from the entire population
Types of Errors in Classification
- Correct predictions:
- True Positive (TP) : Predict \(\hat{y} = 1\) when \(y = 1\)
- True Negative (TN) : Predict \(\hat{y} = 0\) when \(y = 0\)
- Two types of errors:
- False Positive / False Alarm (FP): Predict \(\hat{y} = 1\) when \(y = 0\)
- False Negative / Missed Detection (FN): Predict \(\hat{y} = 0\) when \(y = 1\)
Metrics:
- Sensitivity/Recall/TPR (How many positives are detected among all positives?)
\[ \frac{\text{TP}}{\text{TP} + \text{FN}} \]
- Precision (How many detected positives are actually positive?)
\[ \frac{\text{TP}}{\text{TP} + \text{FP}} \]
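A small sketch computing both metrics directly from the counts TP, FP, FN, assuming NumPy; the labels and predictions are made-up:

```python
import numpy as np

def precision_recall(y, y_pred):
    """Precision and recall from true labels y and hard predictions y_pred."""
    tp = np.sum((y_pred == 1) & (y == 1))
    fp = np.sum((y_pred == 1) & (y == 0))
    fn = np.sum((y_pred == 0) & (y == 1))
    return tp / (tp + fp), tp / (tp + fn)

y      = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
print(precision_recall(y, y_pred))  # (0.666..., 0.666...)
```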
Multiclass Classification
- Previous Model:
\[ f(x) = \sigma(\phi(x)w) \]
- Representing Multiple Classes:
- One-hot / 1-of-K vectors, e.g. 4 classes (an encoding sketch follows this list):
- Class 1 : \(y = [1,0,0,0]\)
- Class 2 : \(y = [0,1,0,0]\)
- Class 3 : \(y = [0,0,1,0]\)
- Class 4 : \(y = [0,0,0,1]\)
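A minimal sketch of building such 1-of-K vectors, assuming NumPy and 0-indexed labels (Class 1 above corresponds to index 0):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Convert integer class labels (0-indexed) into 1-of-K row vectors."""
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1
    return Y

print(one_hot([0, 2, 1], num_classes=4))
# [[1. 0. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 1. 0. 0.]]
```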
- Multiple outputs:
\[ f(x) = \text{softmax}(\phi(x)W) \]
- Shape of \(\phi(x)W\): \((N, K) = (N, D) \times (D, K) \)
- Softmax:
\[ \text{softmax}(z_k) = \frac{e^{z_k}}{\sum_j e^{z_j}} \]
- Softmax Example:
\[ z = \begin{bmatrix}-1 \\ 2 \\ 1 \\ -4 \end{bmatrix} \]
\[ \text{softmax}(z) = \begin{bmatrix}\frac{e^{-1}}{e^{-1} + e^{2} + e^{1} + e^{-4}} \\ \frac{e^{2}}{e^{-1} + e^{2} + e^{1} + e^{-4}} \\ \frac{e^{1}}{e^{-1} + e^{2} + e^{1} + e^{-4}} \\ \frac{e^{-4}}{e^{-1} + e^{2} + e^{1} + e^{-4}} \end{bmatrix} \approx \begin{bmatrix}0.035 \\ 0.704 \\ 0.259 \\ 0.002 \end{bmatrix} \]
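A short sketch reproducing this example, assuming NumPy; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize so the entries sum to 1."""
    e = np.exp(z - np.max(z))  # stability trick; the ratios are unchanged
    return e / e.sum()

z = np.array([-1.0, 2.0, 1.0, -4.0])
print(softmax(z))  # approx [0.035, 0.704, 0.259, 0.002], as above
```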
Cross-entropy
- Multiple outputs: \( \hat{y_i} = \text{softmax}(\phi(x_i)W)\)
- Cross-Entropy:
\[ J(W) = - \sum^{N}_{i=1} \sum^{K}_{k=1} y_{ik} \log (\hat{y_{ik}}) \]
- Example: with \(K = 4\), if \(y_i = [0, 0, 1, 0]\), then
\[ \sum^{K}_{k=1} y_{ik} \log (\hat{y_{ik}}) = \log (\hat{y_{i3}}) \]
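A minimal sketch of this loss on the single example above, assuming NumPy and reusing the softmax output from the previous slide; the `eps` clip is an implementation detail to avoid `log(0)`:

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Cross-entropy J(W): Y is the (N, K) one-hot label matrix,
    Y_hat the (N, K) matrix of softmax outputs."""
    return -np.sum(Y * np.log(np.clip(Y_hat, eps, 1.0)))

Y     = np.array([[0, 0, 1, 0]])                   # y_i: class 3, one-hot
Y_hat = np.array([[0.035, 0.704, 0.259, 0.002]])   # softmax output from above
print(cross_entropy(Y, Y_hat))  # = -log(0.259) ≈ 1.35
```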