Further with ML/DL

NYU K12 STEM Education: Machine Learning (Day 8)

Unsupervised learning

What if we don’t have labelled data for the given task?
The dataset still holds structure; we just don’t have direct access to it
Or what if there is a need to create data?
Examples: Clustering, Generative AI, etc.
Let’s look at some unsupervised models

K-Means Clustering

Works by selecting ’k’ arbitrary centroids for clusters
Euclidean distance is used to assign each point to its nearest centroid
(We can use other measures as well)
Centroids are updated and points are reassigned till convergence
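
The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `kmeans`, the toy blobs, and the choice to pick the starting centroids from the data points are all assumptions made for this example.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Cluster `points` (shape [n, d]) into k groups; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the arbitrary starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Euclidean distance from every point to every centroid: shape [n, k].
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its nearest centroid.
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs (made-up data for illustration).
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
data = np.vstack([blob_a, blob_b])
centroids, labels = kmeans(data, k=2)
```

On data this cleanly separated, each blob should end up in its own cluster. Real datasets are rarely this forgiving, which is where the choice of k and of the starting centroids starts to matter.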

*Figure 2: Clustering Problem Statement*

K-Means Drawbacks

What drawbacks can it have?
- k matters a lot!
- The algorithm depends heavily on the initial centroids
- Categorical data doesn’t have a natural notion of distance or similarity

*Figure 3: K-Means Clustering Drawbacks*

K-Means Evaluation

How do we evaluate the model when we don’t have any labels?
Inertia (\(J\)) measures the sum of squared distances between data points (\(x_i\)) and their assigned cluster centroids (\(\mu_k\)).
Goal: Have low inertia!
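
As a rough sketch, inertia can be computed directly from the points, the centroids, and the cluster assignments. The helper name `inertia` and the four toy points below are assumptions made for this example.

```python
import numpy as np

def inertia(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]  # centroids[labels] looks up each point's centroid
    return float(np.sum(diffs ** 2))

# Four made-up points forming two tight pairs.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# Two clusters: every point sits exactly 1 unit from its centroid.
J_two = inertia(points,
                centroids=np.array([[0.0, 1.0], [10.0, 1.0]]),
                labels=np.array([0, 0, 1, 1]))

# One cluster: a single centroid in the middle is far from every point.
J_one = inertia(points,
                centroids=np.array([[5.0, 1.0]]),
                labels=np.zeros(4, dtype=int))
```

Here `J_two` is 4.0 and `J_one` is 104.0. Inertia tends to shrink as k grows, so it is most useful for comparing runs that share the same k, or for spotting where adding more clusters stops helping much.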

Next Steps

What we have completed

Looked at foundational steps and models
- Regression tasks (fish weight and housing prices)
- Classification tasks (cancer prediction and Iris)
- Deep Neural Networks
- Convolutional Neural Networks

Building on these Topics

The goal with engineering is to take what we know and try to build on it.
In the case of machine learning: can we use a CNN as a backbone to solve more complicated tasks with more complicated models?
The best way to do this is to independently learn from various sources.
What are some resources that we can use?

Resources to build on your learning

Finding Code: GitHub and Machine Learning Mastery
Finding Papers: Conferences such as NeurIPS and ICML constantly publish papers across topics and fields. Papers can also be found on arXiv and SciHub.
Theory: StatQuest, Computerphile and 3Blue1Brown on YouTube.

Supervised Learning

Object Detection

Faster R-CNN
YOLO
- Divides the image into n×n grid cells
- For each grid cell,
  - predicts B bounding boxes, each with a box confidence score
  - Each box has its class probability
  - All class probabilities are combined to detect one object
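
To make the grid idea concrete, the sketch below maps an object's centre to the grid cell that contains it. In YOLO-style detectors that cell is the one expected to predict the object. The function name `responsible_cell` and the 448×448 image with a 7×7 grid are assumptions chosen for illustration.

```python
def responsible_cell(cx, cy, img_w, img_h, n):
    """Return (row, col) of the grid cell containing the point (cx, cy)."""
    col = min(int(cx / img_w * n), n - 1)  # clamp points lying on the right edge
    row = min(int(cy / img_h * n), n - 1)  # clamp points lying on the bottom edge
    return row, col

# Hypothetical 448x448 image split into a 7x7 grid (each cell is 64x64 pixels).
cell = responsible_cell(cx=100, cy=300, img_w=448, img_h=448, n=7)
corner = responsible_cell(cx=448, cy=448, img_w=448, img_h=448, n=7)
```

With these numbers, `cell` is `(4, 1)`: the centre falls in the fifth row and second column of the grid.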

Semantic Segmentation

Every Pixel is associated with a class
Encoder-decoder structure
Decode using transposed convolution (sometimes called deconvolution)
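
A one-dimensional toy version shows how a transposed convolution upsamples: each input value "stamps" a scaled copy of the kernel into a longer output. The function name `transposed_conv1d` and the fixed kernel are assumptions for this sketch; real decoders use learned 2D kernels.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    """Upsample a 1D signal by stamping a scaled kernel at every `stride` steps."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        start = i * stride
        # Overlapping stamps (when kernel is longer than stride) add together.
        out[start:start + len(kernel)] += value * kernel
    return out

x = np.array([1.0, 2.0, 3.0])   # low-resolution feature "row"
kernel = np.array([1.0, 1.0])   # fixed kernel; a trained network would learn this
upsampled = transposed_conv1d(x, kernel, stride=2)
```

The three input values become six output values, `[1, 1, 2, 2, 3, 3]`, which is the kind of size increase a decoder needs to get back to one prediction per pixel.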

Instance Segmentation

Autoencoders

Encoder-Decoder structure
Encoder helps in creating latent representations
Decoder helps in generating outputs from the latent representation
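
The encoder-decoder flow can be sketched with two small matrices. This example only shows the shapes involved: the weights are random and untrained, and the sizes (8 input features, 2 latent features) and the names `encode` and `decode` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, latent_dim = 8, 2

# Randomly initialised (untrained) weights, used only to illustrate shapes.
W_enc = rng.normal(size=(input_dim, latent_dim))
W_dec = rng.normal(size=(latent_dim, input_dim))

def encode(x):
    """Compress inputs into a smaller latent representation."""
    return np.tanh(x @ W_enc)

def decode(z):
    """Generate an output of the original size from the latent representation."""
    return z @ W_dec

x = rng.normal(size=(5, input_dim))  # batch of 5 made-up inputs
z = encode(x)                        # shape (5, 2): the latent representation
x_hat = decode(z)                    # shape (5, 8): the reconstruction

# Training would adjust W_enc and W_dec to make this error small.
reconstruction_error = float(np.mean((x - x_hat) ** 2))
```

For denoising, the same structure would be trained with a noisy input and the clean original as the target.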

Autoencoder for denoising

Variational Autoencoders

Generative Models

Generate images, art, speech. Generation architectures can be modified based on the task at hand.

Benefits and Use Cases

When dataset collection is difficult or expensive (for example, MRI scans).
When there is a limit on available data (with rare cancers, there may not be many positive cases; at the start of COVID, there were few recorded cases).
Various novel applications (for example, generation of art).

GANs: Generative Adversarial Networks

Invented in 2014 by Ian Goodfellow
Goal: generate samples never seen before
How: game between two networks
- Generator Network
- Discriminator Network
Goal of Generator: generate fake samples indistinguishable from real samples
Goal of Discriminator: be able to tell apart real and fake samples
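
The two competing goals can be written as two loss functions. The sketch below assumes the discriminator outputs a probability that a sample is real and uses binary cross-entropy with the commonly used "non-saturating" generator loss. The function names and the example probabilities are made up for illustration.

```python
import numpy as np

def bce(probs, target):
    """Binary cross-entropy between predicted probabilities and a 0/1 target."""
    probs = np.clip(probs, 1e-7, 1 - 1e-7)  # avoid log(0)
    return float(-np.mean(target * np.log(probs) + (1 - target) * np.log(1 - probs)))

def discriminator_loss(d_real, d_fake):
    """Discriminator wants to output 1 on real samples and 0 on fake samples."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    """Generator wants the discriminator to output 1 on its fake samples."""
    return bce(d_fake, 1.0)

d_real = np.array([0.9, 0.8])   # discriminator's scores on real samples
d_caught = np.array([0.1, 0.2]) # scores on fakes it correctly rejects
d_fooled = np.array([0.9, 0.8]) # scores on fakes it mistakes for real

g_loss_caught = generator_loss(d_caught)
g_loss_fooled = generator_loss(d_fooled)
d_loss_caught = discriminator_loss(d_real, d_caught)
d_loss_fooled = discriminator_loss(d_real, d_fooled)
```

When the discriminator is fooled, the generator's loss drops and the discriminator's loss rises, which is the tug-of-war that drives training.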

Applications of GANs

Cats that Don’t Exist

Image Colorization

Image Synthesis

Image Super Resolution

How Would You Use ML/DL?

Think about potential applications with deep learning.
Discuss its social implications.

Can AI/ML be Biased?

At the start, a Neural Network just has randomly initialized weights.
It then trains and backpropagates on a given dataset.
Do our nodes harbor any racism, sexism, homophobia or transphobia?
No!
Neural Networks aren’t sentient.
Neural Networks have no understanding of human emotions, biases or anything else.

Biased model outputs

PULSE is a face depixelizing algorithm, but…

Biases Inherent in Data: CheXpert
- CheXpert is a dataset of medical images in the form of chest X-rays. The dataset is inherently biased because, when looking for rare diseases, most patients test negative.
- More than 90% of samples are negative cases. As a result, a model can assume every patient is disease-free and still have an accuracy of 90%.
- As a result, models aren’t incentivized to learn about underrepresented classes in a dataset.
Biases not Inherent in Data: Celeb-A
- In the case of CheXpert, when studying rare diseases, a patient is more likely not to have the disease than to have it.
- But sometimes, data in the real world isn’t biased but our dataset might be.
- Celeb-A is a great case for such a problem.
- Celeb-A: “traditionally attractive” subjects, predominantly white and cis. Heavy make-up. Potential Photoshop. 4K cameras.
- In the real world: most people are not models. People of colour. Trans and non-binary people. Images aren’t taken on professional cameras with professional make-up.
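
The accuracy trap described for CheXpert is easy to reproduce. The sketch below uses a made-up split of 90 negative and 10 positive cases (not real CheXpert data) and a "model" that always predicts negative.

```python
import numpy as np

# Made-up labels: 10 patients with the disease (1), 90 without (0).
labels = np.array([1] * 10 + [0] * 90)

# A "model" that assumes every patient is disease-free.
predictions = np.zeros_like(labels)

# Fraction of predictions that match the labels.
accuracy = float(np.mean(predictions == labels))

# Recall: of the patients who really have the disease, how many were found?
true_positives = int(np.sum((predictions == 1) & (labels == 1)))
recall = true_positives / int(np.sum(labels == 1))
```

Accuracy comes out at 0.9 while recall is 0.0: the model looks good on paper yet never finds a single sick patient. This is why metrics such as recall are worth checking alongside accuracy on imbalanced data.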

Real World Biases leaking into Machine Learning

Keep in mind: the bias comes from biased data, not from the model having any bigotry.
Bigotry and under-represented data in the real world can leak into machine learning.
- Biases and racism in law enforcement can leak into model predictions. This only furthers existing inequity. (See: AI in law enforcement.)
- Datasets are often predominantly white and cishet masculine, because neither the sources of the data nor the engineers building the datasets realize the importance of diversity in datasets.

Insidious Effects on Machine Learning performance

(From the article Design AI so that it’s fair)

When Google Translate converts news articles written in Spanish into English, phrases referring to women often become ‘he said’ or ‘he wrote’.
Software designed to warn people using Nikon cameras when the person they are photographing seems to be blinking tends to interpret Asians as always blinking.
Google has misclassified photos of people as gorillas.
The chatbot “Tay”, trained on data from tweets, learned to be racist and sexist as a result of the sheer number of bigoted Twitter users.

How do we solve these issues?

Safety of AI

Boston Dynamics Parkour Atlas: What machine learning algorithms might have been used here?

The same model can have drastically different performance for different hyper-parameters.
100% accuracy is rarely achieved on unseen data.
Should we let a medical robot with a CNN-based vision system perform surgery autonomously?
If a self-driving car crashes and hurts people, who should be responsible for it?

Carbon Footprint of Deep Learning

Course Takeaway

ML is the combination of math and computer science. We’ve only shown you a subsection.
Supervised Learning: Linear/Logistic Regression and Neural Networks
Deep learning has wide applications, but we are also responsible for its consequences. The greater the power, the greater the responsibility!

Demos

Customer Spending Clustering