In Fall 2022, this course is taught by:
The videos below were recorded by Dr. Papernot but are applicable to both sections of the course.
You will find below a playlist of videos recorded for this course. Click on a video's thumbnail to play it and/or maximize it. Next to each video, you can find either the slides or the notes that I used to record it, as well as a list of optional readings. I deliberately selected many reading options so that you can explore different authors and find an exposition of machine learning that best matches your background and preparation for this course. You do not have to read every book chapter or watch every video linked in the reading list; rather, the list should help you find the style of material that works best for you and use it to better understand the content of my video recordings.
Background material: this is a list of links (list being populated) to help you refresh some of the background required to understand the content of the ML videos below.
Tutorial material: this is a list of links to notes that will serve as the basis for tutorials this semester.
Interactive demos: this is a list of links to interactive demos that will help you form an intuition for algorithms we cover in this course.
An introduction to what distinguishes machine learning from more traditional computer programs. Brief discussion of the different forms of ML: unsupervised, supervised, and reinforcement learning.
An introduction to unsupervised learning. Includes examples of clustering, dimensionality reduction, and data visualization. These three applications motivate the two techniques presented in the rest of the module.
An introduction to the k-means algorithm, with the cluster assignment and centroid update steps. Concludes with an example showing how data that does not contain well-separated clusters opens up a few questions on the solutions found by the k-means algorithm. These are addressed in the next video.
The objective function behind the k-means algorithm allows us to choose better strategies for initializing the cluster centroids. We also go over two strategies for picking the number k of centroids the k-means algorithm is looking for.
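For illustration, here is a minimal NumPy sketch of the two k-means steps and of the objective; the variable names and the initialization by sampling random training points are my own choices, not necessarily those used in the video.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # X: (n, d) data matrix; k: number of clusters.
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct training points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Cluster assignment step: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Centroid update step: each centroid becomes the mean of its points.
        # (Empty clusters are not handled in this sketch.)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Objective (distortion): sum of squared distances to assigned centroids.
    objective = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, objective

One common strategy the objective enables is to run the algorithm from several random initializations and keep the solution with the lowest objective value.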
This video covers PCA for dimensionality reduction. How to use SVD to perform PCA? How to compute the reconstruction error of PCA? How to choose the number of components?
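A minimal sketch of PCA via the SVD, including the average reconstruction error; the notation is mine and the slides may organize the computation differently.

import numpy as np

def pca(X, n_components):
    # Center the data so the principal components pass through the mean.
    mu = X.mean(axis=0)
    Xc = X - mu
    # SVD of the centered data: the rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components]        # (n_components, d) projection matrix
    Z = Xc @ W.T                 # low-dimensional code for each example
    X_hat = Z @ W + mu           # reconstruction from the code
    # Average squared reconstruction error; plotting it against n_components
    # is one way to choose the number of components.
    reconstruction_error = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    return Z, X_hat, reconstruction_error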
Optional reading: https://stats.stackexchange.com/a/140579
Supervised learning, linear regression, mean squared loss, direct solution through critical points
How to vectorize linear regression over multiple features per input? Also introduces the matrix notation required to perform linear regression over an entire dataset (vectorization across multiple examples) and compute the average cost over the dataset.
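As a rough sketch of the vectorized, direct solution: the helper names and the convention of appending a column of ones to fold the bias into the weight vector are my own, not necessarily those of the video.

import numpy as np

def fit_linear_regression(X, t):
    # X: (n, d) matrix of inputs, t: (n,) vector of targets.
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias column
    # Direct least-squares solution (equivalent to solving the normal equations).
    w, *_ = np.linalg.lstsq(X1, t, rcond=None)
    return w

def average_cost(X, t, w):
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    y = X1 @ w                    # predictions for the whole dataset at once
    return 0.5 * np.mean((y - t) ** 2)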
How to turn a linear regression model into a binary linear classification model using a threshold function? How to visualize simple binary functions (e.g., not, and, xor) in input and weight spaces to forge an intuition as to why they are linearly separable (e.g., not, and) or why they are not linearly separable (e.g., xor)?
A more in-depth treatment of the non-linearly separable nature of XOR using a proof by contradiction and notions of convexity.
How to use the perceptron learning rule to set the weights of a binary linear classifier?
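A minimal sketch of the perceptron learning rule, assuming labels in {-1, +1} and a bias folded into the weight vector; the video may use a different label convention.

import numpy as np

def perceptron(X, t, n_epochs=100):
    # X: (n, d) inputs with a column of ones appended for the bias.
    # t: labels in {-1, +1}.
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, t_i in zip(X, t):
            # Update only when the current example is misclassified.
            if t_i * (w @ x_i) <= 0:
                w = w + t_i * x_i
    return w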
Geometric interpretation of the max-margin hyperplane, max-margin as an optimization problem.
Polynomial feature mappings for linear regression, under/overfitting, hyperparameters, generalization
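A small sketch of a polynomial feature mapping for a scalar input; the resulting feature matrix can be fed to the least-squares fit sketched earlier, and the degree is a hyperparameter whose value controls under/overfitting.

import numpy as np

def polynomial_features(x, degree):
    # Map each scalar input x to the feature vector (1, x, x^2, ..., x^degree).
    return np.stack([x ** d for d in range(degree + 1)], axis=1)

# The degree is a hyperparameter: too low underfits, too high overfits,
# so it is typically chosen by measuring generalization on held-out data.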
Slack variables for SVMs on non-linearly separable data or data that contains outliers.
Intuition for kernels based on similarity between points, Gaussian kernels, implications of hyperparameter choice for under/overfitting.
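A minimal sketch of the Gaussian (RBF) kernel matrix between two sets of points; sigma denotes the bandwidth hyperparameter.

import numpy as np

def gaussian_kernel(X, Z, sigma):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2)), computed for all pairs.
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2 * sigma ** 2))

A very small sigma makes the similarity extremely local and tends toward overfitting, while a very large sigma makes all points look similar and tends toward underfitting.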
Introduces the concept of steepest descent to iteratively optimize the values of model parameters with the help of a loss function. The update rule for gradient descent is explained using linear regression and least squares.
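A minimal sketch of the gradient descent update rule for linear regression with the least-squares cost; the learning rate and iteration count are illustrative values, and X is assumed to already contain a bias column.

import numpy as np

def gradient_descent(X, t, lr=0.01, n_steps=1000):
    # Least-squares cost: J(w) = 1/(2n) * ||X w - t||^2.
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (X @ w - t) / n   # gradient of J with respect to w
        w = w - lr * grad              # steepest-descent update rule
    return w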
An interpretation of the least squares loss in the context of linear regression, based on maximum likelihood.
Compares the 0-1 loss, the squared-error loss, and cross-entropy. Also introduces the logistic function as an activation function.
Introduces the logistic cross-entropy to avoid the numerical instabilities of the cross-entropy loss applied to logistic regression. Full derivation of the update rule for gradient descent on logistic regression, for both regression and classification.
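A rough sketch of a numerically stable logistic cross-entropy, computed directly from the logits, together with the corresponding gradient step; targets are assumed to be in {0, 1} and the function names are mine.

import numpy as np

def logistic_cross_entropy(z, t):
    # z: logits (X @ w), t: targets in {0, 1}.
    # Equivalent to -t*log(sigmoid(z)) - (1-t)*log(1-sigmoid(z)),
    # but computed from the logits so it remains numerically stable.
    return np.logaddexp(0.0, z) - t * z

def gradient_step(w, X, t, lr):
    z = X @ w
    y = 1.0 / (1.0 + np.exp(-z))        # logistic (sigmoid) activation
    grad = X.T @ (y - t) / X.shape[0]   # gradient of the average loss
    return w - lr * grad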
Generalization of a binary classifier to the multiclass setting, introduction of the softmax activation and the multiclass variant of cross-entropy.
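A minimal sketch of the softmax activation and the multiclass cross-entropy loss, assuming one-hot targets.

import numpy as np

def softmax(z):
    # Subtract the max logit for numerical stability, then normalize.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multiclass_cross_entropy(z, t_onehot):
    # z: (n, K) logits, t_onehot: (n, K) one-hot targets.
    log_probs = z - z.max(axis=-1, keepdims=True)
    log_probs = log_probs - np.log(np.exp(log_probs).sum(axis=-1, keepdims=True))
    return -np.mean(np.sum(t_onehot * log_probs, axis=-1))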
Introduces the concepts of layers (composition of abstractions), activation functions, and the multilayer perceptron. Also illustrates the expressive power of a deep neural network on XOR.
Introduces the concept of expressive power for a deep neural network and the universal approximation theorem. Outlines how expressive power stems from activation functions and depth rather than width.
Intuition behind backpropagation based on the chain rule. Introduces the concepts of a computational graph, forward and backward passes, and the backpropagation algorithm itself. Examples on linear regression and a multilayer neural network (with and without vectorized notation).
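A rough sketch of the forward and backward passes for a one-hidden-layer network with a logistic hidden activation and a squared-error loss; the shapes and variable names are mine and differ from the notation in the video.

import numpy as np

def mlp_forward_backward(x, t, W1, b1, W2, b2):
    # Forward pass, vectorized over a batch x of shape (n, d).
    z1 = x @ W1 + b1
    h = 1.0 / (1.0 + np.exp(-z1))        # logistic hidden activation
    y = h @ W2 + b2
    loss = 0.5 * np.sum((y - t) ** 2) / x.shape[0]

    # Backward pass: apply the chain rule from the loss back to each parameter.
    dy = (y - t) / x.shape[0]
    dW2 = h.T @ dy
    db2 = dy.sum(axis=0)
    dh = dy @ W2.T
    dz1 = dh * h * (1 - h)               # derivative of the logistic activation
    dW1 = x.T @ dz1
    db1 = dz1.sum(axis=0)
    return loss, (dW1, db1, dW2, db2)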
Why the optimization problem solved in deep learning is not convex, and how to deal with saddle points, plateaux, and ravines with optimizers that take curvature into account. Finishes with a discussion of how to tune the learning rate.
Introduces the concepts of minibatch sampling and stochastic gradient descent (SGD).
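A minimal sketch of a minibatch SGD loop; grad_fn is a hypothetical helper, assumed to return the gradient of the loss estimated on a minibatch.

import numpy as np

def sgd(params, grad_fn, X, t, lr=0.1, batch_size=32, n_epochs=10, seed=0):
    # grad_fn(params, X_batch, t_batch) -> gradient estimate on the minibatch.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(n_epochs):
        perm = rng.permutation(n)                  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]   # sample a minibatch
            params = params - lr * grad_fn(params, X[idx], t[idx])
    return params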
Overfitting, train/validation/test splits. Qualitative training/test curves as a function of the number of training examples, number of parameters, and number of epochs.
Bayes error (minimal risk), and bias-variance decomposition, visualization and connection to underfitting and overfitting.
Introduces techniques for reducing overfitting in neural network training: data augmentation, decreasing the number of parameters, weight decay, early stopping, ensembles, dropout.
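As a rough sketch of two of these techniques, here are inverted dropout and L2 weight decay applied to a gradient; details such as the rescaling convention may differ from the video.

import numpy as np

def dropout(h, p_drop, rng, training=True):
    # During training, zero each activation with probability p_drop and rescale
    # the rest ("inverted dropout") so the expected activation is unchanged.
    if not training:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

def weight_decay_gradient(w, grad, lam):
    # L2 regularization adds lam * w to the gradient of the data loss.
    return grad + lam * w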
Introduction to the convolution operation and layer.
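A minimal sketch of the operation applied by a convolutional layer to a single-channel image with a single filter; as in most deep learning libraries, this is technically a cross-correlation (no kernel flip).

import numpy as np

def conv2d(image, kernel):
    # "Valid" convolution: slide the kernel over every position where it fits
    # entirely inside the image and take a weighted sum at each position.
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out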
Building blocks for convolutional neural networks: convolutional layers, pooling layers. Examples of convolutional neural networks on MNIST and ImageNet.
Recurrent architectures, backpropagation through time.
Language modeling, neural machine translation with encoder-decoder architectures.
Attention-based models for machine translation.
How gradients can explode or vanish when training RNNs with backpropagation through time. Solutions with gradient clipping and the LSTM architecture.
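A minimal sketch of gradient clipping by norm, the first of the two solutions mentioned above; max_norm is the clipping threshold.

import numpy as np

def clip_gradient(grad, max_norm):
    # Rescale the gradient when its norm exceeds max_norm,
    # leaving its direction unchanged.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad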
Acknowledgments. Material used in this course is adapted from several prior iterations of similar courses taught by others. This includes CSC321 by Prof. Grosse and Coursera's ML course by Prof. Ng.