Neural Networks and Deep Learning
Hui Lin @Google
A Little Bit of History – Perceptron
- First concept of a simplified brain cell (1943): the McCulloch-Pitts (MCP) neuron
- Frank Rosenblatt published the first concept of the perceptron learning rule based on the MCP neuron (1957)
- Classification of N points into 2 classes: -1 and +1 (i.e., two different colors in the picture below)
Fun video: https://www.youtube.com/watch?v=cNxadbrN_aI
Two features (\(x_1\) and \(x_2\)); a linear function separates the classes; find (\(w_0\), \(w_1\), \(w_2\)) such that:
\[z^{(i)} = w_0 + w_1x_1^{(i)} + w_2x_2^{(i)}\]
\[pred^{(i)}=\begin{cases}
1 & if\ z^{(i)}>0\\
-1 & if\ z^{(i)}\le0
\end{cases}\]
Perceptron Algorithm
Start with random weights. Set a maximum number of epochs M. For each epoch, loop over the (permuted) training samples and, for each sample \(i\), update the weights:
\[w_0 = w_0 + \eta(actual^{(i)} - pred^{(i)})\]
\[w_1 = w_1 + \eta(actual^{(i)} - pred^{(i)})x_1^{(i)}\] \[w_2 = w_2 + \eta(actual^{(i)} - pred^{(i)})x_2^{(i)}\]
- After each epoch, calculate the accuracy on the entire dataset to check whether it meets the stopping criterion.
Perceptron Algorithm
- The perceptron algorithm is easy to implement in any modern programming language (a minimal R sketch follows below).
- It is a linear classification function, and the weights are updated after each data point is fed to the algorithm (a concept similar to stochastic gradient descent).
- The algorithm continues to update as we feed the same dataset again and again (i.e., epochs).
- It cannot solve problems that are not linearly separable.
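A minimal base-R sketch of the perceptron learning rule above; the function name `perceptron` and the stopping check are illustrative, not from the slides.

```r
# x: m x 2 matrix of features (x1, x2); y: length-m vector of labels in {-1, +1}
perceptron <- function(x, y, eta = 0.1, M = 100) {
  w <- runif(3, -1, 1)                          # random initial weights (w0, w1, w2)
  for (epoch in seq_len(M)) {
    for (i in sample(nrow(x))) {                # permute the samples each epoch
      z    <- w[1] + w[2] * x[i, 1] + w[3] * x[i, 2]
      pred <- ifelse(z > 0, 1, -1)
      w[1] <- w[1] + eta * (y[i] - pred)        # bias weight w0
      w[2] <- w[2] + eta * (y[i] - pred) * x[i, 1]
      w[3] <- w[3] + eta * (y[i] - pred) * x[i, 2]
    }
    pred_all <- ifelse(w[1] + x %*% w[2:3] > 0, 1, -1)
    if (mean(pred_all == y) == 1) break         # stop once every point is classified correctly
  }
  w
}
```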
Logistic Regression as A Neural Network
- m training samples: \(\{(x^{(1)}, y^{(1)}),(x^{(2)}, y^{(2)}),...,(x^{(m)}, y^{(m)})\}\)
\[X=\left[\begin{array}{cccc}
x_{1}^{(1)} & x_{1}^{(2)} & \dotsb & x_{1}^{(m)}\\
x_{2}^{(1)} & x_{2}^{(2)} & \dotsb & x_{2}^{(m)}\\
\vdots & \vdots & \ddots & \vdots\\
x_{n_{x}}^{(1)} & x_{n_{x}}^{(2)} & \dotsb & x_{n_{x}}^{(m)}
\end{array}\right]\in\mathbb{R}^{n_{x}\times m}\]
\[y=[y^{(1)},y^{(2)},\dots,y^{(m)}] \in \mathbb{R}^{1 \times m}\]
\(\hat{y}^{(i)} = \sigma(w^Tx^{(i)} + b)\) where \(\sigma(z) = \frac{1}{1+e^{-z}}\)
- Loss function: \(L(\hat{y},y) = -y\log(\hat{y})-(1-y)\log(1-\hat{y})\)
- Cost function: \(J(w,b)=\frac{1}{m} \Sigma_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) = \frac{1}{m} \Sigma_{i=1}^{m} \{ -y^{(i)}\log(\hat{y}^{(i)})-(1-y^{(i)})\log(1-\hat{y}^{(i)}) \}\)
- Goal: Find \(w,b\) that minimize \(J(w,b)\)
- \(w := w - \alpha \frac{\partial J}{\partial w}\)
- \(b := b - \alpha \frac{\partial J}{\partial b}\)
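A minimal base-R sketch of these updates for logistic regression, using the standard cross-entropy gradients \(dw=\frac{1}{m}X(\hat{y}-y)^T\) and \(db=\frac{1}{m}\Sigma_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})\); the function name and default settings are illustrative.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# X: n_x x m feature matrix; y: length-m vector of 0/1 labels
logistic_gd <- function(X, y, alpha = 0.1, n_iter = 1000) {
  w <- rep(0, nrow(X)); b <- 0; m <- ncol(X)
  for (iter in seq_len(n_iter)) {
    y_hat <- sigmoid(as.vector(t(w) %*% X) + b)  # forward pass
    dz    <- y_hat - y                           # derivative of the loss w.r.t. z
    w <- w - alpha * as.vector(X %*% dz) / m     # w := w - alpha * dJ/dw
    b <- b - alpha * sum(dz) / m                 # b := b - alpha * dJ/db
  }
  list(w = w, b = b)
}
```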
Forward Propagation
Backward Propagation
Stochastic Gradient Descent (SGD)
Neural Network: 0 Hidden Layer Neural Network
Neural Network: 1 Hidden Layer Neural Network
Neural Network: 1 Layer Neural Network
Neural Network: 1 Hidden Layer Neural Network
Across m Samples
MNIST Dataset
- Contains 70,000 labeled handwritten digit images (60,000 training + 10,000 testing)
- The digits were written by Census Bureau employees and American high school students
- Each image is 28x28 pixels in greyscale
- In the 1990s, Yann LeCun used the convolutional network LeNet to achieve an error rate below 1%
Image Data
1 Hidden Layer Neural Network Example
Forward and Backward Propagation
Deep Neural Network
Batch, Mini-batch, Stochastic Gradient Descent
- Mini-batch size = m: batch gradient descent, too long per iteration
- Mini-batch size = 1: stochastic gradient descent, lose speed from vectorization
- Mini-batch size in between: mini-batch gradient descent, which makes progress without processing the entire training set; typical batch sizes are \(2^6=64\), \(2^7=128\), \(2^8=256\), \(2^9=512\) (a splitting sketch follows the equations below)
\[x=[\underbrace{x^{(1)},x^{(2)},\cdots,x^{(1000)}}_{\text{mini-batch }1}\mid\cdots\mid\cdots x^{(m)}]\in\mathbb{R}^{n_{x}\times m}\]
\[y=[\underbrace{y^{(1)},y^{(2)},\cdots,y^{(1000)}}_{\text{mini-batch }1}\mid\cdots\mid\cdots y^{(m)}]\in\mathbb{R}^{1\times m}\]
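A minimal base-R sketch of splitting the columns of \(x\) and \(y\) into mini-batches before each epoch; the batch size of 64 and the helper name are illustrative.

```r
# X: n_x x m feature matrix; y: length-m response vector
make_mini_batches <- function(X, y, batch_size = 64) {
  idx    <- sample(ncol(X))                                  # shuffle samples before splitting
  chunks <- split(idx, ceiling(seq_along(idx) / batch_size))
  lapply(chunks, function(s) list(X = X[, s, drop = FALSE], y = y[s]))
}
# Each epoch then loops over the mini-batches, doing one gradient update per batch.
```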
Batch, Mini-batch, Stochastic Gradient Descent
Activation Functions
- Representational power: combinations of linear and non-linear transformations
- Some aspects that we can consider:
- input and output range
- gradients at initialization (usually small values)
- gradients at extremes
- computational complexity
- Bad news: no single activation function is best for every problem.
- Good news: you can start from common use cases.
Activation Functions
- Intermediate layers
- ReLU (i.e., rectified linear unit) is usually a good choice; it has the following desirable properties:
- fast computation;
- non-linear;
- reduced likelihood of vanishing gradients;
- unconstrained response (the output is not bounded above)
- Sigmoid, widely studied in the past, is not as good as ReLU in deep learning because of the vanishing gradient problem when there are many layers
- hyperbolic tangent function (tanh)
- Last layer which connects to the output
- Binary classification: sigmoid with binary cross entropy as loss function
- Multiple class, single-label classification: softmax with categorical cross entropy for loss function
- Continuous responses: identity function (i.e. y = x)
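A minimal base-R sketch of the activation functions discussed above:

```r
relu    <- function(z) pmax(0, z)                            # fast, non-linear, unbounded above
sigmoid <- function(z) 1 / (1 + exp(-z))                     # output in (0, 1); binary output layer
tanh_z  <- function(z) tanh(z)                               # output in (-1, 1); built into base R
softmax <- function(z) { e <- exp(z - max(z)); e / sum(e) }  # scores to multi-class probabilities
```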
Deal with Overfitting: Regularization
For logistic regression,
\[\underset{w,b}{min}J(w,b)= \frac{1}{m} \Sigma_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) + penalty\]
where
\[L_2\ penalty=\frac{\lambda}{2}\parallel w \parallel_2^2 = \frac{\lambda}{2}\Sigma_{i=1}^{n_x}w_i^2\] \[L_1\ penalty = \lambda\Sigma_{i=1}^{n_x}|w_i|\]
For a neural network,
\[J(w^{[1]},b^{[1]},\dots,w^{[L]},b^{[L]})=\frac{1}{m}\Sigma_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)}) + \frac{\lambda}{2}\Sigma_{l=1}^{L} \parallel w^{[l]} \parallel^2_F\] where \(\parallel w^{[l]} \parallel^2_F = \Sigma_{i=1}^{n^{[l]}}\Sigma_{j=1}^{n^{[l-1]}} (w^{[l]}_{ij})^2\)
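A hedged sketch of attaching such a weight penalty to one layer with the keras R package; the layer size and the penalty value 0.001 are illustrative assumptions.

```r
library(keras)

# A dense layer whose weights carry an L2 penalty added to the cost function;
# regularizer_l1() would give the L1 penalty instead
hidden <- layer_dense(
  units = 128, activation = "relu",
  kernel_regularizer = regularizer_l2(l = 0.001)
)
```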
Deal with Overfitting: Dropout
Deal with Overfitting
- With a huge number of parameters, there is potential for overfitting even with a large amount of training data
- Overfitting due to size of the NN (i.e. total number of parameters)
- Overfitting due to using the training data for too many epochs
- Solution for overfitting due to NN size
- Dropout: randomly drop out some proportion (such as 0.3 or 0.5) of the nodes at each layer, a concept similar to the randomness used in random forests (see the keras sketch after this list)
- Use L1 or L2 regularization on the layer weights
- Solution for overfitting due to using too many epochs
- Run the NN for a large number of epochs so it reaches the overfitting region, and examine the training/validation performance vs. epoch curve
- Choose, as the final NN model, the number of epochs at which validation performance is best (e.g., minimum validation loss or maximum validation accuracy)
- The optimal number of epochs is thus determined by where the model is not yet overfitted (i.e., where validation accuracy reaches its best value)
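A hedged keras (R) sketch of dropout; the 0.3 rate and layer sizes are illustrative assumptions.

```r
library(keras)

# Randomly drop 30% of the preceding layer's nodes during each training pass
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")
```

To pick the number of epochs, fit with a validation split (e.g., `validation_split = 0.2` in `fit()`) and plot the returned history object to see where validation performance peaks.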
Exponentially Weighted Averages
Suppose we have the following 100 days’ temperature data:
\[ \theta_{1}=49F, \theta_{2}=53F, \dots, \theta_{99}=70F, \theta_{100}=69F\]
The weighted average is defined as:
\[V_t = \beta V_{t-1}+(1-\beta)\theta_t\]
And we have:
\[\begin{array}{c} V_{0}=0\\ V_{1}=\beta V_0 + (1-\beta)\theta_1\\ V_2=\beta V_1 + (1-\beta)\theta_2\\ \vdots \\ V_{100}= \beta V_{99} + (1-\beta)\theta_{100} \end{array}\]
Corrected Exponentially Weighted Averages
\[V_t^{corrected} = \frac{V_t}{1-\beta^t}\]
For example,
\[\begin{array}{cc} \beta=0.95\\ V_{0}=0\\ V_{1}=0.05\theta_{1} & V_{1}^{corrected}=\frac{V_{1}}{1-0.95}=\theta_{1}\\ V_{2}=0.95V_{1}+0.05\theta_{2}=0.0475\theta_{1}+0.05\theta_{2} & V_{2}^{corrected}=\frac{V_{2}}{1-0.95^{2}}=0.4872\theta_{1}+0.5128\theta_{2} \end{array}\]
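A minimal base-R sketch of the exponentially weighted average with bias correction; the function name is illustrative.

```r
# theta: a numeric vector of daily temperatures; beta: the weighting parameter
ewa <- function(theta, beta = 0.95) {
  V <- V_corrected <- numeric(length(theta))
  v_prev <- 0                                 # V_0 = 0
  for (t in seq_along(theta)) {
    v_prev         <- beta * v_prev + (1 - beta) * theta[t]
    V[t]           <- v_prev
    V_corrected[t] <- v_prev / (1 - beta^t)   # bias correction for the early time steps
  }
  data.frame(t = seq_along(theta), V = V, V_corrected = V_corrected)
}
```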
Momentum
- The momentum algorithm uses the exponentially weighted average of gradients to update the parameters.
- On iteration t, compute \(dw\), \(db\) using samples in one mini-batch and update the parameters as follows:
\[V_{dw} = \beta V_{dw}+(1-\beta)dw\]
\[V_{db} = \beta V_{db}+(1-\beta)db\]
\[w=w-\alpha V_{dw};\ \ b=b-\alpha V_{db}\]
- Intuition behind the moving average: it smooths out oscillations in the gradients, so the updates move more steadily toward the minimum
RMSprop
It was proposed by Geoffrey Hinton in his Neural Networks Coursera course.
On iteration t, compute dw, db using the current mini-batch.
\[S_{dw}=\beta S_{dw} + (1-\beta)dw^2\]
\[S_{db}=\beta S_{db} + (1-\beta)db^2\]
\[w = w - \alpha \frac{dw}{\sqrt{S_{dw}}};\ b=b-\alpha \frac{db}{\sqrt{S_{db}}}\]
Adaptive Moment Estimation (Adam)
On iteration t, compute \(dw\), \(db\) using the current mini-batch.
\[\begin{cases} \begin{array}{c} V_{dw}=\beta_{1}V_{dw}+(1-\beta_{1})dw\\ V_{db}=\beta_{1}V_{db}+(1-\beta_{1})db \end{array} & momentum\ update\ \beta_{1}\end{cases}\]
\[\begin{cases} \begin{array}{c} S_{dw}=\beta_{2}S_{dw}+(1-\beta_{2})dw^{2}\\ S_{db}=\beta_{2}S_{db}+(1-\beta_{2})db^{2} \end{array} & RMSprop\ update\ \beta_{2}\end{cases}\]
\[\begin{cases} \begin{array}{c} V_{dw}^{corrected}=\frac{V_{dw}}{1-\beta_{1}^{t}}\\ V_{db}^{corrected}=\frac{V_{db}}{1-\beta_{1}^{t}} \end{array}\end{cases};\ \ \begin{cases} \begin{array}{c} S_{dw}^{corrected}=\frac{S_{dw}}{1-\beta_{2}^{t}}\\ S_{db}^{corrected}=\frac{S_{db}}{1-\beta_{2}^{t}} \end{array}\end{cases}\]
\[w=w-\alpha \frac{V_{dw}^{corrected}}{\sqrt{S_{dw}^{corrected}} +\epsilon};\ b=b-\alpha\frac{V_{db}^{corrected}}{\sqrt{S_{db}^{corrected}}+\epsilon}\]
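A minimal base-R sketch of one Adam step, combining the momentum and RMSprop pieces above with their bias corrections; the default hyperparameter values shown are common choices, not taken from the slides.

```r
# w: parameter matrix (or vector); dw: its gradient from the current mini-batch;
# V, S: running first and second moments carried between iterations; t: iteration count
adam_step <- function(w, dw, V, S, t, alpha = 0.001,
                      beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8) {
  V     <- beta1 * V + (1 - beta1) * dw        # momentum (first moment)
  S     <- beta2 * S + (1 - beta2) * dw^2      # RMSprop (second moment)
  V_hat <- V / (1 - beta1^t)                   # bias-corrected estimates
  S_hat <- S / (1 - beta2^t)
  w     <- w - alpha * V_hat / (sqrt(S_hat) + epsilon)
  list(w = w, V = V, S = S)
}
```

The same update is applied to \(b\) with its own \(V\) and \(S\) terms.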
Recap of A Few Key Concepts
- Data: requires a large, well-labeled dataset
- Computation: intensive matrix-matrix operations
- Structure of fully connected feedforward NN
- Size of the NN: total number of parameters
- Depth: total number of layers (this is where deep learning comes from)
- Width of a particular layer: number of nodes (i.e. neurons) in that layer
- Activation function
- Intermediate layers
- Last layer connecting to outputs
- Loss function
- Classification (i.e. categorical response)
- Regression (i.e. continuous response)
- Optimization methods (SGD)
- Batch size
- Learning rate
- Epoch
- Deal with overfitting
- Dropout
- Regularization (L1 or L2)
Hands-on
- Databricks community edition
- Minimal language barrier in coding for most statisticians
- Zero setup, saving time by using the cloud environment
- Get familiar with the current trend of cloud computing in an industrial context
- Your local machine (vignette, rmd): different versions of Python may cause errors when running `install_keras()`. Here are things you can do if you encounter a Python backend issue on your local machine:
- Run `reticulate::py_config()` to check the current Python configuration and see whether anything needs to be changed.
- By default, `install_keras()` uses the virtual environment `~/.virtualenvs/r-reticulate`. If you don't know how to set the right environment, try setting the installation method to conda: `install_keras(method = "conda")`.
- Refer to this document for more details on how to install `keras` and the TensorFlow backend.
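A hedged sketch of the install sequence described above, falling back to the conda method if the default setup fails:

```r
install.packages("keras")            # R package from CRAN
library(keras)
install_keras()                      # installs the Python Keras/TensorFlow backend
# If the Python backend fails, inspect the configuration and retry with conda:
reticulate::py_config()
install_keras(method = "conda")
```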
Use Keras R Package
- Data preprocessing (from image to list of input features)
- One image is a 28x28 greyscale value matrix \(\rightarrow\) 784 columns of features
- Scale the values to between 0 and 1 by dividing each value by 255
- Make the response categorical (i.e., 10 columns, with the column for the corresponding digit set to 1 and the rest set to zero)
- Load the keras package and build a neural network with a few layers (see the sketch after this list)
- Define a placeholder object for the NN structure
- 1st layer: 256 nodes, fully connected, 'relu' activation, connected to the 784 input features
- 2nd layer: 128 nodes, fully connected, 'relu' activation
- 3rd layer: 64 nodes, fully connected, 'relu' activation
- 4th layer: 10 nodes, fully connected, 'softmax' activation, connected to the 10 output columns
- Add dropout to the first three layers to prevent overfitting
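A hedged keras (R) sketch of the preprocessing and model-definition steps above; the 0.3 dropout rate is an illustrative choice.

```r
library(keras)

# Preprocessing: flatten each 28x28 image into 784 features, scale to [0, 1],
# and one-hot encode the 10 digit classes
mnist   <- dataset_mnist()
x_train <- array_reshape(mnist$train$x, c(60000, 784)) / 255
x_test  <- array_reshape(mnist$test$x,  c(10000, 784)) / 255
y_train <- to_categorical(mnist$train$y, 10)
y_test  <- to_categorical(mnist$test$y, 10)

# NN structure: three fully connected ReLU hidden layers with dropout,
# and a 10-node softmax output layer
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")
```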
- Compile the NN model: define the loss function, optimizer, and metrics to track
- Fit the NN model on the training dataset: define the number of epochs, mini-batch size, and the validation split on which the metrics will be checked during training
- Predict on the testing dataset using the fitted NN model
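A hedged continuation of the sketch above for compiling, fitting, and predicting; the optimizer and the epoch/batch/validation settings are illustrative assumptions.

```r
# Compile: loss function, optimizer, and metrics to track
model %>% compile(
  loss      = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(),
  metrics   = c("accuracy")
)

# Fit on the training data, holding out 20% for validation at each epoch
history <- model %>% fit(
  x_train, y_train,
  epochs = 15, batch_size = 128, validation_split = 0.2
)

# Predict class probabilities on the test set and evaluate overall performance
pred_prob  <- model %>% predict(x_test)
pred_class <- max.col(pred_prob) - 1     # convert probabilities to digit labels 0-9
model %>% evaluate(x_test, y_test)
```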
R Scripts