Recurrent Neural Networks
Hui Lin @Google
Types of Neural Network
Why sequences?
- Speech recognition: (audio clip) \(\longrightarrow\) "Get your facts first, then you can distort them as you please."
- Music generation: \(\emptyset\) \(\longrightarrow\) (generated music clip)
- Sentiment classification: "Great movie? Are you kidding me! Not worth the money." \(\longrightarrow\) (star rating)
- DNA sequence analysis: ACGGGGCCTACTGTCAACTG \(\longrightarrow\) AC GGGGCCTACTG TCAACTG
- Machine translation: 网红脸 \(\longrightarrow\) Internet celebrity face
- Video activity recognition: (video frames) \(\longrightarrow\) Running
- Named entity recognition: Use Netlify and Hugo. \(\longrightarrow\) Use **Netlify** and **Hugo**.
RNN types
- rectangle: a vector
- green: input vector
- blue: output vector
- red: intermediate state vector
- arrow: a matrix multiplication
Notation
x: Use(\(x^{<1>}\)) Netlify(\(x^{<2>}\)) and(\(x^{<3>}\)) Hugo(\(x^{<4>}\)) .(\(x^{<5>}\))
y: 0 (\(y^{<1>}\)) 1(\(y^{<2>}\)) 0(\(y^{<3>}\)) 1(\(y^{<4>}\)) 0(\(y^{<5>}\))
\(x^{(i)<t>}\): the \(t^{th}\) element of the \(i^{th}\) input sample; \(T_x^{(i)}\): its input length
\(y^{(i)<t>}\): the \(t^{th}\) element of the \(i^{th}\) output sample; \(T_y^{(i)}\): its output length
Representing words
\(\left[\begin{array}{c} a[1]\\ aaron[2]\\ \vdots\\ and[360]\\ \vdots\\ Hugo[4075]\\ \vdots\\ Netlify[5210]\\ \vdots\\ use[8320]\\ \vdots\\ Zulu[10000] \end{array}\right]\Longrightarrow use=\left[\begin{array}{c} 0\\ 0\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 1\\ \vdots\\ 0 \end{array}\right], Netlify=\left[\begin{array}{c} 0\\ 0\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\ \vdots\\ 0 \end{array}\right], and=\left[\begin{array}{c} 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 0 \end{array}\right], Hugo=\left[\begin{array}{c} 0\\ 0\\ \vdots\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 0 \end{array}\right]\)
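A minimal sketch of building such one-hot vectors in Python/NumPy; the tiny vocabulary and its indices here are illustrative, not the actual 10,000-word vocabulary.

```python
import numpy as np

# Illustrative vocabulary; the real one would have |V| = 10,000 entries.
vocab = ["a", "aaron", "and", "Hugo", "Netlify", "use", "Zulu"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """Return the one-hot column vector for a word."""
    v = np.zeros((vocab_size, 1))
    v[word_to_index[word]] = 1.0
    return v

x1 = one_hot("use")       # 1 in the position indexed by "use", 0 elsewhere
x2 = one_hot("Netlify")
```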
What is an RNN?
Forward Propagation
\(a^{<0>} = \mathbf{0}\); \(a^{<1>} = g(W_{aa}a^{<0>} + W_{ax}x^{<1>} + b_a)\)
\(\hat{y}^{<1>} = g'(W_{ya}a^{<1>} + b_y)\)
\(a^{<t>} = g(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)\)
\(\hat{y}^{<t>} = g'(W_{ya}a^{<t>} + b_y)\)
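A minimal NumPy sketch of this forward pass for one sequence, assuming tanh for \(g\) and a sigmoid output for \(g'\) (both choices are illustrative).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, Waa, Wax, Wya, ba, by):
    """xs: list of input column vectors x^<1>..x^<T_x>.
    Returns the hidden states a^<t> and the predictions y_hat^<t>."""
    a = np.zeros((Waa.shape[0], 1))          # a^<0> = 0
    a_list, y_hat_list = [], []
    for x in xs:
        a = np.tanh(Waa @ a + Wax @ x + ba)  # a^<t>
        y_hat = sigmoid(Wya @ a + by)        # y_hat^<t>
        a_list.append(a)
        y_hat_list.append(y_hat)
    return a_list, y_hat_list
```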
Forward Propagation
\(L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>}\log(\hat{y}^{<t>}) - (1-y^{<t>})\log(1-\hat{y}^{<t>})\)
\(L(\hat{y}, y) = \sum_{t=1}^{T_y}L^{<t>} (\hat{y}^{<t>}, y^{<t>})\)
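In the same sketch, the total loss is just the sum of the per-step cross-entropy losses (assuming a binary label \(y^{<t>}\) at every step):

```python
def sequence_loss(y_hats, ys, eps=1e-12):
    """Total loss L = sum over t of the per-step cross-entropy L^<t>."""
    total = 0.0
    for y_hat, y in zip(y_hats, ys):
        total += np.sum(-y * np.log(y_hat + eps) - (1 - y) * np.log(1 - y_hat + eps))
    return total
```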
Backpropagation through time
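Backpropagation through time applies the chain rule backwards over the unrolled graph; because the weights are shared across time, their gradients accumulate over all steps. A minimal NumPy sketch for the forward pass above (tanh hidden units, sigmoid output, cross-entropy loss; all of these are the same illustrative assumptions as before):

```python
def rnn_bptt(xs, ys, a_list, y_hat_list, Waa, Wax, Wya):
    """Gradients of the summed cross-entropy loss w.r.t. the shared parameters."""
    dWaa, dWax, dWya = np.zeros_like(Waa), np.zeros_like(Wax), np.zeros_like(Wya)
    dba = np.zeros((Waa.shape[0], 1))
    dby = np.zeros((Wya.shape[0], 1))
    da_next = np.zeros((Waa.shape[0], 1))        # gradient flowing in from step t+1
    for t in reversed(range(len(xs))):
        a_prev = a_list[t - 1] if t > 0 else np.zeros_like(a_list[0])
        dz_y = y_hat_list[t] - ys[t]             # sigmoid + cross-entropy shortcut
        dWya += dz_y @ a_list[t].T
        dby += dz_y
        da = Wya.T @ dz_y + da_next              # from the output and from step t+1
        dz_a = (1.0 - a_list[t] ** 2) * da       # derivative of tanh
        dWaa += dz_a @ a_prev.T
        dWax += dz_a @ xs[t].T
        dba += dz_a
        da_next = Waa.T @ dz_a                   # pass the gradient back to step t-1
    return dWaa, dWax, dWya, dba, dby
```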
Deep RNN
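A deep RNN stacks several recurrent layers so that each layer's hidden-state sequence feeds the layer above it. A hedged tf.keras sketch; the layer sizes and input dimension are illustrative:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 300)),                 # (T_x time steps, 300 features per step)
    tf.keras.layers.SimpleRNN(64, return_sequences=True),  # return_sequences feeds the next layer
    tf.keras.layers.SimpleRNN(64, return_sequences=True),
    tf.keras.layers.SimpleRNN(64, return_sequences=True),
    tf.keras.layers.Dense(1, activation="sigmoid"),    # a y_hat^<t> at every time step
])
```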
Vanishing gradients with RNNs
- The cat, which ate already, was full.
- The cats, which ate already, were full.
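The gradient that links "cat(s)" to "was/were" must pass through one recurrent-weight factor per intervening step; when those factors are typically smaller than one, the product shrinks roughly geometrically with the distance. A tiny NumPy illustration (size and weight scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_a = 50
Waa = rng.normal(scale=0.05, size=(n_a, n_a))    # illustrative recurrent weights

# Gradient flowing from the loss at a late time step back to an early one is
# multiplied by (roughly) Waa^T once per step; the tanh' factor only shrinks it further.
grad = np.ones((n_a, 1))
for t in range(1, 21):
    grad = Waa.T @ grad
    print(t, np.linalg.norm(grad))               # norm decays rapidly towards 0
```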
LSTM
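The LSTM figures from the slides are not reproduced here; for reference, the standard LSTM cell update in the same notation (update, forget, and output gates \(\Gamma_u, \Gamma_f, \Gamma_o\); \(*\) is the element-wise product):

\[\begin{aligned}
\tilde{c}^{<t>} &= \tanh(W_c[a^{<t-1>}, x^{<t>}] + b_c)\\
\Gamma_u &= \sigma(W_u[a^{<t-1>}, x^{<t>}] + b_u)\\
\Gamma_f &= \sigma(W_f[a^{<t-1>}, x^{<t>}] + b_f)\\
\Gamma_o &= \sigma(W_o[a^{<t-1>}, x^{<t>}] + b_o)\\
c^{<t>} &= \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}\\
a^{<t>} &= \Gamma_o * \tanh(c^{<t>})
\end{aligned}\]

When \(\Gamma_f \approx 1\) and \(\Gamma_u \approx 0\), the cell state \(c^{<t>}\) is carried forward almost unchanged, which gives gradients a path along which they do not vanish as quickly as in the plain RNN.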
Word representation
- Vocabulary = [a, aaron, …, zulu], |V| = 10,000
- One-hot representation
\[\begin{array}{cccccc}
Man & Woman & King & Queen & Apple & Pumpkin\\
(5391) & (9853) & (4914) & (7157) & (456) & (6332)\\
\left[\begin{array}{c}
0\\
0\\
0\\
0\\
\vdots\\
1\\
\vdots\\
0\\
0
\end{array}\right] & \left[\begin{array}{c}
0\\
0\\
0\\
0\\
0\\
\vdots\\
1\\
\vdots\\
0
\end{array}\right] & \left[\begin{array}{c}
0\\
0\\
0\\
\vdots\\
1\\
\vdots\\
0\\
0\\
0
\end{array}\right] & \left[\begin{array}{c}
0\\
0\\
0\\
0\\
0\\
\vdots\\
1\\
\vdots\\
0
\end{array}\right] & \left[\begin{array}{c}
0\\
\vdots\\
1\\
\vdots\\
0\\
0\\
0\\
0\\
0
\end{array}\right] & \left[\begin{array}{c}
0\\
0\\
0\\
0\\
0\\
\vdots\\
1\\
\vdots\\
0
\end{array}\right]
\end{array}\]
Word representation
- My favourite Christmas dessert is pumpkin ____
- My favourite Christmas dessert is apple ____
- With one-hot vectors, any two distinct words have inner product 0, so learning that "apple pie" is likely tells the model nothing about "pumpkin pie"; we need a representation that captures word similarity.
Featurized representation: word embedding
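The embedding figures from the slides are not reproduced here; a minimal sketch of the idea, with four illustrative features (roughly gender, royal, age, food) and numbers chosen to be consistent with the difference vectors on the Analogies slides below:

```python
import numpy as np

# Illustrative 4-dimensional embeddings: [gender, royal, age, food].
embedding = {
    "man":     np.array([-1.00, 0.01, 0.03, 0.09]),
    "woman":   np.array([ 1.00, 0.02, 0.00, 0.09]),
    "king":    np.array([-0.95, 0.93, 0.70, 0.02]),
    "queen":   np.array([ 0.97, 0.95, 0.69, 0.03]),
    "apple":   np.array([ 0.00, -0.01, 0.03, 0.95]),
    "pumpkin": np.array([ 0.01, 0.00, -0.02, 0.97]),
}
# Unlike one-hot vectors, "apple" and "pumpkin" are now close to each other,
# so what the model learns about one transfers to the other.
```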
Analogies
- man \(\longrightarrow\) woman \(\approx\) king \(\longrightarrow\) ?
Analogies
- \(e_{man} - e_{woman} = [-2, -0.01, 0.03, 0]^{T} \approx [-2, 0, 0, 0]^{T}\)
- \(e_{king} - e_{queen} = [-1.92, -0.02, 0.01, -0.01]^{T} \approx [-2, 0, 0, 0]^{T}\)
Analogies
\(e_{man} - e_{woman} \approx e_{king} - e_{?}\)
\(\rightarrow \underset{w}{argmax} \{sim (e_{w}, e_{king} - e_{man} + e_{woman})\}\)
Cosine similarity
\(sim(e_w, e_{king}-e_{man}+e_{woman})\) = ?
Cosine similarity: \(sim(a,b) = \frac{a^{T}b}{ ||a||_{2} ||b||_{2}}\)
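A minimal sketch of the analogy search with cosine similarity, reusing the illustrative `embedding` dict from the featurized-representation sketch above:

```python
def cosine_sim(a, b):
    """sim(a, b) = a^T b / (||a||_2 ||b||_2)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, embedding):
    """Find w maximising sim(e_w, e_c - e_a + e_b), e.g. man : woman :: king : ?"""
    target = embedding[c] - embedding[a] + embedding[b]
    candidates = {w: cosine_sim(e, target)
                  for w, e in embedding.items() if w not in (a, b, c)}
    return max(candidates, key=candidates.get)

print(analogy("man", "woman", "king", embedding))   # expected: "queen"
```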
Embedding matrix
- In practice, we look up the word's embedding vector (a column of the embedding matrix \(E\)) directly instead of multiplying \(E\) by the one-hot vector.
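A sketch of the two equivalent views: multiplying the embedding matrix by a one-hot vector versus simply indexing the corresponding column (the dimensions are illustrative; the index 8320 for "use" follows the vocabulary shown earlier):

```python
import numpy as np

vocab_size, emb_dim = 10000, 300
E = np.random.randn(emb_dim, vocab_size)      # embedding matrix (learned in practice)

j = 8320                                      # index of "use" in the vocabulary
o_j = np.zeros((vocab_size, 1)); o_j[j] = 1   # one-hot vector

e_via_matmul = E @ o_j                        # E o_j: slow, mostly multiplying by zeros
e_via_lookup = E[:, [j]]                      # direct column lookup: what is done in practice

assert np.allclose(e_via_matmul, e_via_lookup)
```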
Data Preprocessing
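The preprocessing code itself is not shown on these slides; as one plausible sketch, Keras-style preprocessing that tokenizes text, maps words to integer indices, and pads every sequence to a common length \(T_x\):

```python
import tensorflow as tf

texts = ["Use Netlify and Hugo.",
         "Great movie? Are you kidding me! Not worth the money."]

# Map words to integer indices; out-of-vocabulary words become <UNK>.
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10000, oov_token="<UNK>")
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Pad/truncate every sequence to the same length so sequences can be batched.
padded = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=10, padding="post")
print(padded.shape)   # (2, 10)
```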