In the last article, we looked at models that deal with non-time-series data. Now it's time to turn our attention to some other models. Here we will be discussing deep sequential models, which are predominantly used to process and predict time-series data.
Link to Part 1, in case you missed it.
Simple recurrent neural networks (also referred to as RNNs) are to time-series problems what CNNs are to computer vision. In a time-series problem, you feed a sequence of values to a model and ask it to predict the next n values of that sequence. An RNN goes through each value of the sequence while building up a memory of what it has seen, which helps it predict what the future will look like. (Learn more about RNNs)
Analogy: New and improved secret train
I’ve played this game as a kid, and you might know it by a different name. Kids stand in a line and you whisper a random word to the first kid. That kid adds an appropriate word and whispers the result to the next kid, and so on. By the time the message reaches the last kid, you should have an exciting story brewed up by the kids’ imaginations.
Enter simple RNNs! This is the crux of an RNN. It takes some input at time t — x(t) (the new word from the last kid) and a state from time t-1 — h(t-1) (the previous words of the message) as inputs, and produces an output — y(t) (the previous message + the new word from the last kid + your new word).
Once you train an RNN, you can (but generally won’t) keep predicting forever, because the prediction at time t (i.e. y(t)) becomes the input at t+1 (i.e. y(t)=x(t+1)). Here’s what an RNN looks like in the real world.
- Time series prediction (e.g. weather / sales predictions)
- Sentiment analysis — Given a movie/product review (a sequence of words), predict if it’s negative/positive/neutral.
- Language modelling — Given a part of a story, imagine the rest of the story / Generate code from descriptions
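To make the "input + previous state → new state" idea concrete, here's a minimal numpy sketch of a single simple-RNN step unrolled over a toy sequence. The weight names, sizes, and tanh activation are illustrative assumptions, not any particular framework's API.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # One simple-RNN step: mix the current input x(t) with the previous
    # state h(t-1), then squash with tanh. In a simple RNN the new state
    # h(t) also serves as the output y(t).
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

# Toy example: input size 3, state size 4 (sizes are arbitrary)
rng = np.random.default_rng(0)
W_x = rng.normal(size=(3, 4))   # input-to-state weights
W_h = rng.normal(size=(4, 4))   # state-to-state (recurrent) weights
b = np.zeros(4)

h = np.zeros(4)                      # h(0): empty memory
for x_t in rng.normal(size=(5, 3)):  # a sequence of 5 input vectors
    h = rnn_step(x_t, h, W_x, W_h, b)

print(h.shape)  # (4,)
```

Note how the same weights are reused at every step; only the state `h` changes as the sequence is consumed, just like the message growing as it passes from kid to kid.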
LSTM is the cool new kid in RNN-ville. An LSTM is a more complicated beast than a simple RNN and is able to remember things for longer. Like RNNs, LSTMs go through each value of the sequence while building up a memory of what they have seen, which helps them predict what the future will look like. But remember that RNNs had a single state (that represented memory)? LSTMs have two states (one long-term and one short-term), hence the name Long Short-Term Memory. (Learn more: LSTMs)
Analogy: Fast-food chain
All this explaining is making me hungry! So let’s go to a fast-food chain. This is a literal chain because, if you order a meal, one shop makes the burger, another the chips, and so on. At this fast-food drive-through, you go to the first shop and say the following.
I need a burger with a toasted tiger bun and grilled chicken.
There’s one person who takes the order (green) and sends that information to the red person; let’s say he toasts the bun. When communicating with the blue person, he can drop the “toasted” part and say,
a burger with a tiger bun and grilled chicken
(we still need the grilled part because the next shop decides the sauce based on that). Then you drive to the next shop and say,
Add cheddar, large chips and I’m wearing a green t-shirt
Now, the green person knows his t-shirt colour is completely irrelevant and drops that part. The shop also gets information from both the red and blue people in the previous shop. Next they add the sauce and prepare the chips. The red person in the second shop will hold most of the order instructions, in case we need them later (if the customer complains). But he’ll only say,
A burger and large chips
to the blue person as that’s all he needs to do his job. And finally, you get your order from the output terminal of the second shop.
LSTMs are not far from how this chain operates. At a given time t, an LSTM takes,
- an input x(t) (the customer in the example),
- an output state h(t-1) (the blue person from the previous shop) and
- a cell state c(t-1) (the red person from the previous shop),
and produces,
- an output state h(t) (the blue person in this shop) and
- a cell state c(t) (the red person in this shop).
But rather than doing direct computations on these elements, the LSTM has a gating mechanism that it can use to decide how much information from these elements it allows to flow through. For example, remember what happened when the customer said “I’m wearing a green t-shirt” at the second shop: the green person (the input gate) dropped that information because it’s not important for the order. Another example is when the red person in the first shop drops the part about the bun being toasted. There are several gates in an LSTM cell. Namely,
- An input gate (the green person) — discards information that’s not useful in the input.
- A forget gate (part of the red person) — discards information that’s not useful in the previous cell state.
- An output gate (part of the blue person) — discards information that’s not useful from the cell state when generating the output state.
As you can see, the interactions are complicated. But the main takeaway is that,
An LSTM maintains two states (an output — short-term state and a cell state — long-term state) and uses gating to discard information when computing final and interim outputs.
Here’s what an LSTM would look like.
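The two states and three gates above can be sketched as a single LSTM step in numpy. This is a minimal illustration of the standard LSTM equations; the weight layout (one matrix `W` holding all four gate pre-activations) and gate ordering are assumptions of this sketch, not a framework's API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps [x(t), h(t-1)] to four stacked pre-activations,
    # one per gate plus the candidate cell update.
    z = np.concatenate([x_t, h_prev]) @ W + b
    i, f, o, g = np.split(z, 4)
    i = sigmoid(i)          # input gate  (the green person)
    f = sigmoid(f)          # forget gate (part of the red person)
    o = sigmoid(o)          # output gate (part of the blue person)
    g = np.tanh(g)          # candidate information from the new input
    c_t = f * c_prev + i * g        # long-term (cell) state
    h_t = o * np.tanh(c_t)          # short-term (output) state
    return h_t, c_t

# Toy sizes: input 3, state 4
rng = np.random.default_rng(1)
n_in, n_h = 3, 4
W = rng.normal(size=(n_in + n_h, 4 * n_h))
b = np.zeros(4 * n_h)

h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Notice that every gate is a sigmoid, i.e. a value between 0 and 1 multiplied element-wise onto a state: "drop this" is a gate value near 0, "let it through" is near 1.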
Phew! LSTMs really took a toll on the time I have left. GRU is a successor to LSTMs that simplifies their mechanics without jeopardising performance too much. (Learn more: GRUs)
Analogy: Fast-food chain v2.0
Not to be a food critic, but the fast-food chain we saw earlier looks pretty inefficient. Is there a way to make it more efficient? Here’s one way.
- Get rid of the red person (cell state). Now both long- and short-term memories are managed by the green person (output state).
- There are only two gates — an update gate (which plays the combined roles of the input and forget gates) and a reset gate (i.e. no separate forget gate).
You can think of a GRU as an in-between of simple RNNs and LSTMs. Here’s what a GRU looks like.
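To see the simplification, here's a minimal numpy sketch of one GRU step following the standard GRU equations: a single state `h`, an update gate, and a reset gate. As before, the weight names and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(xh @ Wz + bz)   # update gate: how much of the state to rewrite
    r = sigmoid(xh @ Wr + br)   # reset gate: how much old state to use below
    h_cand = np.tanh(np.concatenate([x_t, r * h_prev]) @ Wh + bh)
    # Single state h(t) blends old memory and the candidate — no cell state.
    return (1 - z) * h_prev + z * h_cand

# Toy sizes: input 3, state 4
rng = np.random.default_rng(2)
n_in, n_h = 3, 4
Wz = rng.normal(size=(n_in + n_h, n_h)); bz = np.zeros(n_h)
Wr = rng.normal(size=(n_in + n_h, n_h)); br = np.zeros(n_h)
Wh = rng.normal(size=(n_in + n_h, n_h)); bh = np.zeros(n_h)

h = gru_step(rng.normal(size=n_in), np.zeros(n_h), Wz, Wr, Wh, bz, br, bh)
print(h.shape)  # (4,)
```

Compared with the LSTM sketch, there is one state instead of two and two gates instead of three, which is exactly the efficiency gain the fast-food chain v2.0 is after.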
We looked at simple RNNs, LSTMs and GRUs. Here are the main take-aways.
- Simple RNNs — A simple model that goes from one time step to the next while generating an output state at every step (no gating mechanism)
- LSTMs — Quite complicated. Has two states; a cell state (long-term) and an output state (short-term). It also has a gating mechanism to control how much information flows through the model.
- GRUs — A compromise between simple RNNs and LSTMs. Has only one output state but still has the gating mechanism.
Next up is the hottest topic in deep learning: Transformers.
Part 1: Feedforward models