We will be training a recurrent neural network to predict Amazon stock prices, which we can collect with pandas_datareader. The stock data is stored in a DataFrame; we'll predict the closing prices, which can be accessed in the Close column. These prices will be separated into a training set and a testing set.
The training and testing sets are long sequences of stock prices. We would like to convert these sequences into x and y sets, where each x represents a window of consecutive prices and y is the next price.
Consider a raw sequence of prices. With a ‘window size’ of three, we would want to generate (x, y) pairs in which each x is three consecutive prices and each y is the price that immediately follows them.
This can be programmed using the following code:
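A minimal sketch of the windowing logic, using hypothetical prices and a helper name (make_windows) assumed for illustration:

```python
def make_windows(prices, window_size=3):
    """Slide a window across the price sequence, producing (x, y)
    pairs where x is `window_size` consecutive prices and y is
    the price that immediately follows them."""
    X, y = [], []
    for i in range(len(prices) - window_size):
        X.append(prices[i:i + window_size])
        y.append(prices[i + window_size])
    return X, y

# Illustrative prices (hypothetical values, not real Amazon data)
prices = [10, 11, 13, 12, 14, 15]
X, y = make_windows(prices, window_size=3)
# X -> [[10, 11, 13], [11, 13, 12], [13, 12, 14]]
# y -> [12, 14, 15]
```

Note the trade-off in the window size: a larger window gives the model more context per prediction but yields fewer training pairs.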
Great! We’ve collected our training data. One more thing, though: currently the data is in list form, and to run it through a Keras model, data must almost always be in array form. Additionally, X must be a three-dimensional array; in this case, we reshape X so that its third dimension is 1.
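A sketch of that conversion, assuming NumPy and hypothetical windowed lists:

```python
import numpy as np

# Hypothetical windowed data: 4 sequences of length 3
X = [[10, 11, 13], [11, 13, 12], [13, 12, 14], [12, 14, 15]]
y = [12, 14, 15, 16]

# Convert lists to arrays, then add a third dimension of size 1:
# (samples, timesteps) -> (samples, timesteps, 1)
X = np.array(X).reshape(len(X), len(X[0]), 1)
y = np.array(y)

print(X.shape)  # (4, 3, 1)
```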
It’s worth exploring why RNNs require a three-dimensional input (at least, their implementations in Keras). In the case of stock prediction, at each time step, there is only one data point — the stock price.
However, consider an RNN learning to generate a sequence of movements: up, down, right, and left. Each time step has four values (a 1 or a 0 for each direction); for example, right might be represented as [0, 0, 1, 0]. If there are 5000 items in the training set and each has a sequence length of 10 movements, the shape of the data would be
(5000, 10, 4).
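To make that shape concrete, a small sketch (the one-hot ordering of the directions is an assumption):

```python
import numpy as np

# One-hot encodings for each movement (the ordering here is assumed)
moves = {'up':    [1, 0, 0, 0],
         'down':  [0, 1, 0, 0],
         'right': [0, 0, 1, 0],
         'left':  [0, 0, 0, 1]}

# 5000 sequences, each 10 movements long, each movement a 4-vector
data = np.array([[moves['right']] * 10] * 5000)
print(data.shape)  # (5000, 10, 4)
```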
Similarly, in text generation that is character-by-character based, each time step has at least 26 different values of 0s or 1s, indicating letters of the alphabet. Often, additional characters include punctuation or spaces. This idea will be iterated upon more later.
Now that the training data has been created, we can get started constructing the recurrent neural network. The simplest RNN has two layers: a standard recurrent layer and a standard dense layer, which will be connected through a
Sequential model. We can go ahead and import these.
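With a recent TensorFlow installation, those imports might look like:

```python
# The simplest RNN needs a recurrent layer, a dense layer,
# and a Sequential model to connect them
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense
```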
We can begin creating a recurrent neural network now. Although it’s not entirely accurate, one can think of the
SimpleRNN(10, …) as having ’10 neurons’, much like a dense layer. Since we are predicting single values on a continuous scale, the last dense layer has one neuron and a linear activation.
Because our sequences have 50 elements and each element can be represented using only one value (as opposed to something like 26 for the alphabet), the input shape of a sequence is (50, 1).
Lastly, the Keras model must be compiled with a loss (mean squared error is a default for regression), an optimizer (Adam is a default), and optional metrics to track the progress (mean absolute error). We’ll train the model on X_train and y_train for 500 epochs and save the training progress to a history object.
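Putting the pieces together, a sketch of the full model; small synthetic data stands in for the real windowed prices, and only a few epochs are run here so the snippet finishes quickly (the text trains for 500):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

model = Sequential([
    SimpleRNN(10, input_shape=(50, 1)),  # '10 neurons' over 50-step sequences
    Dense(1, activation='linear')        # one continuous output: the next price
])
model.compile(loss='mean_squared_error', optimizer='adam',
              metrics=['mean_absolute_error'])

# Tiny synthetic stand-in for the real windowed stock data
X_train = np.random.rand(32, 50, 1)
y_train = np.random.rand(32)

history = model.fit(X_train, y_train, epochs=5, verbose=0)
```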
A recurrent layer can be thought of as parsing several inputs, taking into account sequential order; if it helps, it can be thought of as a derivative, finding a generalizable pattern across sequences. Stacking two recurrent layers, then, is like taking a double derivative, or the ‘difference of the difference’. Much like stacking multiple convolutional layers, it allows for more complex relationships to be identified.
If one tries to stack two recurrent layers naively, it won’t work: a parameter,
return_sequences=True, must be added. This returns the output as a sequence that can be inputted into another recurrent layer. We may decide to design a deep recurrent neural network with several stacked layers as follows:
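One possible deep design (the layer sizes are illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

model = Sequential([
    # return_sequences=True passes the full sequence to the next layer
    SimpleRNN(10, return_sequences=True, input_shape=(50, 1)),
    SimpleRNN(10),                 # final recurrent layer returns a single vector
    Dense(1, activation='linear')
])
```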
LSTMs, or Long Short-Term Memory networks, are an improvement upon naïve recurrent neural networks because they can ‘memorize’ important information across long input sequences. The interface for using LSTMs is the same as for the SimpleRNN layer.
Like RNNs, LSTMs can be stacked to develop complexity and deeper understanding of patterns in the inputs. Note that because Long Short-Term Memory networks have a variety of memory mechanisms, they are more expensive to train than standard recurrent neural networks.
Similarly, the Gated Recurrent Unit is another recurrent layer. Like the LSTM, its goal is to retain important information over long sequences of inputs through gates, but it approaches the task in a different way. It is implemented just like the SimpleRNN and LSTM layers.
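As a sketch, LSTM and GRU drop into the same Sequential structure (sizes illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU, Dense

# LSTM and GRU are drop-in replacements for SimpleRNN
model = Sequential([
    LSTM(10, return_sequences=True, input_shape=(50, 1)),
    GRU(10),
    Dense(1, activation='linear')
])
```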
A note on our task of stock forecasting — it’s bad practice to have the network predict stocks in their raw values (e.g. $576, $598, $589, …) because of extrapolation. The main idea is that stocks, especially those of a high-growth company like Amazon, are continually rising, and that it’s difficult for models to predict values in a range it hasn’t been trained on.
If one were to train a model on data from 2000 to 2020 where prices ranged from x to y, the model will have difficulty predicting prices above y or below x. Theoretically, a recurrent network should be able to learn such base-level relationships by itself, but it’s always good to reduce the work it needs to do.
With any forecasting task, difference the data beforehand. Differencing transforms the dataset such that each value is a change from the previous one, and can either be done absolutely (+3, -4, +2, +1) or on a percentage scale (95%, 103%, 105%). The benefit of differencing is that values lie on a more centered and stationary scale that is easier for a model to operate on.
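A minimal sketch of both kinds of differencing (the function name is assumed for illustration):

```python
def difference(prices, relative=False):
    """Absolute differencing: each value becomes the change from the
    previous value. With relative=True, values become ratios of the
    previous value (a percentage scale)."""
    if relative:
        return [curr / prev for prev, curr in zip(prices, prices[1:])]
    return [curr - prev for prev, curr in zip(prices, prices[1:])]

prices = [100, 103, 99, 101, 102]
print(difference(prices))                 # [3, -4, 2, 1]
print(difference(prices, relative=True))  # 1.03, 0.96..., 1.02..., 1.00...
```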
Stock forecasting is a simpler usage of recurrent neural networks: aside from differencing, there is little preprocessing to be done, because the data is inherently numerical. Other applications of recurrent neural networks may not be so clean, particularly text.
For example, if I were to train recurrent neural networks to predict the next character sequence, the vectorized input of ‘abe’ would look like this:
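A sketch of that one-hot vectorization over a 26-letter alphabet:

```python
import string

alphabet = string.ascii_lowercase  # 26 letters

def one_hot(char):
    """Return a 26-length vector with a 1 at the character's position."""
    vec = [0] * len(alphabet)
    vec[alphabet.index(char)] = 1
    return vec

# The sequence 'abe' becomes three 26-length timestep vectors
sequence = [one_hot(c) for c in 'abe']
print(sequence[0][:5])  # [1, 0, 0, 0, 0], with 'a' hot in position 0
```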
To put some jargon in the example: a ‘sequence’ is a collection of timesteps, in this case being
[‘a’, ‘b’, ‘e’]. A timestep is a vector of fixed length; for instance,
[1, 0, 0, 0, 0, …] for
‘a’. The dataset is a collection of sequences, so the shape of a sequential dataset is (number of sequences, number of timesteps within a sequence, number of components in a timestep).
It’s usually advisable to use embeddings with large vocabularies; an embedding maps each token to a point in an embedding space. This can improve performance and training time: in a properly developed embedding space, tokens (words) that have similar meanings in the context of the dataset are placed physically close together in the embedding.
When embeddings are being used, three parameters are required: the input dimensions, the output dimensions, and the input length, which are 1000, 64, and 50 in this case, respectively.
- The input dimensions refers to the size of the vocabulary. For instance, if there were 27 total unique characters in the vocabulary (alphabet and the space), the inputted value would be 27. This is the third number in the tuple (number of sequences, number of timesteps within a sequence, number of components in a timestep).
- The output dimensions refers to the dimensionality of the embedding vectors. As mentioned before, think of embedding as a specialized dimensionality reduction.
- The input length refers to the number of timesteps within each sequence, if it is fixed. This is the second number in the tuple (number of sequences, number of timesteps within a sequence, number of components in a timestep).
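As a sketch, an Embedding layer with the first two parameters, applied to a batch whose sequences are 50 timesteps long (the input length); each integer token is mapped to a 64-dimensional vector:

```python
import numpy as np
from tensorflow.keras.layers import Embedding

# 1000-token vocabulary mapped to 64-dimensional vectors
embedding = Embedding(input_dim=1000, output_dim=64)

# A batch of 2 sequences, each 50 timesteps long (the input length)
tokens = np.random.randint(0, 1000, size=(2, 50))
vectors = embedding(tokens)
print(vectors.shape)  # (2, 50, 64)
```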
Afterwards, recurrent, LSTM, and GRU layers can be stacked on top of the embedding layer. It’s always wise to stack several Dense layers, along with the standard ANN shebang — batch norm, dropout, etc. Once you have a solid understanding of the dynamics of recurrent neural networks, they’re not hard at all to implement.