LSTM to predict Dow Jones Industrial Average: A Time Series forecasting model

The Dow Jones Industrial Average (DJIA) tracks 30 large, publicly owned companies trading on the New York Stock Exchange (NYSE) and the NASDAQ. The "Industrial" part of the name is largely historical, as most of the 30 modern components have little or nothing to do with traditional heavy industry. The DJIA is a price-weighted index, which means that stocks with a higher share price carry a greater weight in the index than stocks with a lower share price.
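
To make the price-weighting idea concrete, here is a minimal sketch. The three tickers, their prices and the divisor are made up for illustration; the real index uses 30 components and its own published divisor.

```python
def price_weighted_index(prices, divisor):
    """Price-weighted index: sum of share prices divided by a divisor."""
    return sum(prices.values()) / divisor

# Hypothetical three-stock index (tickers, prices and divisor are invented)
prices = {"AAA": 300.0, "BBB": 60.0, "CCC": 40.0}
divisor = 4.0
base = price_weighted_index(prices, divisor)

# Apply the same 10% rise to the most expensive and to the cheapest stock
up_expensive = dict(prices, AAA=prices["AAA"] * 1.10)
up_cheap = dict(prices, CCC=prices["CCC"] * 1.10)

move_expensive = price_weighted_index(up_expensive, divisor) - base
move_cheap = price_weighted_index(up_cheap, divisor) - base

print(f"base index: {base:.1f}")
print(f"10% rise in AAA moves the index by {move_expensive:.1f} points")
print(f"10% rise in CCC moves the index by {move_cheap:.1f} points")
```

The same percentage move shifts the index far more when it happens in the higher-priced stock, which is exactly what "price-weighted" means.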

Here, we will experiment with an LSTM network and historical data since 2000, and investigate how accurately we can predict the future closing price. To keep this simple, we will not do much feature engineering; instead, we will forecast the Close price based on the historical open, close, high, low and volume values.

Problem statement

The problem considered here is:

  • Regression predictive modeling problem (trying to forecast the exact closing price for the next day)

Neural Network

Let us first go over a bit of the theory behind neural networks.

Neural networks have an input layer, one or many hidden layers, and an output layer. The number of hidden layers defines just how deep the neural network is.

Neural networks perform representation learning, where each layer of the neural network learns a representation from the previous layer. Each layer has a certain number of nodes (neurons) that comprise the layer. The nodes of each layer are then connected to the nodes of the next layer. During the training process, the neural network learns the weights to assign to the connections between nodes. The weighted sum of a node's inputs is fed into an activation function, which determines what value the node passes on to the next layer of the neural network.

Neural networks may also have bias nodes, which are always constant values and, unlike the normal nodes, are not connected to the previous layer. With the hidden layers and nodes, bias nodes and activation functions, the network tries to learn the right function approximation to map the input layer to the output layer.
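
The pieces described above (weights, a bias and an activation function) can be sketched for a single layer as follows. The input values, weights and bias are arbitrary numbers chosen for illustration, not values from a trained network.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: negative values become 0."""
    return np.maximum(0.0, z)

def dense_forward(x, weights, bias):
    """One layer: weighted sum of the inputs, plus bias, then activation."""
    return relu(x @ weights + bias)

# Three input nodes feeding two hidden nodes (all numbers are made up)
x = np.array([1.0, 2.0, 0.5])
weights = np.array([[0.2, -0.5],
                    [0.4, 0.1],
                    [-0.6, 0.3]])
bias = np.array([0.1, -0.2])

hidden = dense_forward(x, weights, bias)
print(hidden)
```

The second hidden node receives a negative weighted sum, so ReLU sets its output to zero and it contributes nothing to the next layer for this input.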

Prediction problems in time series are a different type of modeling problem. A time series adds the complexity of sequence dependence among the input variables. In this context, an LSTM can retain information from previous inputs in its internal memory when it is given a long run of sequential data. The gated architecture of an LSTM lets it control what is written to, kept in, and read from its memory state.
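
To show what the gates do, here is a minimal NumPy sketch of a single LSTM cell stepped over a sequence. It follows the standard LSTM equations; the weight layout, the sizes (5 input features, 3 hidden units) and the random weights are assumptions made for illustration, and a real model would learn the weights during training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*H, D) input weights, U: (4*H, H) recurrent weights, b: (4*H,) bias.
    Rows are stacked in the order: input gate, forget gate, output gate, candidate.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])      # forget gate: how much old memory to keep
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much memory to expose
    g = np.tanh(z[3 * H:4 * H])  # candidate values to write
    c_t = f * c_prev + i * g     # updated memory (cell) state
    h_t = o * np.tanh(c_t)       # updated hidden state (the cell's output)
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 5, 3                      # hypothetical sizes
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
sequence = rng.normal(size=(60, D))  # 60 time steps of random stand-in data
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)
```

The cell state `c` is carried from one time step to the next, which is how information from early inputs can still influence the output at the end of the sequence.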

However, before applying a deep-learning-based neural network, we should be aware of the pros and cons.

Pros:

  • Can achieve high prediction accuracy and capture complex underlying patterns in the data
  • The network’s hidden layers help to reduce the need for feature engineering considerably

Cons:

  • Requires a large amount of computing power
  • Feature scaling is a must; a large amount of data is also needed for training

We can see an upward trend in the data set over time. We will keep things simple and work with the data as is.

New features

Here, we added two new features and separated our feature columns into a new data frame.

Train-test split

We will hold out the last 20% of the values to validate the model trained on the first 80%.
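
A chronological split like this can be sketched in a few lines. The function name and the stand-in data are hypothetical; the point is that the rows are not shuffled, so the held-out block is always the most recent one.

```python
def chronological_split(rows, test_fraction=0.2):
    """Split time-ordered rows: earliest part for training, latest part for testing."""
    n_test = round(len(rows) * test_fraction)
    split_at = len(rows) - n_test
    return rows[:split_at], rows[split_at:]

# Stand-in for 100 trading days in date order (values are just day indices)
closes = list(range(100))
train, test = chronological_split(closes, test_fraction=0.2)

print(len(train), len(test))
print("last training day:", train[-1], "first test day:", test[0])
```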


Data preparation

Here we will shape the data for the LSTM. I have used a look-back period of 60 days.

We have converted the data to a 3D shape (samples, time steps, features), which is what the LSTM needs to work.
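
Here is a minimal sketch of how a 2D table of daily values can be cut into 60-day windows. The helper name `make_windows`, the choice of target column and the synthetic data are assumptions for illustration; the original code may differ.

```python
import numpy as np

def make_windows(data, target_col, look_back=60):
    """Turn a (days, features) array into LSTM inputs and next-day targets.

    Each sample holds the previous `look_back` days of all features;
    its target is the value of `target_col` on the following day.
    """
    X, y = [], []
    for end in range(look_back, len(data)):
        X.append(data[end - look_back:end])
        y.append(data[end, target_col])
    return np.array(X), np.array(y)

# Synthetic stand-in: 100 days, 4 features
data = np.arange(100 * 4, dtype=float).reshape(100, 4)
X, y = make_windows(data, target_col=1, look_back=60)

print(X.shape, y.shape)  # (40, 60, 4) (40,)
```

With 100 days and a 60-day look-back, only 40 samples remain, because the first 60 days have no complete history before them.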

LSTM network architecture

Once the Sequential model is created, we have to specify the input shape by designating the number of dimensions in the original feature set (4 here). The number of units in each layer can be chosen with the number of features in mind. I have used 10, 20 and 30 units in the successive hidden layers.

We also need to specify the activation function to be applied to the input layer and the number of nodes we want each hidden layer to have. The LSTM learns the weights to apply at each of the layers. The activation function determines how strongly each node is activated for use in the next layer. We have used the rectified linear unit (ReLU), which is a non-linear activation function. Compared to others (tanh, sigmoid, etc.), the advantage of ReLU is that it is among the least computationally expensive non-linear activations.

We have not done any data shuffling or cross-validation. Cross-validation generally provides a more robust estimate of a model's performance on unseen data, but it comes at a cost in computation time. For a large data set, 5- or 10-fold validation may add 15 to 20 minutes to model evaluation.

To compile the model, we need to select a loss function, an optimizer that sets the process by which the weights are learned, and a list of metrics to output to help us evaluate the goodness of the neural network. We want to use MSE as the evaluation metric.


Neural networks train for many rounds (known as epochs). For each batch within an epoch, the network readjusts its learned weights to reduce the loss. We set an optimizer to help the network efficiently learn weights that minimize the loss function. We will use the Adam optimization algorithm, which adapts the learning rate for each parameter over the course of the training process.
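
Putting the architecture and compile steps together, a minimal Keras sketch could look like the following. It assumes TensorFlow is installed and that the three hidden layers are stacked LSTM layers with 10, 20 and 30 units followed by a single-unit Dense output; the exact layer arrangement in the original code is not shown in the text, so treat this as one plausible reading rather than the original model.

```python
from tensorflow import keras

LOOK_BACK = 60   # days per input window, as in the text
N_FEATURES = 4   # feature count quoted in the text

model = keras.Sequential([
    keras.Input(shape=(LOOK_BACK, N_FEATURES)),
    # return_sequences=True hands the full sequence to the next LSTM layer
    keras.layers.LSTM(10, activation="relu", return_sequences=True),
    keras.layers.LSTM(20, activation="relu", return_sequences=True),
    # the last LSTM layer returns only its final output
    keras.layers.LSTM(30, activation="relu"),
    # one output value: the predicted (scaled) closing price
    keras.layers.Dense(1),
])

# Adam optimizer with mean squared error as the loss
model.compile(optimizer="adam", loss="mean_squared_error")
model.summary()
```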

Training the Model

Finally, we need to choose the evaluation metric.

We also need to select the number of epochs and the batch size and then begin the training process by calling the fit method. The number of epochs determines how many times training passes over the entire data set we feed into the neural network. We will set this to 10.

The batch size sets the number of samples the neural network trains on before making the next gradient update. We will set this to 32 samples.
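
The relationship between batch size, epochs and the number of gradient updates is simple arithmetic. In the sketch below the number of training samples is a made-up figure, and it assumes the final, smaller batch of each epoch is still used for an update.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of gradient updates in one pass over the data."""
    return math.ceil(n_samples / batch_size)

n_samples = 4010   # hypothetical number of training windows
batch_size = 32
epochs = 10

per_epoch = updates_per_epoch(n_samples, batch_size)
total_updates = per_epoch * epochs

print(per_epoch, total_updates)  # 126 1260
```

A smaller batch size means more updates per epoch, at the cost of noisier gradient estimates for each update.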

Now we need to test the model, so let’s prepare our test data set.

Our predicted values are on the scaled range, so we need to bring them back to the original price level. The scaling values cover all the columns (open, close, high, low and volume), where the first number belongs to ‘open’.

We can do this with an inverse transform; however, I have rescaled the numbers back to the original level by hand.
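
As a sketch of what rescaling by hand involves, the example below assumes min-max scaling to the range 0 to 1; the text does not say which scaler was used, so this is an assumption, and the prices are invented. In practice the minimum and maximum should be the ones computed for the close column when the scaler was fitted.

```python
def min_max_scale(values):
    """Scale values to the 0..1 range; also return the min and max used."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values], lo, hi

def inverse_min_max(scaled, lo, hi):
    """Undo min-max scaling by hand: multiply by the range, add the minimum."""
    return [s * (hi - lo) + lo for s in scaled]

# Invented closing prices
closes = [10000.0, 12500.0, 15000.0, 20000.0]
scaled, lo, hi = min_max_scale(closes)
restored = inverse_min_max(scaled, lo, hi)

print(scaled)
print(restored)
```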

Predicted Vs Actual plot

[Figure: predicted versus actual closing price]

Accuracy score

We have attained a 90% accuracy score on the existing data set.
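
The text does not say how the accuracy score was computed. For a regression problem, one common choice is the coefficient of determination (R²), reported alongside MSE. The sketch below shows both on invented numbers, purely as an assumption about what such a score could be.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r2_score(actual, predicted):
    """Share of the variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Invented actual and predicted closing prices
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [102.0, 108.0, 123.0, 127.0]

mse = mean_squared_error(actual, predicted)
r2 = r2_score(actual, predicted)
print(f"MSE = {mse:.2f}, R^2 = {r2:.3f}")
```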

However, there is still room for improvement. We can always do more feature engineering and add different features. These could come from (1) fundamental analysis, comprising economic factors such as the balance sheet, the income statement, and a company’s assets, liabilities and shareholders’ equity, and (2) technical analysis, which uses various types of indicators to predict where the price is headed.


Neural networks are quite powerful and capable of modeling complex nonlinear relationships to a degree that classical machine learning algorithms struggle with. However, precisely because neural networks can model such complex nonlinear relationships, they are prone to over-fitting, which we should keep in mind when designing them.

We should also perform a lot more hyper-parameter optimization for better performance. The hyper-parameters include the cost function, the type of initialization for the starting weights, the number of epochs, the batch size, and the learning rate.
