Money never sleeps, pal.

Sep 30, 2020


PROJECT DETAILS

This project is about stock trading, specifically the S&P 500 and some high-demand stocks such as Apple, Amazon, and Google. The script shows the path I took to get the most out of the dataset visually, then explores a Ridge regression model that predicted Apple’s next-day stock price with a score (R²) of about 0.93 on the test set. I then switched to an LSTM model in Keras and applied it to the S&P 500 to show that an index can be modeled as well. I hope you enjoy it! Check out the code on my GitHub or share it with your network on LinkedIn.

Math (Skip if you only want to see the fun stuff)

Ridge Regression is a way to fit a model when the number of predictor variables exceeds the number of observations, or when the dataset has multicollinearity 🔔 (correlations between predictor variables). Ridge Regression uses L2 regularization, which adds a penalty to the loss. The penalty is equal to the sum of the squared coefficient magnitudes, scaled by a tuning parameter. Penalty = Losing Money

So it is a perfect fit!
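To make the penalty concrete, here is a minimal NumPy sketch of the ridge objective, min over w of ||y - Xw||² + alpha * ||w||², solved in closed form. The toy data, the `ridge_closed_form` helper, and the alpha values are all assumptions made up for illustration; it is not the model trained later in this post.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min_w ||y - Xw||^2 + alpha * ||w||^2 (no intercept term)."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Toy data with two nearly identical (multicollinear) predictors
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)  # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=50)

w_ols = ridge_closed_form(X, y, alpha=0.0)    # no penalty (ordinary least squares)
w_ridge = ridge_closed_form(X, y, alpha=1.0)  # with the L2 penalty

print("alpha=0:", w_ols)
print("alpha=1:", w_ridge)
```

With the penalty switched on, the coefficient vector is pulled toward zero, which is what keeps the fit stable when predictors are strongly correlated.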

LSTM & Time Series

LSTM is a type of recurrent neural network (RNN) that is trained using backpropagation through time and is designed to mitigate the vanishing gradient problem.

Long Short-Term Memory (LSTM) builds on the basic Recurrent Neural Network (RNN) and is used in deep learning because its gated architecture makes it easier to capture patterns in sequential data aka STOCKS
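For the curious, a single LSTM time step can be sketched in plain NumPy. This is only an illustration of the standard gate equations. The `lstm_step` helper, the gate ordering, the random weights, and the three toy inputs are assumptions for this sketch, not what Keras does internally line for line.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*units, n_inputs), U: (4*units, units), b: (4*units,)
    Gate order (an assumption of this sketch): input, forget, candidate, output.
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])   # input gate: how much new info to write
    f = sigmoid(z[1 * units:2 * units])   # forget gate: how much old memory to keep
    g = np.tanh(z[2 * units:3 * units])   # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])   # output gate: how much memory to expose
    c_t = f * c_prev + i * g              # updated cell state (the "memory")
    h_t = o * np.tanh(c_t)                # updated hidden state (the output)
    return h_t, c_t

rng = np.random.default_rng(0)
units, n_inputs = 4, 1
W = rng.normal(scale=0.1, size=(4 * units, n_inputs))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
for price in [0.10, 0.12, 0.11]:          # made-up scaled prices
    h, c = lstm_step(np.array([price]), h, c, W, U, b)

print(h)
```

The additive update of the cell state (`f * c_prev + i * g`) is the part that lets gradients flow across many time steps more easily than in a plain RNN.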

🍻

Cheers! Now let's begin!!

IMPORT DATASETS AND LIBRARIES

# Data Manipulation
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.express as px

# Modeling
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from tensorflow import keras

# Additional
from copy import copy
from scipy import stats

# Stock prices data
stocks_df = pd.read_csv('/Users/andrewdarmond/Documents/FinanceML/stock.csv')

# Stocks volume data
stocks_vol_df = pd.read_csv('/Users/andrewdarmond/Documents/FinanceML/stock_volume.csv')

# Sort the data based on Date
stocks_df = stocks_df.sort_values('Date')

# Sort the volume data based on Date
stocks_vol_df = stocks_vol_df.sort_values('Date')

PERFORM EXPLORATORY DATA ANALYSIS AND VISUALIZATION

# Function to normalize stock prices based on their initial price
def normalize(df):
  x = df.copy()
  for i in x.columns[1:]:
    # Use .iloc[0] (first row by position), since sorting can reorder index labels
    x[i] = x[i] / x[i].iloc[0]
  return x

# Function to plot interactive plots using Plotly Express
def interactive_plot(df, title):
  fig = px.line(title = title)
  for i in df.columns[1:]:
    fig.add_scatter(x = df['Date'], y = df[i], name =i)
  fig.show()

# Plot interactive chart for stocks data
#interactive_plot(stocks_df, 'Stock Prices')

[Figure: Stock Prices]

#interactive_plot(normalize(stocks_df), 'Normalized Stock Prices')

[Figure: Normalized Stock Prices]

#interactive_plot(stocks_vol_df, 'Stocks Volume')

[Figure: Stocks Volume]

#interactive_plot(normalize(stocks_vol_df), 'Normalized Stock Volume')

[Figure: Normalized Stock Volume]
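To see what dividing by the initial price does, here is a tiny worked example. The tickers `AAA` and `BBB` and their prices are made up for illustration. Every column starts at 1.0, so stocks with very different price levels can share one chart.

```python
import pandas as pd

# Hypothetical prices for two made-up tickers
toy = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-03", "2020-01-06"],
    "AAA": [50.0, 55.0, 60.0],
    "BBB": [200.0, 190.0, 220.0],
})

# Divide every price column by its first value
normalized = toy.copy()
for col in normalized.columns[1:]:
    normalized[col] = normalized[col] / normalized[col].iloc[0]

print(normalized)
```

A value of 1.2 then reads as "up 20% since the first day", regardless of the stock's dollar price.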

PREPARE THE DATA BEFORE TRAINING THE MODEL

# Function to concatenate the date, stock price, and volume in one dataframe
def individual_stock(price_df, vol_df, name):
  return pd.DataFrame({'Date': price_df['Date'], 'Close': price_df[name], 'Volume': vol_df[name]})

# Function to return the input/output (target) data for the model
# Our goal is to predict the future stock price, so the target for
# each day is the closing price n trading days later (here, the next day)
def trading_window(data):
  n = 1
  data = data.copy()  # avoid modifying the caller's dataframe
  data['Target'] = data['Close'].shift(-n)
  return data
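Here is a small sketch of what the shift does, using four made-up closing prices. Shifting by -1 pulls tomorrow's close onto today's row, and the last row has no tomorrow, so its target is NaN.

```python
import pandas as pd

# Made-up closing prices, for illustration only
toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0]})

# shift(-1) moves each value up one row: today's Target is tomorrow's Close
toy["Target"] = toy["Close"].shift(-1)
print(toy)

# The final row has no next-day price, so it is dropped before training
supervised = toy.dropna()
print(len(supervised))
```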

If you want to view the S&P 500 / AMZN / etc., change ‘AAPL’ here!

# Let's test the functions and get individual stock prices and volumes for AAPL
price_volume_df = individual_stock(stocks_df, stocks_vol_df, 'AAPL')
price_volume_df

	Date	Close	Volume
0	2012-01-12	60.198570	53146800
1	2012-01-13	59.972858	56505400
2	2012-01-17	60.671429	60724300
3	2012-01-18	61.301430	69197800
4	2012-01-19	61.107143	65434600
...	...	...	...
2154	2020-08-05	440.250000	30498000
2155	2020-08-06	455.609985	50607200
2156	2020-08-07	444.450012	49453300
2157	2020-08-10	450.910004	53100900
2158	2020-08-11	437.500000	46871100

2159 rows × 3 columns

price_volume_target_df = trading_window(price_volume_df)
price_volume_target_df

	Date	Close	Volume	Target
0	2012-01-12	60.198570	53146800	59.972858
1	2012-01-13	59.972858	56505400	60.671429
2	2012-01-17	60.671429	60724300	61.301430
3	2012-01-18	61.301430	69197800	61.107143
4	2012-01-19	61.107143	65434600	60.042858
...	...	...	...	...
2154	2020-08-05	440.250000	30498000	455.609985
2155	2020-08-06	455.609985	50607200	444.450012
2156	2020-08-07	444.450012	49453300	450.910004
2157	2020-08-10	450.910004	53100900	437.500000
2158	2020-08-11	437.500000	46871100	NaN

2159 rows × 4 columns

# Remove the last row as it will be a null value
price_volume_target_df = price_volume_target_df[:-1]
price_volume_target_df

	Date	Close	Volume	Target
0	2012-01-12	60.198570	53146800	59.972858
1	2012-01-13	59.972858	56505400	60.671429
2	2012-01-17	60.671429	60724300	61.301430
3	2012-01-18	61.301430	69197800	61.107143
4	2012-01-19	61.107143	65434600	60.042858
...	...	...	...	...
2153	2020-08-04	438.660004	43267900	440.250000
2154	2020-08-05	440.250000	30498000	455.609985
2155	2020-08-06	455.609985	50607200	444.450012
2156	2020-08-07	444.450012	49453300	450.910004
2157	2020-08-10	450.910004	53100900	437.500000

2158 rows × 4 columns

# Scale the data
sc = MinMaxScaler(feature_range = (0,1))
price_volume_target_scaled_df = sc.fit_transform(price_volume_target_df.drop(columns = ['Date']))

# Create Feature and Target
X = price_volume_target_scaled_df[:, :2]
y = price_volume_target_scaled_df[:, 2:]

price_volume_target_scaled_df.shape

(2158, 3)

X.shape, y.shape

((2158, 2), (2158, 1))

We split the data this way because order is important in time series.

Note that we did not use train_test_split with its default settings, since it shuffles the data.

split = int(0.75 * len(X))
X_train = X[:split]
y_train = y[:split]
X_test = X[split:]
y_test = y[split:]

X_train.shape, y_train.shape

((1618, 2), (1618, 1))

X_test.shape, y_test.shape

((540, 2), (540, 1))
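One thing worth knowing: the scaling step above fits MinMaxScaler on the whole dataset, so the test rows influence the min and max. A common alternative is to learn the scaling from the training rows only. Below is a minimal NumPy sketch of that idea. The helper names and the six toy prices are assumptions for illustration, not a drop-in replacement for the code in this post.

```python
import numpy as np

def fit_min_max(train):
    """Learn the column-wise min and max from the training rows only."""
    return train.min(axis=0), train.max(axis=0)

def transform(data, lo, hi):
    """Apply min-max scaling with previously learned bounds."""
    return (data - lo) / (hi - lo)

def inverse_transform(scaled, lo, hi):
    """Map scaled values back to the original units (e.g. dollars)."""
    return scaled * (hi - lo) + lo

# Made-up prices that trend upward, like a growing stock
prices = np.array([[10.0], [12.0], [11.0], [14.0], [18.0], [20.0]])

split = int(0.75 * len(prices))            # chronological split, no shuffling
train, test = prices[:split], prices[split:]

lo, hi = fit_min_max(train)                # bounds come from the training data
train_scaled = transform(train, lo, hi)
test_scaled = transform(test, lo, hi)      # may fall outside [0, 1]

print(train_scaled.ravel())
print(test_scaled.ravel())
print(inverse_transform(test_scaled, lo, hi).ravel())
```

Test values above 1 are expected here, since later prices exceed anything seen during training. The inverse transform is also how scaled predictions can be turned back into prices.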

# Print a banner for the stock being plotted
print(''' 
                                    APPLE 
''')

# Define a data plotting function
def show_plot(data, title):
    plt.figure(figsize = (13, 5))
    plt.plot(data, linewidth = 3)
    plt.title(title)
    plt.xlabel(xlabel= 'Trading Day (row index)')
    plt.ylabel(ylabel= 'Scaled Value (0 to 1)')
    plt.grid()

show_plot(X_train, 'Training Data')
show_plot(X_test, 'Testing Data')

                                            APPLE

[Figure: Training Data and Testing Data]

BUILD AND TRAIN A RIDGE LINEAR REGRESSION MODEL

regression_model = Ridge()

# Fit the model on the training data
regression_model.fit(X_train, y_train)

# Score the model on the test data (score() returns R², the coefficient of determination)
lr_accuracy = regression_model.score(X_test, y_test)
print('Ridge Regression Score:', lr_accuracy)

Ridge Regression Score: 0.9311227075637692
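A quick note on that number: for scikit-learn regressors, `score` returns R² (the coefficient of determination) rather than a classification-style accuracy. Here is a small sketch of how R² is computed. The `r2_score_manual` helper and the toy values are made up for illustration.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score_manual(y_true, y_true)        # predictions match exactly
mean_only = r2_score_manual(y_true, [2.5] * 4)   # always predict the mean

print(perfect, mean_only)
```

So a score near 1 means the predictions explain most of the variance in the target, while a score of 0 is no better than always guessing the average.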

# Predict over the full dataset (training and test rows) and flatten to 1D
predicted_prices = regression_model.predict(X)
predicted = predicted_prices[:, 0]

# Scaled close values (first column of the scaled array)
close = price_volume_target_scaled_df[:, 0]

# Create a dataframe based on the dates in the individual stock data
# (.copy() avoids pandas' SettingWithCopyWarning when adding columns)
df_predicted = price_volume_target_df[['Date']].copy()

# Add the close values to the dataframe
df_predicted['Close'] = close

# Add the predicted values to the dataframe
df_predicted['Prediction'] = predicted
df_predicted

	Date	Close	Prediction
0	2012-01-12	0.011026	0.026286
1	2012-01-13	0.010462	0.025428
2	2012-01-17	0.012209	0.026527
3	2012-01-18	0.013785	0.027022
4	2012-01-19	0.013299	0.026992
...	...	...	...
2153	2020-08-04	0.957606	0.866550
2154	2020-08-05	0.961583	0.871436
2155	2020-08-06	1.000000	0.903353
2156	2020-08-07	0.972088	0.878730
2157	2020-08-10	0.988245	0.892666

2158 rows × 3 columns

# Plot the results
#interactive_plot(df_predicted, 'Original Vs. Predictions: Apple Stock(AAPL)')

[Figure: Original Vs. Predictions: Apple Stock(AAPL)]

TRAIN AN LSTM TIME SERIES MODEL

If you want to view AAPL / AMZN / etc., change ‘sp500’ here!

# Let's test the functions and get individual stock prices and volumes for sp500
price_volume_df = individual_stock(stocks_df, stocks_vol_df, 'sp500')

# Get the close and volume data as training data (Input)
training_data = price_volume_df.iloc[:, 1:3].values

# Normalize the data
sc = MinMaxScaler(feature_range= (0,1))
training_set_scaled = sc.fit_transform(training_data)

# Create the inputs and targets: each input is the previous day's scaled close,
# and the target is the current day's scaled close
X = []
y = []   
for i in range(1, len(price_volume_df)):
  X.append(training_set_scaled[i-1:i, 0])
  y.append(training_set_scaled[i, 0])

# Convert the data into array format
X = np.array(X)
y = np.array(y)

# Split the data
split = int(0.7 * len(X))
X_train = X[:split]
y_train = y[:split]
X_test = X[split:]
y_test = y[split:]

# Reshape the 2D arrays to 3D (samples, timesteps, features) to feed into the model
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
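The loop above uses a window of one day, so each sample has a single timestep. As a sketch of how the same idea generalizes, the hypothetical `make_windows` helper below builds samples with any lookback length and returns them in the (samples, timesteps, features) shape an LSTM layer expects. The toy series is made up for illustration.

```python
import numpy as np

def make_windows(series, lookback):
    """Turn a 1D series into (X, y) pairs: `lookback` past values -> next value."""
    X, y = [], []
    for i in range(lookback, len(series)):
        X.append(series[i - lookback:i])   # the previous `lookback` values
        y.append(series[i])                # the value to predict
    X = np.array(X).reshape(-1, lookback, 1)  # (samples, timesteps, features)
    return X, np.array(y)

series = np.array([0.1, 0.2, 0.3, 0.4, 0.5])  # made-up scaled closes

X1, y1 = make_windows(series, lookback=1)  # same setup as in this post
X3, y3 = make_windows(series, lookback=3)  # three days of history per sample

print(X1.shape, y1.shape)
print(X3.shape, y3.shape)
```

A longer lookback gives the LSTM an actual sequence to work with, at the cost of a few samples at the start of the series.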

# Create the model
inputs = keras.layers.Input(shape=(X_train.shape[1], X_train.shape[2]))
x = keras.layers.LSTM(150, return_sequences= True)(inputs)
x = keras.layers.Dropout(0.3)(x)
x = keras.layers.LSTM(150, return_sequences=True)(x)
x = keras.layers.Dropout(0.3)(x)
x = keras.layers.LSTM(150)(x)
outputs = keras.layers.Dense(1, activation='linear')(x)

model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss="mse")
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 1, 1)]            0         
_________________________________________________________________
lstm (LSTM)                  (None, 1, 150)            91200     
_________________________________________________________________
dropout (Dropout)            (None, 1, 150)            0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 1, 150)            180600    
_________________________________________________________________
dropout_1 (Dropout)          (None, 1, 150)            0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 150)               180600    
_________________________________________________________________
dense (Dense)                (None, 1)                 151       
=================================================================
Total params: 452,551
Trainable params: 452,551
Non-trainable params: 0
_________________________________________________________________
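If you are wondering where the parameter counts in the summary come from, they can be checked by hand. An LSTM layer has four gates, each with input weights, recurrent weights, and a bias. The helper functions below are made up for this check and are not part of Keras.

```python
def lstm_params(units, input_dim):
    """4 gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    """One weight per input per unit, plus one bias per unit."""
    return units * input_dim + units

first = lstm_params(150, 1)      # first LSTM sees 1 feature per timestep
second = lstm_params(150, 150)   # second LSTM sees the 150 outputs of the first
third = lstm_params(150, 150)    # third LSTM sees the 150 outputs of the second
dense = dense_params(1, 150)     # final Dense layer maps 150 values to 1

total = first + second + third + dense
print(first, second, third, dense, total)  # 91200 180600 180600 151 452551
```

The Dropout layers add no parameters, which is why they show 0 in the summary.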

# Train the model
history = model.fit(X_train, y_train, epochs= 20, batch_size= 32, validation_split= 0.2)

Epoch 1/20
38/38 [==============================] - 7s 55ms/step - loss: 0.0539 - val_loss: 0.0653
Epoch 2/20
38/38 [==============================] - 0s 8ms/step - loss: 0.0095 - val_loss: 0.0055
Epoch 3/20
38/38 [==============================] - 0s 11ms/step - loss: 0.0012 - val_loss: 5.9507e-04
Epoch 4/20
38/38 [==============================] - 0s 10ms/step - loss: 3.8284e-04 - val_loss: 2.3346e-04
Epoch 5/20
38/38 [==============================] - 0s 9ms/step - loss: 3.5239e-04 - val_loss: 8.1833e-05
Epoch 6/20
38/38 [==============================] - 0s 10ms/step - loss: 3.5036e-04 - val_loss: 6.2046e-05
Epoch 7/20
38/38 [==============================] - 0s 8ms/step - loss: 3.0313e-04 - val_loss: 4.0566e-05
Epoch 8/20
38/38 [==============================] - 0s 9ms/step - loss: 2.8564e-04 - val_loss: 6.2951e-05
Epoch 9/20
38/38 [==============================] - 0s 9ms/step - loss: 3.1342e-04 - val_loss: 5.7098e-05
Epoch 10/20
26/38 [===================>..........] - ETA: 0s - loss: 2.9808e-04

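# Make predictions over the full series (training and test rows);
# reshape X to 3D (samples, timesteps, features) to match the model input
predicted = model.predict(np.reshape(X, (X.shape[0], X.shape[1], 1)))

# Flatten the (n, 1) predictions to 1D
test_predicted = predicted[:, 0]

# Dates start at row 1 because the first row has no previous-day input
# (.copy() avoids pandas' SettingWithCopyWarning when adding columns)
df_predicted = price_volume_df[1:][['Date']].copy()

df_predicted['predictions'] = test_predicted

# Scaled close values (first column), aligned with the dates above
close = training_set_scaled[:, 0]

df_predicted['Close'] = close[1:]
df_predicted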

# Plot the results
#interactive_plot(df_predicted, 'Original Vs Predictions: SP500')

[Figure: Original Vs Predictions: SP500]
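Since the predictions above cover the whole series, including the days the model trained on, it can be useful to measure the error on the held-out portion alone. Below is a minimal sketch, assuming the same 70/30 chronological split. The `rmse` helper and the toy arrays are made up for illustration and are not results from this project.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up scaled closes and predictions, for illustration only
actual = np.array([0.10, 0.12, 0.11, 0.15, 0.18, 0.17, 0.20, 0.22, 0.21, 0.25])
errors = np.array([0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.01, -0.01, 0.01])
predicted = actual + errors

split = int(0.7 * len(actual))   # same chronological split as training
test_rmse = rmse(actual[split:], predicted[split:])

print(split, test_rmse)
```

Reporting the test-only error keeps the evaluation honest, because a model can look nearly perfect on data it has already seen.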

Andrew D'Armond

Leveraging data science to achieve results