Deep Learning for Financial Prediction

Deep learning is the glamorous cousin of classical ML. For finance, it is mostly over-hyped, occasionally useful, and almost always harder to deploy profitably than its advocates admit. This tutorial builds an LSTM for SPY direction prediction and tells you honestly why it probably will not make you rich.

Why LSTMs?

LSTMs (Long Short-Term Memory networks) are designed for sequential data. They maintain a hidden state across time steps and can, in theory, learn long-range dependencies. For markets, this sounds perfect - recent history should predict near-future direction.

In practice, LSTMs on financial data tend to overfit aggressively, learn spurious patterns in training data, and fail out-of-sample. Transformers have largely replaced LSTMs in language modelling but have not shown consistent gains in finance. The issue is not the architecture - it is the signal-to-noise ratio.

Install

pip install tensorflow pandas numpy scikit-learn yfinance

Feature Engineering

Do not feed raw prices. Use stationary features: returns at multiple horizons, volatility, normalised moving average deviations, RSI. The network cannot learn patterns from non-stationary inputs without first reinventing the very features you should have given it.

import numpy as np
import pandas as pd
import yfinance as yf
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

def build_features(df):
    df = df.copy()
    df['ret_1']   = df['Close'].pct_change()
    df['ret_5']   = df['Close'].pct_change(5)
    df['vol_20']  = df['ret_1'].rolling(20).std()
    df['sma_20']  = df['Close'].rolling(20).mean() / df['Close'] - 1
    df['rsi']     = compute_rsi(df['Close'])

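    # Target: next-day direction. The final row has no next close, so keep it
    # as NaN and let dropna() remove it instead of silently labelling it 0.
    next_ret = df['Close'].pct_change().shift(-1)
    df['target'] = (next_ret > 0).astype(float).where(next_ret.notna())
    df = df.dropna()
    df['target'] = df['target'].astype(int)
    return df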

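def compute_rsi(series, period=14):
    """Relative Strength Index with Wilder-style exponential smoothing."""
    delta = series.diff()
    # adjust=False gives the recursive (Wilder) form of the running average
    gain = delta.clip(lower=0).ewm(alpha=1/period, adjust=False).mean()
    loss = -delta.clip(upper=0).ewm(alpha=1/period, adjust=False).mean()
    # A zero average loss gives gain/loss = inf and an RSI of 100
    return 100 - (100 / (1 + gain/loss))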

Create Sequences

LSTMs expect input shaped as (samples, timesteps, features). We build sliding windows of 30 days of features. Each sequence predicts the direction of the day after the window ends. Critically, the train/test split is by date - never shuffle time series.

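def make_sequences(X, y, lookback=30):
    """Convert feature matrix into (samples, timesteps, features) tensor.

    Each window covers rows i-lookback+1..i inclusive and is paired with y[i],
    the direction from close i to close i+1, so the most recent day's
    features are inside the window.
    """
    Xs, ys = [], []
    for i in range(lookback - 1, len(X)):
        Xs.append(X[i - lookback + 1:i + 1])
        ys.append(y[i])
    return np.array(Xs), np.array(ys)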

FEATURES = ['ret_1', 'ret_5', 'vol_20', 'sma_20', 'rsi']
LOOKBACK = 30

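df = yf.download('SPY', start='2010-01-01', end='2024-12-31', auto_adjust=True)
# Recent yfinance versions return MultiIndex columns (field, ticker) even for
# a single ticker; flatten them so df['Close'] is a Series.
if isinstance(df.columns, pd.MultiIndex):
    df.columns = df.columns.get_level_values(0)
df = build_features(df)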

# Train/test split by date, NOT random
split_date = '2022-01-01'
train_df = df[df.index < split_date]
test_df  = df[df.index >= split_date]

scaler = StandardScaler()
X_train_raw = scaler.fit_transform(train_df[FEATURES])
X_test_raw  = scaler.transform(test_df[FEATURES])

X_train, y_train = make_sequences(X_train_raw, train_df['target'].values, LOOKBACK)
X_test,  y_test  = make_sequences(X_test_raw,  test_df['target'].values,  LOOKBACK)
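
A quick way to convince yourself the windows are aligned as described is to run the windowing on a toy array where every row is just its own index. The sketch below uses a hypothetical helper, `toy_windows`, written only for this illustration (it is not part of the pipeline above). It assumes the intended convention: a window's last row is day t, and its label is the direction from day t to day t+1.

```python
import numpy as np

def toy_windows(X, y, lookback):
    """Hypothetical helper: window ends at row t (inclusive), label is y[t]."""
    Xs, ys = [], []
    for t in range(lookback - 1, len(X)):
        Xs.append(X[t - lookback + 1:t + 1])
        ys.append(y[t])
    return np.array(Xs), np.array(ys)

# Toy data: 6 days, 1 feature whose value is the day index.
X = np.arange(6, dtype=float).reshape(-1, 1)
y = np.arange(6) * 10          # label of day t encoded as 10*t, easy to trace

Xw, yw = toy_windows(X, y, lookback=3)
print(Xw.shape)                # (4, 3, 1): samples, timesteps, features
print(Xw[0].ravel(), yw[0])    # first window holds days 0..2 and carries day 2's label
```

If the last value in a window and the label index disagree, the model is either ignoring the latest day or peeking at a future one.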

Build and Train

Keep the network small. Two LSTM layers with modest units, heavy dropout, and early stopping. Financial data does not have enough signal to support large models - the bigger the model, the more confidently it memorises noise.
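
To put "small" in numbers, the parameter count can be worked out by hand. The sketch below assumes the layer sizes used in this tutorial's model (LSTM layers of 32 and 16 units, dense layers of 8 and 1, five input features) and the standard LSTM layout with one bias vector per gate. `lstm_params` and `dense_params` are helper names made up for this illustration, and the 12 x 252 training-window figure is a rough assumption, not a measured count.

```python
def lstm_params(units, input_dim):
    """Trainable parameters in a standard LSTM layer with biases:
    four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    """Weights plus biases of a fully connected layer."""
    return units * input_dim + units

n_features = 5
total = (
    lstm_params(32, n_features)   # first LSTM layer
    + lstm_params(16, 32)         # second LSTM layer, fed by the first
    + dense_params(8, 16)
    + dense_params(1, 8)
)
approx_train_windows = 12 * 252   # assumption: ~12 years of daily bars
print(total, approx_train_windows)
```

That is roughly 8,000 trainable parameters against about 3,000 training windows, so even this modest network has more parameters than samples. Dropout and early stopping are doing real work here.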

def build_lstm(lookback, n_features):
    model = Sequential([
        tf.keras.Input(shape=(lookback, n_features)),
        LSTM(32, return_sequences=True),
        Dropout(0.3),
        LSTM(16),
        Dropout(0.3),
        Dense(8, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    return model

model = build_lstm(LOOKBACK, len(FEATURES))
model.summary()

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

history = model.fit(
    X_train, y_train,
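    # Keras takes the validation set from the LAST 20% of samples (before
    # shuffling), so it sits chronologically after the training portion.
    validation_split=0.2,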
    epochs=100,
    batch_size=32,
    callbacks=[early_stop],
    verbose=1
)

# Evaluate on truly held-out data
loss, acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {acc:.3f}")
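
A single date split yields one out-of-sample number, and one number is easy to be fooled by. One common alternative is walk-forward (expanding-window) evaluation: train on all data up to a cutoff, test on the next block, then move the cutoff forward. The sketch below only generates index ranges. `walk_forward_splits` and its parameters are hypothetical names for illustration, and the fold count and 50% initial training share are arbitrary assumptions.

```python
def walk_forward_splits(n_samples, n_folds=4, min_train=0.5):
    """Yield (train_idx, test_idx) ranges for an expanding-window evaluation.

    The first `min_train` fraction is always training data; the remainder is
    cut into `n_folds` consecutive test blocks. Each fold trains on everything
    before its test block, so no fold ever sees the future.
    """
    first_test = int(n_samples * min_train)
    fold_size = (n_samples - first_test) // n_folds
    for k in range(n_folds):
        start = first_test + k * fold_size
        # The last fold absorbs any remainder rows.
        stop = n_samples if k == n_folds - 1 else start + fold_size
        yield range(0, start), range(start, stop)

for train_idx, test_idx in walk_forward_splits(1000, n_folds=4):
    print(len(train_idx), test_idx.start, test_idx.stop)
```

Retraining the model once per fold and reporting the spread of test accuracies gives a more honest picture than a single figure.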

What You Will Actually See

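On SPY daily direction, expect test accuracy between 50% and 54%. On rare runs you will see 55-57%, which will feel like magic; it almost always fails to replicate on a different time period. Training accuracy will often reach 60-70%, and that gap between training and test accuracy is the classic signature of overfitting. Judge any test figure against the majority-class baseline (the share of up days in the test period), not against 50%.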
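
How far above chance is a given accuracy? A rough answer comes from the normal approximation to the binomial: with n independent test days, the standard error of a hit rate near 50% is about sqrt(0.25 / n). The sketch below is a back-of-envelope check, not a formal test. `accuracy_z_score` is a hypothetical helper, and the test-set size of 700 is an assumption standing in for roughly three years of daily windows.

```python
import math

def accuracy_z_score(accuracy, n_samples, baseline=0.5):
    """z-score of an observed accuracy against a baseline hit rate,
    using the normal approximation to the binomial."""
    std_err = math.sqrt(baseline * (1 - baseline) / n_samples)
    return (accuracy - baseline) / std_err

n_test = 700   # assumption: roughly three years of daily test windows
for acc in (0.52, 0.54, 0.57):
    print(f"{acc:.2f}: z = {accuracy_z_score(acc, n_test):.2f}")
```

With around 700 test days, 52% sits about one standard error above a coin flip and 54% about two. Picking the best of several training runs inflates these figures further, which is one reason a lucky 55-57% rarely survives a new time period.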

If you see 65% test accuracy, assume a bug. Check for: look-ahead bias in features, training data leaking into test, scaler fit on full dataset, or target accidentally including future information.
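
One mechanical check for look-ahead bias is truncation: compute a feature on the full history, compute it again on history cut off at some date, and confirm the values up to the cut are identical. A feature that changes when future rows are removed was using them. The sketch below runs on synthetic random-walk prices so it needs no market data. `has_lookahead` is a hypothetical helper, and the two lambda features are illustrative stand-ins rather than this tutorial's feature set.

```python
import numpy as np
import pandas as pd

def has_lookahead(feature_fn, prices, cut):
    """Return True if any feature value at or before row `cut` changes when
    the rows after `cut` are removed. A causal feature must not change."""
    full = feature_fn(prices).iloc[:cut + 1]
    truncated = feature_fn(prices.iloc[:cut + 1])
    return not np.allclose(full.to_numpy(), truncated.to_numpy(), equal_nan=True)

# Synthetic random-walk prices; no market data is needed for the check.
rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 300))))

causal = lambda p: p.pct_change(5)          # uses only past closes
leaky = lambda p: p.shift(-1) / p - 1       # uses tomorrow's close

print(has_lookahead(causal, prices, cut=200))  # False
print(has_lookahead(leaky, prices, cut=200))   # True
```

Apply the check to feature columns only. The target is supposed to look one day ahead, so it will fail by design.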

Why It Is Hard

Low signal-to-noise: daily returns are dominated by noise. LSTMs learn the noise as readily as the signal.
Non-stationarity: the data-generating process changes over time. A model trained on 2010-2020 markets may be worthless in 2024. Regularisation helps, regime detection helps more.
Feedback loops: if your model works, others find similar edges and trade them away. Profitable ML signals decay.
Small data: 10 years of daily data is roughly 2,500 samples (about 252 trading days a year). Deep learning thrives on millions. For daily frequency, you are always data-starved. Intraday data helps but introduces new problems (microstructure, costs, survivorship).
Evaluation is hard: accuracy is not the right metric. Economic value depends on magnitude, position sizing, and transaction costs. A model that is right 52% of the time on big moves can beat one that is right 56% on small ones.
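
The point about accuracy versus economic value can be made with simple arithmetic. For a symmetric bet that wins or loses a fixed move size, the expected net return per trade is (2 x hit rate - 1) x move size, minus costs. The sketch below uses made-up numbers purely for illustration: the move sizes, the hit rates, and the 5 basis point cost are all assumptions, and `expected_net_return` is a hypothetical helper.

```python
def expected_net_return(hit_rate, move_size, cost_per_trade):
    """Expected net return per trade for a symmetric bet: win `move_size`
    with probability `hit_rate`, lose it otherwise, and pay costs either way."""
    gross = hit_rate * move_size - (1 - hit_rate) * move_size
    return gross - cost_per_trade

COST = 0.0005  # assumption: 5 basis points round trip per trade

# Illustrative numbers only, not estimates from data.
big_moves = expected_net_return(hit_rate=0.52, move_size=0.020, cost_per_trade=COST)
small_moves = expected_net_return(hit_rate=0.56, move_size=0.003, cost_per_trade=COST)

print(f"52% on 2.0% moves: {big_moves * 1e4:+.1f} bp per trade")
print(f"56% on 0.3% moves: {small_moves * 1e4:+.1f} bp per trade")
```

Under these assumptions the 52% model earns about 3 basis points per trade while the 56% model loses money after costs, despite its higher accuracy.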

Where Deep Learning Actually Helps

Honest wins for deep learning in finance tend to be in alternative data (sentiment analysis on news and social media), options pricing for exotic instruments, microstructure modelling on tick data, and anomaly detection for fraud or operational risk. Pure price prediction from OHLC data is the hardest place to apply deep learning and the least likely to pay off.

Honest Limitations

If you take away one thing: a 52% directional accuracy with disciplined position sizing and low costs is a real edge, and a rare one. A 60% claim with no code, no out-of-sample test, and no transaction costs is either a mistake or a lie. Be more sceptical of your own results than of anyone else's.