Machine Learning for Finance: Intro

Most ML-for-finance tutorials lie to you. They train on random splits, leak future data into features, claim 70% accuracy on stock prediction, and never acknowledge that 70% accuracy would make you richer than Renaissance Technologies. This guide is the honest version.

What ML Can and Cannot Do

Financial time series are close to efficient, heavily non-stationary, and dominated by noise. Out-of-sample directional accuracy above 55% is rare and hard-won. A persistent 52-53% hit rate, net of costs and properly compounded, is enough to build a successful hedge fund on. If your first model claims 70%, you have a bug, usually data leakage.
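To see why a few points of edge matter, here is a back-of-envelope sketch. The assumptions are mine, not market data: one bet per trading day, wins and losses of the same size (1%), 252 trading days a year, no costs. The helper names are made up for illustration.

```python
# Back-of-envelope: what a small directional edge is worth before costs.
# Illustrative assumptions: one bet per day, symmetric 1% wins and losses,
# 252 trading days per year, zero transaction costs.

def expected_daily_edge(hit_rate: float, avg_abs_move: float) -> float:
    """Expected return per bet when wins and losses are both avg_abs_move."""
    return hit_rate * avg_abs_move - (1.0 - hit_rate) * avg_abs_move

def annualised(daily_edge: float, days: int = 252) -> float:
    """Simple (non-compounded) annualisation of a daily expected return."""
    return daily_edge * days

for p in (0.50, 0.53, 0.70):
    edge = expected_daily_edge(p, avg_abs_move=0.01)
    print(f"hit rate {p:.0%}: {edge:.4%} per day, {annualised(edge):.1%} per year")
```

Under these toy assumptions a 53% hit rate is worth about 0.06% per day, roughly 15% a year before costs, while 70% would roughly double your money every year. That second number is why a 70% backtest should make you look for the bug.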

ML is useful for: combining many weak signals into a slightly-better-than-random classifier, modelling non-linear interactions between features, detecting regime changes, and pricing derivatives with complex payoffs. It is mostly useless for: predicting tomorrow's price precisely, finding alpha from standard OHLC data alone, and beating simple rules on liquid large caps.

Setup

import numpy as np
import pandas as pd
import yfinance as yf
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import accuracy_score, classification_report

Feature Engineering

In quant finance, features matter far more than model choice. A random forest on well-engineered features will usually beat a neural network on raw OHLC. Start with returns at multiple horizons, moving average ratios, volatility, volume anomalies, and classic indicators like RSI.

def build_features(df):
    df = df.copy()

    # Returns
    df['ret_1']  = df['Close'].pct_change(1)
    df['ret_5']  = df['Close'].pct_change(5)
    df['ret_20'] = df['Close'].pct_change(20)

    # Moving averages and ratios
    df['sma_20'] = df['Close'].rolling(20).mean()
    df['sma_50'] = df['Close'].rolling(50).mean()
    df['sma_ratio'] = df['sma_20'] / df['sma_50']

    # Volatility (20-day realised)
    df['vol_20'] = df['ret_1'].rolling(20).std()

    # Volume signals
    df['vol_change'] = df['Volume'] / df['Volume'].rolling(20).mean()

    # RSI
    delta = df['Close'].diff()
    gain = delta.clip(lower=0).ewm(alpha=1/14).mean()
    loss = -delta.clip(upper=0).ewm(alpha=1/14).mean()
    df['rsi'] = 100 - (100 / (1 + gain/loss))

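    # TARGET: direction of the next day's return, NOT the magnitude.
    # The last row has no next close; mask it to NaN so dropna() removes it
    # instead of the comparison silently labelling it 0.
    next_close = df['Close'].shift(-1)
    df['target'] = (next_close > df['Close']).astype(float).where(next_close.notna())

    return df.dropna().astype({'target': int})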
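One way to check that features like these only use past data is a truncation test: recompute the features on a shortened history and confirm the earlier rows do not change. The sketch below runs on synthetic prices. The helper names (trailing_features, leaky_features, uses_future_data) are made up for illustration, and the centred window is a deliberately broken example.

```python
import numpy as np
import pandas as pd

def trailing_features(close: pd.Series) -> pd.DataFrame:
    """Features that only look backwards (pandas rolling is trailing by default)."""
    return pd.DataFrame({
        "ret_1": close.pct_change(1),
        "sma_5": close.rolling(5).mean(),
    })

def leaky_features(close: pd.Series) -> pd.DataFrame:
    """Deliberately broken: a centred window peeks at future prices."""
    return pd.DataFrame({
        "sma_5_centred": close.rolling(5, center=True).mean(),
    })

def uses_future_data(feature_fn, close: pd.Series, cutoff: int) -> bool:
    """Truncation test: do the first `cutoff` rows change when later rows are removed?"""
    full = feature_fn(close).iloc[:cutoff].to_numpy()
    truncated = feature_fn(close.iloc[:cutoff]).to_numpy()
    return not np.allclose(full, truncated, equal_nan=True)

# Synthetic random-walk prices, purely for illustration.
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 300))))

print(uses_future_data(trailing_features, close, cutoff=200))  # False
print(uses_future_data(leaky_features, close, cutoff=200))     # True
```

The target column is expected to fail this test, since it is built from the next day's close on purpose. Only the feature columns need to pass.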

Predict Direction, Not Price

Regressing raw prices is almost always wrong. Prices are non-stationary: tomorrow's price correlates strongly with today's, so even a naive predictor that just repeats today's price will show low error and look great. Switch to predicting the direction of tomorrow's return as a binary classification. Accuracy can then be judged against a coin flip at 50% or, more honestly, against the share of up days in the sample, which for equity indices is often a little above 50%.
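The effect is easy to reproduce on data with no predictability at all. This is a minimal sketch on a simulated random walk (the volatility, sample size, and seed are arbitrary choices of mine). It scores a "model" that simply repeats the last observation, first on price levels and then on direction.

```python
import numpy as np

rng = np.random.default_rng(42)
returns = rng.normal(0.0, 0.01, size=2000)     # i.i.d. noise: nothing to predict
prices = 100.0 * np.exp(np.cumsum(returns))    # random-walk price path

# Naive "model": tomorrow's price equals today's price.
actual, predicted = prices[1:], prices[:-1]

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination of predictions against actual values."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

price_r2 = r_squared(actual, predicted)

# Same idea judged on direction: predict tomorrow's sign = today's sign.
direction_hit_rate = float(np.mean(np.sign(returns[1:]) == np.sign(returns[:-1])))

print(f"R^2 on price levels: {price_r2:.4f}")
print(f"Directional hit rate: {direction_hit_rate:.3f}")
```

The price-level R² should come out close to 1 and the hit rate close to 0.5. The fit on levels only reflects the fact that prices move slowly relative to their overall range.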

Training with Time-Series Split

Random splitting is the single biggest mistake in ML-finance tutorials. If you use a random train/test split on time series, the training set contains dates later than many of the test dates, and overlapping rolling windows mean neighbouring rows share most of their inputs. The model effectively sees the future. Use TimeSeriesSplit, which respects temporal order: train on early data, test on later data, roll forward.
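The difference between the two splitters shows up on a toy index. In this sketch the row number stands in for the date, and leaks_future is a hypothetical helper that flags any fold where a training row is dated after the earliest test row. The gap=1 setting is an illustrative choice: it drops one row between train and test so that a label built from the next day's close cannot straddle the boundary.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

n_samples = 12
X = np.arange(n_samples).reshape(-1, 1)   # row index doubles as the "date"

def leaks_future(train_idx: np.ndarray, test_idx: np.ndarray) -> bool:
    """True if any training row is dated after the earliest test row."""
    return bool(train_idx.max() > test_idx.min())

shuffled = KFold(n_splits=3, shuffle=True, random_state=0)
ordered = TimeSeriesSplit(n_splits=3, gap=1)

shuffled_leaks = [leaks_future(tr, te) for tr, te in shuffled.split(X)]
ordered_leaks = [leaks_future(tr, te) for tr, te in ordered.split(X)]

print("Shuffled K-fold leaks per fold:", shuffled_leaks)
print("TimeSeriesSplit leaks per fold:", ordered_leaks)

for train_idx, test_idx in ordered.split(X):
    print("train:", train_idx, "test:", test_idx)
```

With shuffling, at most one fold can avoid training on rows that come after its test rows. TimeSeriesSplit never trains on them.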

def train_and_evaluate(df):
    features = ['ret_1', 'ret_5', 'ret_20', 'sma_ratio', 'vol_20', 'vol_change', 'rsi']
    X = df[features]
    y = df['target']

    # CRITICAL: time-series split, never random split
    tscv = TimeSeriesSplit(n_splits=5)
    accuracies = []

    for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        model = RandomForestClassifier(
            n_estimators=200,
            max_depth=5,
            min_samples_leaf=20,
            random_state=42
        )
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        acc = accuracy_score(y_test, preds)
        accuracies.append(acc)
        print(f"Fold {fold+1} accuracy: {acc:.3f}")

    print(f"Mean accuracy: {np.mean(accuracies):.3f}")
    # Note: this is the model from the final fold only (the one trained on the most data).
    return model

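# Usage
df = yf.download('SPY', start='2010-01-01', end='2024-12-31')
# Some yfinance versions return MultiIndex columns (field, ticker) even for a
# single ticker; flatten them so df['Close'] is a Series.
if isinstance(df.columns, pd.MultiIndex):
    df.columns = df.columns.get_level_values(0)
df = build_features(df)
model = train_and_evaluate(df)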
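Fold accuracies are easier to judge next to a naive baseline. A minimal sketch, using made-up labels rather than real SPY data: the accuracy of always predicting whichever class was more common in the training window. The function name is hypothetical. In practice you would call it inside the fold loop with y_train and y_test.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test) -> float:
    """Accuracy of always predicting the most common class seen in training."""
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)
    majority = int(y_train.mean() >= 0.5)   # labels are 0 (down) or 1 (up)
    return float(np.mean(y_test == majority))

# Hypothetical labels: 55% up days in both the training and test windows.
y_train = np.array([1] * 55 + [0] * 45)
y_test = np.array([1] * 11 + [0] * 9)

print(f"Majority-class baseline: {majority_baseline_accuracy(y_train, y_test):.2f}")
```

If the model's accuracy does not clearly beat this number, it has learned nothing beyond "the market usually goes up".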

Pitfalls to Avoid

Look-ahead bias: any feature computed at time t using data from t+1 or later is leakage. pandas rolling windows are trailing by default (center=False), so they only use data up to t. Any hand-rolled feature needs checking.
Survivorship bias: training only on today's S&P 500 constituents. Companies that went bankrupt or were dropped from the index are absent, which inflates backtested returns. Use point-in-time universe data.
Data snooping: testing 500 hyperparameter combinations on the same holdout set. Eventually one works by chance. Use nested cross-validation or a completely untouched final test set.
Ignoring transaction costs: a strategy with 55% accuracy that trades daily can be net-negative after spread and commission.
Static models: markets change. A model trained on 2010-2020 may fail in 2024. Retrain regularly, monitor out-of-sample drift.
Class imbalance: in sideways markets, up days and down days are roughly balanced. In trending markets, they are not, and a model that always predicts "up" can post above-50% accuracy with no skill. Report precision, recall, and F1, not just accuracy.
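The data-snooping pitfall above is easy to demonstrate with pure noise. In this sketch every number is an arbitrary choice of mine: 500 "configurations" that are just random guesses, scored on 250 days of coin-flip labels. The best one is then re-scored on a fresh set of labels.

```python
import numpy as np

rng = np.random.default_rng(7)
n_days, n_configs = 250, 500

# Labels are pure coin flips, so no configuration can have real skill.
y_holdout = rng.integers(0, 2, size=n_days)

# Each "configuration" is just another set of random guesses.
guesses = rng.integers(0, 2, size=(n_configs, n_days))
accuracies = (guesses == y_holdout).mean(axis=1)

best = float(accuracies.max())
average = float(accuracies.mean())

# Re-score the "winner" on labels it was never selected against.
y_fresh = rng.integers(0, 2, size=n_days)
best_on_fresh = float((guesses[accuracies.argmax()] == y_fresh).mean())

print(f"Average accuracy across configs: {average:.3f}")
print(f"Best accuracy on the reused holdout: {best:.3f}")
print(f"Same config on fresh data: {best_on_fresh:.3f}")
```

The average sits near 50%, but the best of 500 looks several points better, purely from selection. On fresh labels it falls back to around 50%.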

The Realistic Workflow

A real quant ML workflow looks nothing like the sklearn tutorials. It looks like this:

Formulate a hypothesis grounded in economics, not data-dredged.
Build features that encode that hypothesis.
Test on a single simple model (logistic regression, random forest) with time-series CV.
Include transaction costs and slippage in the evaluation.
If there is a signal, stress test under different market regimes.
Paper trade for months before a single dollar of real capital.
Continuously monitor decay - most edges erode.
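The cost step in the list above can be sketched in a few lines. This is a simplified model, not a full backtester: it assumes a proportional cost charged on every change in position, and the 5 basis point figure and the net_returns helper are hypothetical.

```python
import numpy as np

def net_returns(positions, asset_returns, cost_per_trade: float) -> np.ndarray:
    """Strategy returns after charging a proportional cost on every position change.

    positions[t] is the position held over period t (0 = flat, 1 = long),
    decided before asset_returns[t] is known.
    """
    positions = np.asarray(positions, dtype=float)
    asset_returns = np.asarray(asset_returns, dtype=float)
    gross = positions * asset_returns
    # Turnover: size of each position change, including the initial entry.
    turnover = np.abs(np.diff(positions, prepend=0.0))
    return gross - turnover * cost_per_trade

# Toy example: the strategy flips in and out every day and is right each time.
positions = [1, 0, 1, 0]
asset_returns = [0.001, -0.002, 0.001, 0.003]

gross_total = float(np.sum(np.array(positions) * np.array(asset_returns)))
net = net_returns(positions, asset_returns, cost_per_trade=0.0005)  # 5 bps, hypothetical

print(f"Gross return: {gross_total:.4f}")
print(f"Net return:   {net.sum():.4f}")
```

Here the strategy earns 0.2% gross over four days, trades four times at 5 basis points each, and ends up flat. High turnover turns a real edge into nothing.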

This is unglamorous and slow. It is also how serious quants work. The tutorials that promise 5 minutes to a profitable model are selling courses, not returns.