Building Neural Networks for Algorithmic Trading

Why Neural Networks for Trading?

The promise of neural networks in finance isn’t about finding a magic formula — it’s about learning nonlinear representations of market microstructure that traditional linear factor models miss. After spending years iterating on architectures for time-series financial data, I’ve developed a pragmatic view of what works, what doesn’t, and where the real engineering challenges lie.

Financial time series are fundamentally different from the domains where deep learning first thrived. Markets are non-stationary, adversarial, and plagued by a signal-to-noise ratio that makes ImageNet look trivial. The distribution shifts constantly — a model trained on 2020 pandemic volatility will give you garbage predictions during a 2023 low-vol grind. This is the central challenge, and no architecture choice alone solves it.

Feature Engineering: The Real Alpha

Before touching any model code, the feature pipeline is where most of the edge lives. Raw OHLCV data is a starting point, but production systems need to go deeper:

Technical indicators — RSI, MACD, Bollinger Bands, ATR — normalized and z-scored against rolling windows to handle non-stationarity
Orderbook features — bid-ask spread, depth imbalance ratios, trade flow toxicity (VPIN)
Cross-asset signals — BTC dominance, DXY correlation, sector ETF momentum
Temporal encodings — hour-of-day, day-of-week, time-to-funding (for perpetual futures)
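
As a rough illustration of the last item, here is a minimal sketch of temporal encodings. The sine/cosine cyclical encoding, the `temporal_features` helper, and the 8-hour funding schedule anchored at 00:00 are all assumptions made for the example; check the funding times of the venue you actually trade.

```python
import numpy as np
import pandas as pd

def temporal_features(index: pd.DatetimeIndex, funding_interval_hours: int = 8) -> pd.DataFrame:
    """Encode hour-of-day, day-of-week and time-to-funding for each timestamp."""
    hour = index.hour.to_numpy() + index.minute.to_numpy() / 60.0
    dow = index.dayofweek.to_numpy()  # Monday = 0
    # Assumed funding schedule: every `funding_interval_hours`, starting at 00:00
    # in the index's timezone. A bar stamped exactly at a funding time counts
    # the full interval until the next one.
    hours_to_funding = funding_interval_hours - (hour % funding_interval_hours)
    return pd.DataFrame(
        {
            # Sine/cosine pairs keep 23:00 and 00:00 close together
            "hour_sin": np.sin(2 * np.pi * hour / 24.0),
            "hour_cos": np.cos(2 * np.pi * hour / 24.0),
            "dow_sin": np.sin(2 * np.pi * dow / 7.0),
            "dow_cos": np.cos(2 * np.pi * dow / 7.0),
            "hours_to_funding": hours_to_funding,
        },
        index=index,
    )

# Hourly timestamps purely for illustration
idx = pd.date_range("2024-01-01", periods=24, freq=pd.Timedelta(hours=1), tz="UTC")
temporal = temporal_features(idx)
```

The cyclical pairs avoid the artificial jump a raw integer hour would have at midnight; whether that helps a given model is something to confirm with ablation rather than assume.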

The critical mistake most beginners make is feeding raw price data into a model. Price levels are non-stationary in practice: their mean and variance drift over time, so a model fit on one price regime sees out-of-range inputs in the next. Work with returns, log-returns, or z-scored features instead. I typically normalize all inputs against a trailing 252-period rolling window (one trading year when the bars are daily) and clip outliers at ±3σ.
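
A minimal sketch of that normalization, assuming pandas and a hypothetical `rolling_zscore` helper. The prices are synthetic and only there to make the snippet runnable.

```python
import numpy as np
import pandas as pd

def rolling_zscore(series: pd.Series, window: int = 252, clip: float = 3.0) -> pd.Series:
    """Z-score each value against a trailing window, then clip outliers."""
    # Trailing statistics: each window ends at the current bar, so no future data is used.
    mean = series.rolling(window, min_periods=window).mean()
    std = series.rolling(window, min_periods=window).std()
    z = (series - mean) / std.replace(0.0, np.nan)  # a flat window yields NaN, not inf
    return z.clip(lower=-clip, upper=clip)

# Synthetic price path purely for illustration
rng = np.random.default_rng(0)
prices = pd.Series(100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.01, 1000))))
log_returns = np.log(prices).diff()
features = rolling_zscore(log_returns, window=252)
```

The output stays NaN until a full window of returns is available, which is deliberate: a z-score computed from a handful of points is not comparable to one computed from 252.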

Architecture Choices: LSTMs vs Transformers

I’ve run extensive experiments with both LSTMs and Transformer-based architectures. Here’s the honest breakdown:

LSTMs remain surprisingly competitive for single-asset, single-timeframe prediction. They’re cheaper to train, easier to debug, and less prone to overfitting on small datasets. For a model consuming 60-minute look-back windows of ~30 features, a 2-layer LSTM with 128 hidden units is a strong baseline.

Transformers shine when you have multi-asset, multi-timeframe inputs. The self-attention mechanism naturally captures cross-asset dependencies that LSTMs struggle with. I use a modified architecture inspired by the Temporal Fusion Transformer (TFT), which separates static covariates from temporal features and uses variable selection networks to automatically learn feature importance.

Here’s a simplified PyTorch implementation of a hybrid model: an LSTM encoder, a self-attention layer over its hidden states, and a three-class classification head:

import torch
import torch.nn as nn

class TimeSeriesPredictor(nn.Module):
    def __init__(self, input_dim, hidden_dim=128, num_layers=2, dropout=0.3):
        super().__init__()
        self.encoder = nn.LSTM(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0.0  # inter-layer dropout; PyTorch warns if set with a single layer
        )
        self.attention = nn.MultiheadAttention(
            embed_dim=hidden_dim,
            num_heads=4,
            dropout=dropout,
            batch_first=True
        )
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim, 64),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, 3)  # 3 logits: class 0 = short, 1 = flat, 2 = long
        )

    def forward(self, x):
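        # x: (batch, seq_len, input_dim) -> lstm_out: (batch, seq_len, hidden_dim)
        lstm_out, _ = self.encoder(x)
        # Self-attention over the LSTM states (query = key = value)
        attn_out, _ = self.attention(lstm_out, lstm_out, lstm_out)
        out = attn_out[:, -1, :]  # keep only the last timestep's representation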
        return self.fc(out)
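
The head emits three raw logits, not positions. Here is a small sketch of how they might be turned into short/flat/long decisions. The `logits_to_positions` helper, the class ordering, and the confidence threshold are assumptions for illustration; the ordering must match however the training targets were labeled. Note that `nn.CrossEntropyLoss` expects class indices 0, 1, 2, so labels stored as -1/0/1 need a +1 shift before training.

```python
import torch

# Class index -> position. This ordering is an assumption and must match the labels.
POSITIONS = torch.tensor([-1, 0, 1])  # short, flat, long

def logits_to_positions(logits: torch.Tensor, min_confidence: float = 0.0) -> torch.Tensor:
    """Map (batch, 3) logits to positions in {-1, 0, 1}."""
    probs = torch.softmax(logits, dim=-1)
    confidence, class_idx = probs.max(dim=-1)
    positions = POSITIONS[class_idx]
    # Stay flat when the winning class is not confident enough
    return torch.where(confidence >= min_confidence, positions, torch.zeros_like(positions))

# Hand-written logits purely for illustration
logits = torch.tensor([[2.0, 0.1, -1.0], [0.3, 0.0, 0.1], [-1.0, 0.0, 3.0]])
positions = logits_to_positions(logits, min_confidence=0.5)
```

In the second row the short class wins the argmax but with under 50% probability, so the threshold forces a flat position. A filter like this is one simple way to trade less often on weak signals.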

The Overfitting Trap

Financial ML has a brutal overfitting problem. With thousands of potential features and relatively short training histories, the model will happily memorize noise. My defenses:

Purged cross-validation: Standard k-fold is invalid for time series because it lets the model train on data that comes after the test period. I use purged, embargoed walk-forward splits with a gap between train and test sets to prevent lookahead leakage.
Aggressive regularization: Dropout of 0.3-0.5, weight decay, and early stopping monitored on a separate validation set.
Ensemble averaging: Train 5-10 models with different random seeds and average their predictions. In my experience this alone reduces prediction variance by roughly 30%. The gain is capped by how correlated the members’ errors are, which is why it falls well short of the 1/N reduction that fully independent models would give.
Feature ablation: Systematically remove features and measure performance impact. If removing a feature doesn’t hurt, it was likely contributing noise.
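
The walk-forward splitting in the first item can be sketched as below. This is a simplified version under stated assumptions: an expanding training window, a single `gap` parameter standing in for both purge and embargo, and a hypothetical `walk_forward_splits` helper. The gap should be at least as long as the label horizon plus the longest feature look-back that could straddle the boundary.

```python
from typing import Iterator, Tuple

def walk_forward_splits(
    n_samples: int,
    n_splits: int = 5,
    gap: int = 10,
    min_train: int = 100,
) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) pairs where train always precedes test by `gap` samples."""
    test_size = (n_samples - min_train - gap) // n_splits
    if test_size <= 0:
        raise ValueError("not enough samples for the requested splits and gap")
    for k in range(n_splits):
        test_start = min_train + gap + k * test_size
        test_end = test_start + test_size
        train_end = test_start - gap  # samples in [train_end, test_start) are dropped
        yield range(0, train_end), range(test_start, test_end)

splits = list(walk_forward_splits(n_samples=1000, n_splits=5, gap=10, min_train=100))
```

scikit-learn’s `TimeSeriesSplit` accepts a `gap` argument that covers the same basic case. A hand-rolled version is mainly useful when the purge has to depend on per-sample label end times.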

Backtesting: Trust Nothing

The final and most important piece is honest backtesting. Lookahead bias is the silent killer of trading ML — it sneaks in through feature normalization, target labeling, and data alignment. Every feature must be computed using only data available at prediction time. I enforce this with a strict point_in_time wrapper that raises exceptions if any future data is accessed.
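
To make the idea concrete, here is one way such a guard could look. This is a hypothetical sketch, not the wrapper from my own codebase: the `PointInTime` class, the `history` method, and `LookaheadError` are names invented for the example.

```python
import pandas as pd

class LookaheadError(RuntimeError):
    """Raised when code requests data stamped after the current decision time."""

class PointInTime:
    """Hypothetical guard around a time-indexed DataFrame."""

    def __init__(self, frame: pd.DataFrame, as_of: pd.Timestamp):
        if not frame.index.is_monotonic_increasing:
            raise ValueError("frame must be sorted by timestamp")
        self._frame = frame
        self._as_of = as_of

    def history(self, end: pd.Timestamp, periods: int) -> pd.DataFrame:
        """Return the last `periods` rows with timestamps <= end."""
        if end > self._as_of:
            raise LookaheadError(f"requested {end}, but as_of is {self._as_of}")
        return self._frame.loc[:end].tail(periods)

# Toy daily data purely for illustration
idx = pd.date_range("2024-01-01", periods=10, freq="D")
frame = pd.DataFrame({"close": range(10)}, index=idx)
pit = PointInTime(frame, as_of=idx[4])
window = pit.history(idx[4], periods=3)  # allowed: ends at the decision time
```

Asking for `pit.history(idx[5], periods=3)` raises `LookaheadError`. Failing loudly is the point: a silent truncation would hide exactly the bug you are trying to catch.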

Transaction costs, slippage, and market impact must be modeled realistically. A strategy that shows 40% annual returns in backtesting with zero-cost fills can turn negative once a 5 bps per-trade cost is applied, because that cost is paid on every trade and high-turnover crypto strategies trade often.
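
A back-of-the-envelope check shows how quickly costs eat a gross return. This is a first-order approximation under simplifying assumptions: each trade turns over the full account, the cost is a flat number of basis points, and compounding, slippage, and market impact are ignored. The function names are made up for the example.

```python
def net_annual_return(gross_return: float, trades_per_year: int, cost_bps: float) -> float:
    """Gross return minus a flat per-trade cost paid on the full account each trade."""
    return gross_return - trades_per_year * cost_bps / 10_000.0

def breakeven_trades(gross_return: float, cost_bps: float) -> float:
    """Number of full-account trades per year at which costs consume the gross return."""
    return gross_return / (cost_bps / 10_000.0)

# 40% gross return and 5 bps per trade, the figures used in the text
trades_to_zero = breakeven_trades(0.40, 5.0)
net_at_2000_trades = net_annual_return(0.40, 2000, 5.0)
```

Under these assumptions the 40% is gone at about 800 trades a year, a little over two a day in a market that trades 365 days. At 2,000 trades a year the strategy is deep in the red.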

The gap between backtesting and live performance is where most ML trading systems die. The ones that survive are built with paranoid attention to data integrity, honest evaluation, and robust risk management layered on top of the model’s predictions.