Intermediate • Week 26 • Lesson 73 • Duration: 45 min

Data Pipelines

The unglamorous plumbing that makes or breaks your entire trading system

Learning Objectives

  • Understand data pipeline architecture for trading systems
  • Learn data quality, validation, and error handling
  • Build a reliable data ingestion pipeline

Explain Like I'm 5

This is the least exciting topic in quantitative finance, and also the most important. Garbage in, garbage out — literally. If your data is wrong, your model is wrong, your trades are wrong, and your money is gone. Data pipelines are the plumbing: getting data from sources (broker API, data vendors), cleaning it (filling gaps, removing bad ticks), storing it (databases), and serving it to your model. Nobody posts about their data pipeline on social media, but every production system that actually works has a solid one.

Think of It This Way

A data pipeline is the plumbing in a house. Nobody thinks about plumbing until they're standing in sewage water. But without clean water (data) flowing reliably to every faucet (model), the whole house (trading system) is useless. The best quants I know? They're secretly plumbing enthusiasts.

1. Pipeline Architecture

Here's what a production data flow actually looks like (a minimal end-to-end sketch follows this list):

1. Ingestion: your broker connector fetches OHLCV data via API.
   • get_rates(symbol, timeframe, bars) -> DataFrame
   • Real-time tick data for live signals
   • This sounds simple, but connection drops, rate limits, and API quirks will humble you fast.
2. Validation: check data quality before using it.
   • No missing bars? (gaps in the timeframe mean incorrect features)
   • Reasonable values? (no negative prices, no 50% candle wicks)
   • Consistent timestamps? (timezone issues are very real)
3. Feature engineering: compute all input features from raw OHLCV.
   • Momentum indicators: RSI, MACD, ADX
   • Volatility measures: ATR, Bollinger width
   • Regime indicators: Hurst exponent, autocorrelation
   • This step is where most bugs hide.
4. Storage: cache processed data to avoid recomputing.
   • Raw data goes to historical storage
   • Processed features get cached
   • Live data goes to a hot store
5. Serving: deliver features to your model for prediction.
   • Real-time: the newest bar's features are fed to the model
   • Batch: historical features for backtesting

Every single step needs error handling, logging, and validation. Skip any one and you WILL get burned.
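Here is a minimal sketch of how those five stages can be wired together in one pass. The connector object and its get_rates method are placeholders for whatever your broker library provides, and the feature columns are illustrative, not a recommended feature set.

python
import pandas as pd

def run_pipeline(connector, symbol: str, timeframe: str, bars: int) -> pd.DataFrame:
    """Illustrative single pass: ingest -> validate -> features -> cache -> serve."""
    # 1. Ingestion: fetch raw OHLCV from the broker connector (placeholder API)
    raw = connector.get_rates(symbol, timeframe, bars)

    # 2. Validation: refuse to continue on clearly broken data; fail loudly
    if raw.empty or raw[['open', 'high', 'low', 'close']].le(0).any().any():
        raise ValueError(f"Bad or empty data for {symbol} {timeframe}")

    # 3. Feature engineering: compute inputs from the raw OHLCV only
    features = pd.DataFrame(index=raw.index)
    features['return_1'] = raw['close'].pct_change()
    features['range_14'] = (raw['high'] - raw['low']).rolling(14).mean()  # crude volatility proxy

    # 4. Storage: cache processed features so they aren't recomputed every cycle
    features.to_csv(f"{symbol}_{timeframe}_features.csv")

    # 5. Serving: hand the feature frame to the model layer
    return features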

2. Data Quality Issues

These problems WILL happen to you. Not if, but when:

  • Missing bars: The market was closed (holidays), the broker had an outage, or your connection dropped. Solution: detect gaps, then forward-fill or skip. Never pretend they don't exist.
  • Bad ticks: A broker sends a price that's 10% off because of a glitchy quote. One bad tick can wreck your entire feature set. Solution: filter ticks beyond 5 standard deviations from the rolling mean.
  • Stale data: The connection dropped but your system still shows the last price as if everything is fine. Solution: check timestamp freshness. If data is older than expected, flag it immediately.
  • Timezone chaos: Your broker uses EET, your server is UTC, your logs are in local time. Misalignment means wrong features and garbage predictions. Solution: convert EVERYTHING to UTC internally. No exceptions.
  • Survivorship bias: Instruments get delisted or renamed, and your historical data pretends they never existed. Solution: use point-in-time data when available.
  • Look-ahead bias: Accidentally using future data in backtests. THE most common backtest error and the most dangerous: your backtest looks amazing while being completely fake. Solution: strict temporal ordering. Never use data from time t+1 for decisions at time t (see the sketch after this list).
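A minimal sketch of the temporal-ordering rule, assuming OHLCV columns named as in the examples above; the two feature columns are placeholders, and the point is only the shift pattern.

python
import pandas as pd

def build_signal_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Align features so the decision at bar t uses only information up to bar t-1."""
    features = pd.DataFrame(index=df.index)
    features['return_1'] = df['close'].pct_change()
    features['bar_range'] = df['high'] - df['low']

    # Shift features forward by one bar: the row at time t now holds values
    # computed from bar t-1, which removes look-ahead bias at decision time.
    features = features.shift(1)

    # Target: the NEXT bar's return, i.e. what the model is trying to predict.
    target = df['close'].pct_change().shift(-1)

    return features.assign(target=target).dropna()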

3. The Paper Trading Reality Check

There's a pattern that happens to virtually every new quant: the backtest looks incredible, you move to paper trading, and suddenly performance drops. Why?

  • Latency: In a backtest you execute at exact prices. Live or on paper, there's a delay between signal and fill. Even 100ms matters on volatile instruments.
  • Slippage: You wanted to buy at 1.0850? You got filled at 1.0852. Multiply that by 600 trades a year and it adds up significantly (a rough estimate of the annual cost follows this list).
  • Data differences: Your backtest used clean historical data. Live data has gaps, spikes, and connection drops that your backtest never encountered.
  • Market impact: Your order slightly moves the market (less relevant for retail size, but still real for illiquid instruments).

A 10-20% performance degradation from backtest to paper is NORMAL. If your paper results are substantially worse than that, you probably have a bug, not bad luck. The gap tells you how realistic your backtest assumptions actually were.
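To make the slippage point concrete, here is a back-of-the-envelope sketch. The 2-pip slippage, 600 trades per year, and one-standard-lot position size are illustrative assumptions taken from the example above, not recommendations.

python
# Rough annual cost of slippage, using the figures from the text as assumptions.
slippage_per_trade = 0.0002      # 2 pips on EURUSD-style pricing (1.0850 -> 1.0852)
trades_per_year = 600            # illustrative trade count
position_size_units = 100_000    # one standard lot, as an example

annual_slippage_cost = slippage_per_trade * trades_per_year * position_size_units
print(f"Estimated annual slippage cost: ${annual_slippage_cost:,.0f}")
# Roughly $12,000 per year on a single standard lot: enough to erase a thin edge.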

[Chart: Typical Performance Gap: Backtest vs Paper vs Live]

4. Latency and Where It Hides

Latency is the silent killer of live trading systems. Here's where it hides:

  • Data fetch: 50-500ms depending on the API and connection quality. Production systems cache aggressively and use websockets instead of polling.
  • Feature computation: 10-100ms for a typical feature set. Vectorized numpy is your friend; Python loops are your enemy.
  • Model inference: 1-50ms depending on model complexity. XGBoost is blazing fast. Deep learning models are slower but still adequate for M15 timeframes.
  • Order submission: 50-200ms to your broker. This is network latency; there's not much you can do except colocate (overkill for M15 trading).
  • Total round trip: Realistically 200-800ms for a full cycle. On M15 bars you have 15 minutes, so this is entirely manageable. For tick-by-tick strategies, every millisecond counts.

The key insight: measure your actual latencies. Don't guess. Log timestamps at each step and you'll know exactly where your bottleneck is; a minimal timing sketch follows.
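A minimal sketch of per-stage timing. The fetch_data, compute_features, and predict names in the usage comments are placeholders for your own callables; only the timing-and-logging pattern is the point.

python
import time
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline.latency")

def timed(stage_name, func, *args, **kwargs):
    """Run one pipeline stage and log its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    log.info("%s took %.1f ms", stage_name, elapsed_ms)
    return result

# Usage (fetch_data, compute_features, predict are your own functions):
# raw = timed("data_fetch", fetch_data, "EURUSD", "M15", 500)
# features = timed("feature_computation", compute_features, raw)
# signal = timed("model_inference", predict, features)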

[Chart: Latency Distribution by Component (ms)]

5. Handling Vendor Lock-in

Here's something nobody tells beginners: your first data source won't be your last. Brokers change APIs, data vendors raise prices, exchanges deprecate endpoints. The fix is an abstraction layer. Wrap your data source behind a clean interface (a sketch in code follows this list):

  • DataProvider.get_bars(symbol, timeframe, count) -> DataFrame
  • DataProvider.get_tick(symbol) -> Tick
  • DataProvider.is_connected() -> bool

Now you can swap MT5 for Interactive Brokers, or replace a CSV reader with a database, and your pipeline doesn't care. The rest of your system talks to the interface, not the implementation. This feels like over-engineering when you start. But the first time you need to switch brokers and it takes 1 hour instead of 2 weeks, you'll thank past-you.
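One way to express that interface in Python, using the method names from the list above (get_tick omitted for brevity). The CSVProvider class and its file layout are hypothetical, included only to show how a concrete source plugs in behind the interface.

python
from abc import ABC, abstractmethod
import pandas as pd

class DataProvider(ABC):
    """Interface the rest of the system depends on; implementations are swappable."""

    @abstractmethod
    def get_bars(self, symbol: str, timeframe: str, count: int) -> pd.DataFrame: ...

    @abstractmethod
    def is_connected(self) -> bool: ...

class CSVProvider(DataProvider):
    """Example implementation backed by local CSV files (hypothetical layout)."""

    def __init__(self, data_dir: str):
        self.data_dir = data_dir

    def get_bars(self, symbol: str, timeframe: str, count: int) -> pd.DataFrame:
        path = f"{self.data_dir}/{symbol}_{timeframe}.csv"
        df = pd.read_csv(path, index_col=0, parse_dates=True)
        return df.tail(count)

    def is_connected(self) -> bool:
        return True  # local files are always "connected"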

6. Common Pipeline Disasters

Every production quant has a horror story. Learn from theirs:

  • Disaster 1: The infinite loop. Your pipeline retries on error. The error is permanent (the API key expired). The pipeline retries 10,000 times per second, gets rate-limited, gets banned. Solution: exponential backoff plus a maximum retry count (sketched after this list).
  • Disaster 2: The silent failure. The pipeline returns cached data because the live connection is down. Your model trades on stale prices for 3 hours before anyone notices. Solution: timestamp checks on every data read.
  • Disaster 3: The schema change. Your data vendor changes column names or the date format. Your parser breaks, features go to NaN, and the model predicts garbage. Solution: validate the schema at ingestion and fail loudly.
  • Disaster 4: The timezone catastrophe. Daylight saving time changes, your pipeline is off by 1 hour, features are computed on the wrong bars, signals shift by one bar, and trades are wrong for 12 hours before you notice. Solution: always use UTC. Never trust local time.

The common thread: fail LOUDLY, not silently. A crash is better than wrong data. You can fix a crash immediately; wrong data can go undetected for weeks.
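A minimal sketch of exponential backoff with a hard retry limit, as described in Disaster 1. The fetch callable it wraps is a placeholder for your own data call, and the delays and retry count are illustrative defaults.

python
import time
import random

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Retry a flaky data fetch with exponential backoff and a hard retry limit."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception as exc:
            if attempt == max_retries - 1:
                # Permanent failure: stop retrying and fail loudly.
                raise RuntimeError(f"Data fetch failed after {max_retries} attempts") from exc
            # Exponential backoff with jitter: ~1s, 2s, 4s, 8s, ... plus noise.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)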

Hands-On Code

Data Pipeline with Validation

python
import numpy as np
import pandas as pd

class DataPipeline:
    """Simple data pipeline with validation."""
    
    def __init__(self, max_gap_bars=5, spike_threshold=5.0):
        self.max_gap_bars = max_gap_bars
        self.spike_threshold = spike_threshold
    
    def validate(self, df: pd.DataFrame) -> dict:
        """Validate OHLCV data quality."""
        issues = []
        
        # Check for missing bars (infer the bar frequency if the index doesn't carry one)
        if isinstance(df.index, pd.DatetimeIndex):
            freq = df.index.freq or (pd.infer_freq(df.index) if len(df) >= 3 else None)
            if freq is not None:
                expected = pd.date_range(df.index[0], df.index[-1], freq=freq)
                missing = len(expected) - len(df)
                if missing > self.max_gap_bars:
                    issues.append(f"[ALERT] {missing} missing bars (limit: {self.max_gap_bars})")
                elif missing > 0:
                    issues.append(f"[WARN] {missing} missing bars detected")
        
        # Check for negative prices
        if (df[['open', 'high', 'low', 'close']] <= 0).any().any():
            issues.append("[ALERT] Negative or zero prices found")
        
        # Check OHLC consistency
        if (df['high'] < df['low']).any():
            issues.append("[ALERT] High < Low detected")
        
        # Check for spikes
        returns = df['close'].pct_change()
        spike_mask = returns.abs() > self.spike_threshold * returns.std()
        if spike_mask.any():
            issues.append(f"[WARN] {spike_mask.sum()} price spikes detected")
        
        # Check volume
        if (df['volume'] == 0).mean() > 0.1:
            issues.append(f"[WARN] {(df['volume']==0).mean():.0%} zero-volume bars")
        
        status = "[PASS] CLEAN" if not issues else "[WARN] ISSUES FOUND"
        print(f"Data validation: {status}")
        for issue in issues:
            print(f"  {issue}")
        
        return {'clean': len(issues) == 0, 'issues': issues}

Validates data BEFORE feeding it to models. Catching data errors early prevents garbage predictions and bad trades. This is not optional — skip this and the consequences are predictable.
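A quick usage sketch against synthetic bars, just to show how the class above is called; the random-walk data is obviously illustrative.

python
import numpy as np
import pandas as pd

# Build a small synthetic OHLCV frame (random walk) to exercise the validator.
idx = pd.date_range("2024-01-01", periods=500, freq="15min")
close = 1.08 + np.random.randn(500).cumsum() * 0.0005
df = pd.DataFrame({
    'open': close, 'high': close + 0.0003, 'low': close - 0.0003,
    'close': close, 'volume': np.random.randint(100, 1000, 500),
}, index=idx)

pipeline = DataPipeline(max_gap_bars=5, spike_threshold=5.0)
report = pipeline.validate(df)
print(report['clean'])   # True if no issues were flagged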

Knowledge Check

Q1. Your backtest shows amazing results, but it uses close prices to compute indicators and generate signals at that same bar's close. What's the problem?

Assignment

Build a data validation pipeline that checks for: missing bars, price spikes, OHLC consistency, and look-ahead bias. Run it on your historical data and fix any issues found.