III Advanced • Week 18 • Lesson 55 • Duration: 55 min

Cointegration

The real math behind pairs trading — correlation is not enough

Learning Objectives

  • Understand why correlation and cointegration are fundamentally different
  • Learn the Engle-Granger and Johansen cointegration tests
  • Apply cointegration to validate trading pairs rigorously

Explain Like I'm 5

Correlation means two assets move in the same direction. Cointegration means they move TOGETHER with a stable spread. The difference is massive: correlation can be sky-high while the spread drifts forever. Cointegration means the spread keeps reverting for as long as the relationship holds. For pairs trading, cointegration beats correlation. Always.

Think of It This Way

Think of a drunk person walking their dog. The person and dog are correlated (they go in roughly the same direction). But they're also cointegrated — the leash keeps them within a bounded distance. The "spread" (distance between them) mean-reverts because of the leash. In markets, economic relationships are the leash. This is arguably the best analogy in all of quantitative finance.

1. Correlation vs. Cointegration

This is the most important distinction in statistical arbitrage. Let me break it down precisely:

  • Correlation: prices move in the same direction at the same time.
  • Cointegration: there exists a linear combination of the prices that is stationary, even though each price on its own is not.

Why this matters in practice:

Correlated but NOT cointegrated: AAPL and MSFT. They both go up long-term. Correlation is 0.90+. But the spread can drift: Apple could outperform Microsoft by 500% over 10 years. You'd get destroyed pairs trading this.

Cointegrated: EUR/USD and GBP/USD. Both are driven by similar macro factors (USD strength). The spread is mean-reverting: it widens and narrows around a stable mean. THIS is what you want.

Technical definition: $X$ and $Y$ are cointegrated if there exists a $\beta$ such that:

$$X_t - \beta \cdot Y_t = \text{stationary process}$$

Stationary means mean-reverting. Mean-reverting means bounded spread. Bounded spread means tradeable. That's the whole game.
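To make the distinction concrete, here is a minimal simulation sketch. Everything in it is synthetic and assumed for illustration (the drift, the AR(1) coefficient of 0.9, the seed, and the helper name `describe`); it is not market data. Pair 1 shares a drift but nothing else, while pair 2 is tied together by a stationary "leash".

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Pair 1 (correlated, NOT cointegrated): two random walks with the same
# upward drift but independent shocks. Nothing ties one to the other.
drift = 0.2
a = np.cumsum(drift + rng.normal(0.0, 1.0, n))
b = np.cumsum(drift + rng.normal(0.0, 1.0, n))

# Pair 2 (cointegrated): y is a random walk and x follows it on a "leash",
# modelled here as stationary AR(1) noise around y.
y = np.cumsum(rng.normal(0.0, 1.0, n))
shocks = rng.normal(0.0, 1.0, n)
leash = np.zeros(n)
for t in range(1, n):
    leash[t] = 0.9 * leash[t - 1] + shocks[t]
x = y + leash


def describe(p, q):
    """Return (correlation of price levels, std of the spread p - q)."""
    corr = np.corrcoef(p, q)[0, 1]
    spread_std = np.std(p - q)
    return corr, spread_std


corr_1, spread_std_1 = describe(a, b)
corr_2, spread_std_2 = describe(x, y)
print(f"Pair 1: corr={corr_1:.2f}, spread std={spread_std_1:.1f}")
print(f"Pair 2: corr={corr_2:.2f}, spread std={spread_std_2:.1f}")
```

Both pairs should show high correlation in levels, but only the second has a spread that stays in a narrow band. The first pair's spread is itself a random walk, so it wanders without limit as the sample grows.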

Correlation vs Cointegration: The Critical Difference

2. Testing for Cointegration

Two main methods for testing whether two assets are genuinely cointegrated:

Engle-Granger two-step method:

  1. Regress $X$ on $Y$: $X_t = \alpha + \beta Y_t + \varepsilon_t$
  2. Test the residuals $\varepsilon_t$ for stationarity (ADF test)
  3. If the residuals are stationary, the pair is cointegrated

Critical caveat: because the residuals come from an estimated regression, the critical values are different from standard ADF (use MacKinnon values). Most people mess this up by using regular ADF critical values.

Johansen test:

  • Tests for cointegration among multiple time series simultaneously
  • More powerful than Engle-Granger when you have more than 2 variables
  • Returns the number of cointegrating relationships

For production stat arb systems:

  • Use Engle-Granger for pair selection (simple, fast, sufficient)
  • Require p-value < 0.05 AND a stable hedge ratio
  • Re-test cointegration monthly (relationships can and do break)

Pro tip: even if a pair passes the cointegration test, check the half-life of mean reversion. If the half-life exceeds 30 days, the trade takes too long to converge for short-timeframe strategies.

3. Half-Life: The Speed of Mean Reversion

Cointegration tells you IF the spread reverts. Half-life tells you HOW FAST. Half-life = the time it takes for the spread to revert halfway to its mean. Shorter is better.

How to compute it:

  1. Take the spread residuals $\varepsilon_t$ from the Engle-Granger regression
  2. Regress $\Delta\varepsilon_t$ on $\varepsilon_{t-1}$: $\Delta\varepsilon_t = \theta \cdot \varepsilon_{t-1} + \text{noise}$
  3. Half-life $= -\ln(2) / \ln(1 + \theta)$

What's a good half-life?

  • < 10 bars: Excellent, fast reversion (the ideal scenario)
  • 10-30 bars: Good, tradeable on most timeframes
  • 30-60 bars: Acceptable, only for longer holding periods
  • > 60 bars: Probably too slow for active trading

The chart below shows how different half-lives affect convergence speed. A half-life of 5 snaps back almost immediately, while 50 takes far longer. You want to be on the left side of this chart.
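The three steps above can be sketched with plain NumPy. The function names `estimate_theta` and `half_life_from_theta`, the simulated AR(1) spread, and the inclusion of an intercept in the regression are assumptions for illustration. The intercept should be close to zero when the spread is already a regression residual.

```python
import numpy as np


def estimate_theta(spread):
    """Regress the change in the spread on its lagged level; return the slope."""
    spread = np.asarray(spread, dtype=float)
    lagged = spread[:-1]
    delta = np.diff(spread)
    # Design matrix with an intercept column (assumed; near zero for residuals).
    design = np.column_stack([np.ones_like(lagged), lagged])
    coef, *_ = np.linalg.lstsq(design, delta, rcond=None)
    return coef[1]


def half_life_from_theta(theta):
    """Half-life = -ln(2) / ln(1 + theta); only defined for -1 < theta < 0."""
    if not -1 < theta < 0:
        return float("inf")  # no mean reversion to measure
    return -np.log(2) / np.log(1 + theta)


# Simulated spread: AR(1) with coefficient 0.9, so the true theta is -0.1.
rng = np.random.default_rng(7)
n = 5000
shocks = rng.normal(0.0, 1.0, n)
spread = np.zeros(n)
for t in range(1, n):
    spread[t] = 0.9 * spread[t - 1] + shocks[t]

theta_hat = estimate_theta(spread)
print(f"Estimated theta: {theta_hat:.3f}")
print(f"Estimated half-life: {half_life_from_theta(theta_hat):.1f} bars")
print(f"Theoretical half-life: {half_life_from_theta(-0.1):.2f} bars")
```

A true theta of -0.1 corresponds to a half-life of roughly 6.6 bars, which would land in the "excellent" bucket above. The estimate from a finite sample will scatter around that value.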

Mean Reversion Speed by Half-Life

4. Common Cointegration Mistakes

Five mistakes I see repeatedly with cointegration:

  1. Confusing correlation with cointegration. Correlation = 0.95 does NOT mean cointegrated. I cannot stress this enough. Test it properly.
  2. Using too short a lookback. Testing cointegration on 50 bars is meaningless. You need at least 250 observations for the test to have any statistical power.
  3. Ignoring structural breaks. A pair that was cointegrated for 5 years can break. Central bank policy changes, regulatory shifts, or market structure changes can permanently alter relationships.
  4. Not re-testing. Cointegration isn't permanent. Re-test monthly at minimum. If the p-value starts creeping up toward 0.10, start reducing exposure.
  5. Data snooping. Testing 100 pairs and finding 5 that pass is NOT the same as finding 5 economically motivated pairs that happen to pass. The economic relationship should come FIRST; the test should CONFIRM it.

Number 5 is the most insidious. You can always find statistical artifacts in enough data. The question is whether the relationship makes economic sense.
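The arithmetic behind mistake 5 is worth seeing once. The sketch below assumes the tests are independent and that none of the pairs is truly cointegrated; the function names are hypothetical. The Bonferroni threshold is one common correction added here for illustration and is not part of the lesson's own workflow.

```python
def expected_false_positives(n_pairs, alpha=0.05):
    """Expected number of pairs that pass by chance alone."""
    return n_pairs * alpha


def prob_at_least_one_false_positive(n_pairs, alpha=0.05):
    """Chance that at least one unrelated pair passes (independent tests assumed)."""
    return 1 - (1 - alpha) ** n_pairs


def bonferroni_threshold(n_pairs, alpha=0.05):
    """Per-test p-value cutoff that keeps the overall error rate near alpha."""
    return alpha / n_pairs


n_pairs = 100
print(f"Expected chance passes: {expected_false_positives(n_pairs):.1f}")
print(f"P(at least one chance pass): {prob_at_least_one_false_positive(n_pairs):.3f}")
print(f"Bonferroni per-test cutoff: {bonferroni_threshold(n_pairs):.4f}")
```

Scanning 100 unrelated pairs at the 5% level is therefore expected to produce about five "cointegrated" pairs from noise alone, which is exactly the scenario described in mistake 5.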

Key Formulas

Engle-Granger Regression

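$$X_t = \alpha + \beta Y_t + \varepsilon_t$$

Step 1: regress X on Y. Step 2: test the residuals for stationarity. If the ADF test (judged against MacKinnon critical values) rejects the unit root, the residuals are stationary and X and Y are cointegrated.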
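The two steps can be written out by hand to show where the critical-value caveat bites. This is a simplified sketch: it uses a plain Dickey-Fuller regression with no lag augmentation, synthetic data, and approximate asymptotic 5% critical values (about -2.86 for a standard ADF test with a constant, about -3.34 for Engle-Granger residuals with two series). The function name `engle_granger_stat` is hypothetical. For real work, a library routine such as `coint` in statsmodels handles lags and p-values.

```python
import numpy as np

# Approximate asymptotic 5% critical values (constant, no trend).
ADF_CRIT_5PCT = -2.86            # standard ADF on an observed series
ENGLE_GRANGER_CRIT_5PCT = -3.34  # MacKinnon value for residuals, two series


def engle_granger_stat(x, y):
    """Return (beta, t_stat) from a bare-bones Engle-Granger procedure."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Step 1: OLS of x on a constant and y.
    design = np.column_stack([np.ones_like(y), y])
    (alpha, beta), *_ = np.linalg.lstsq(design, x, rcond=None)
    resid = x - alpha - beta * y

    # Step 2: Dickey-Fuller regression on the residuals (no lags, no
    # constant, since OLS residuals already have zero mean).
    lagged = resid[:-1]
    delta = np.diff(resid)
    theta = (lagged @ delta) / (lagged @ lagged)
    errors = delta - theta * lagged
    sigma2 = (errors @ errors) / (len(delta) - 1)
    std_err = np.sqrt(sigma2 / (lagged @ lagged))
    return beta, theta / std_err


# Synthetic cointegrated pair: x = 2 + 1.5 * y + stationary AR(1) noise.
rng = np.random.default_rng(1)
n = 1000
y = np.cumsum(rng.normal(0.0, 1.0, n))
shocks = rng.normal(0.0, 1.0, n)
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.8 * noise[t - 1] + shocks[t]
x = 2.0 + 1.5 * y + noise

beta_hat, t_stat = engle_granger_stat(x, y)
print(f"Hedge ratio estimate: {beta_hat:.3f}")
print(f"DF t-statistic on residuals: {t_stat:.2f}")
print("Cointegrated at 5%:", t_stat < ENGLE_GRANGER_CRIT_5PCT)
```

A statistic that falls between -3.34 and -2.86 would pass against the ordinary ADF threshold but fail against the correct one, which is the error the lesson warns about.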

Half-Life of Mean Reversion

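$$\text{Half-life} = -\frac{\ln(2)}{\ln(1 + \theta)}$$

Here $\theta$ is the coefficient from regressing $\Delta\varepsilon_t$ on $\varepsilon_{t-1}$. The formula applies when $-1 < \theta < 0$. Shorter half-life means faster mean reversion and more tradeable pairs.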
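For readers who want to see where the half-life formula comes from, here is a short derivation. It assumes the noise has zero mean and that $-1 < \theta < 0$, so the spread decays toward zero in expectation.

```latex
\begin{aligned}
\Delta\varepsilon_t = \theta\,\varepsilon_{t-1} + \text{noise}
  &\;\Longrightarrow\; \varepsilon_t = (1+\theta)\,\varepsilon_{t-1} + \text{noise} \\
\mathbb{E}\left[\varepsilon_{t+h} \mid \varepsilon_t\right]
  &= (1+\theta)^{h}\,\varepsilon_t \\
(1+\theta)^{h} = \tfrac{1}{2}
  &\;\Longrightarrow\; h = \frac{\ln(1/2)}{\ln(1+\theta)} = -\frac{\ln(2)}{\ln(1+\theta)}
\end{aligned}
```

As a worked example, $\theta = -0.1$ gives $h = -\ln(2)/\ln(0.9) \approx 6.58$ bars.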

Hands-On Code

Cointegration Testing

python
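import numpy as np
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.stattools import coint

def test_cointegration(price_a, price_b, names=('A', 'B')):
    """Test for cointegration between two price series."""
    # Convert to log-price arrays up front so pandas Series inputs also work
    # and the test runs on the same series the spread is built from.
    log_a = np.log(np.asarray(price_a, dtype=float))
    log_b = np.log(np.asarray(price_b, dtype=float))

    # coint() runs the Engle-Granger test with MacKinnon critical values.
    score, p_value, _ = coint(log_a, log_b)
    is_cointegrated = p_value < 0.05

    print(f"=== COINTEGRATION: {names[0]} & {names[1]} ===")
    print(f"Engle-Granger test stat: {score:.4f}")
    print(f"p-value: {p_value:.4f}")
    print(f"  {'[PASS] COINTEGRATED' if is_cointegrated else '[FAIL] NOT cointegrated'}")

    if is_cointegrated:
        # Hedge ratio: slope from regressing log A on log B.
        model = LinearRegression()
        model.fit(log_b.reshape(-1, 1), log_a)
        beta = model.coef_[0]
        print(f"  Hedge ratio: {beta:.4f}")

        # Half-life: regress the change in the spread on its lagged level.
        spread = log_a - beta * log_b
        spread_lag = spread[:-1]
        spread_diff = np.diff(spread)

        theta_model = LinearRegression()
        theta_model.fit(spread_lag.reshape(-1, 1), spread_diff)
        theta = theta_model.coef_[0]

        if -1 < theta < 0:
            half_life = -np.log(2) / np.log(1 + theta)
            print(f"  Half-life: {half_life:.1f} periods")
            # 60 bars matches the "too slow" cutoff used earlier in the lesson.
            if half_life < 60:
                print("  [PASS] Tradeable half-life")
            else:
                print("  [WARN] Slow mean reversion")
        else:
            # The half-life formula is undefined outside -1 < theta < 0.
            print("  [WARN] No mean-reverting half-life (theta outside (-1, 0))")

    return is_cointegrated, p_value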

Tests for cointegration between two price series using the Engle-Granger method, then computes the hedge ratio and half-life to assess tradeability.

Knowledge Check

Q1. Two assets have a correlation of 0.95 but fail the cointegration test. Should you pairs trade them?

Assignment

Test cointegration for 5-10 instrument pairs in your trading universe. Compute half-lives. Identify the best 2-3 pairs for trading. Verify that cointegration holds in a walk-forward framework.