IV Expert • Week 12 • Lesson 34 • Duration: 50 min

OFD Overfitting Detection Methods

Beyond PBO — a toolkit for catching overfit strategies

Learning Objectives

  • Master multiple overfitting detection techniques
  • Learn to use deflated Sharpe ratio
  • Build a full anti-overfitting validation pipeline

Explain Like I'm 5

PBO is a great tool, but it's one tool. You need a whole toolkit to catch overfitting. Deflated Sharpe ratio, minimum backtest length, performance degradation analysis — each catches different types of overfitting. Use all of them. If your strategy passes every test, you can be far more confident it's real.

Think of It This Way

One medical test might miss a disease. That's why doctors order multiple tests — each catches different things. Same with overfitting detection. PBO catches multiple testing bias, deflated Sharpe catches short backtests, degradation analysis catches structural breaks. Use the full battery.

1. Deflated Sharpe Ratio

The regular Sharpe ratio doesn't account for how many strategies you tried before finding this one. The deflated Sharpe ratio adjusts for:

- Number of strategies tested
- Skewness of returns
- Kurtosis of returns
- Backtest length

A strategy with a raw Sharpe of 1.5 might have a deflated Sharpe of 0.3 if you tested 100 strategies to find it. The deflated version is the "honest" Sharpe: what remains after accounting for selection bias.

Rule of thumb: if the deflated Sharpe is below 0, your strategy is likely overfit regardless of how impressive the raw Sharpe looks. You haven't found skill; you've found the luckiest coin in the jar.

Computing the Sharpe ratio on walk-forward OOS data alone already provides a natural deflation, but computing the formal deflated Sharpe adds further confidence.

Reference: Bailey, D.H. & López de Prado, M. (2014). "The Deflated Sharpe Ratio." Journal of Portfolio Management.
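As a concrete illustration, here is a minimal sketch of the calculation in the spirit of the cited paper. It assumes daily returns, and it approximates the variance of the trial Sharpe ratios with the estimation variance of this single series (the paper uses the variance across all N trials, which you should use if you have it). The function name deflated_sharpe_prob is illustrative.

python
import numpy as np
from scipy import stats

def deflated_sharpe_prob(returns, n_trials):
    """Probability that the true Sharpe beats the best-of-n-trials noise level.

    Simplified sketch after Bailey & Lopez de Prado (2014): values near 1.0
    suggest genuine skill; values near 0.5 or below suggest selection bias.
    """
    r = np.asarray(returns, dtype=float)
    T = len(r)
    sr = r.mean() / r.std(ddof=1)              # per-period (daily) Sharpe
    skew = stats.skew(r)
    kurt = stats.kurtosis(r, fisher=False)     # Pearson kurtosis (normal = 3)

    # Expected maximum Sharpe among n_trials zero-skill strategies
    sr_var = (1 + 0.5 * sr**2) / T             # proxy for cross-trial Sharpe variance
    emc = 0.5772156649                         # Euler-Mascheroni constant
    sr0 = np.sqrt(sr_var) * ((1 - emc) * stats.norm.ppf(1 - 1 / n_trials)
                             + emc * stats.norm.ppf(1 - 1 / (n_trials * np.e)))

    # Probabilistic Sharpe ratio evaluated at the sr0 benchmark
    z = (sr - sr0) * np.sqrt(T - 1) / np.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr**2)
    return stats.norm.cdf(z)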

2. Performance Degradation Analysis

One of the simplest and most powerful tests: plot OOS performance over time.

- If OOS performance is stable across walk-forward windows → the model captures persistent patterns. ✅
- If OOS performance degrades over time → the model is losing predictive power. ⚠️
- If OOS performance is wildly variable → the model captures noise, not signal. ❌

Also check whether the model performs similarly across:

- Market regimes (trending vs. ranging)
- Volatility environments (high vs. low)
- Time of year (Q1 vs. Q4)

Consistent performance across conditions is strong evidence of a reliable model. Performance that only works in specific conditions is fragile, and those conditions might not persist.
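A simple way to run this check is to compute the Sharpe ratio per non-overlapping OOS window and inspect the sequence. The sketch below assumes daily returns; the window of 63 trading days (roughly one quarter) and the name window_sharpes are illustrative choices.

python
import numpy as np

def window_sharpes(oos_returns, window=63):
    """Annualized Sharpe per non-overlapping OOS window (~quarterly for daily data).

    A roughly flat sequence suggests a persistent edge; a downward drift
    suggests decay; wild swings suggest the model is fitting noise.
    """
    r = np.asarray(oos_returns, dtype=float)
    sharpes = []
    for start in range(0, len(r) - window + 1, window):
        chunk = r[start:start + window]
        sharpes.append(chunk.mean() / chunk.std(ddof=1) * np.sqrt(252))
    return np.array(sharpes)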

3. Bootstrap Confidence Intervals

Here's another powerful technique: bootstrap confidence intervals on your key metrics.

1. Take your trade results (say 1,000 trades).
2. Randomly resample WITH replacement to create a new set of 1,000 trades.
3. Compute your metric (win rate, Sharpe, etc.) on the resampled data.
4. Repeat steps 2-3 ten thousand times.
5. Sort all 10,000 computed metrics.
6. The 2.5th and 97.5th percentile values are your 95% confidence interval.

If your 95% CI for win rate is [54%, 64%], you can be reasonably confident the true win rate is somewhere in that range. If it's [45%, 73%], you don't know much; the interval is too wide. Narrow CI = reliable estimate. Wide CI = you need more data. This is one of the simplest and most useful statistical tools you'll ever learn.
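The recipe above translates almost line-for-line into code. Below is a minimal sketch using NumPy; the function name bootstrap_ci and the default of 10,000 resamples are assumptions mirroring the steps above, and the metric argument lets you bootstrap win rate, mean P&L, or anything else computed from a 1-D array of trade results.

python
import numpy as np

def bootstrap_ci(trade_results, metric, n_boot=10_000, ci=0.95, seed=0):
    """Percentile-bootstrap confidence interval for a trade-level metric."""
    rng = np.random.default_rng(seed)
    trades = np.asarray(trade_results, dtype=float)
    estimates = np.empty(n_boot)
    for i in range(n_boot):
        # Resample WITH replacement, same size as the original trade set
        sample = rng.choice(trades, size=len(trades), replace=True)
        estimates[i] = metric(sample)
    lower = (1 - ci) / 2 * 100
    return tuple(np.percentile(estimates, [lower, 100 - lower]))

# Example: 95% CI for the win rate of a set of per-trade P&Ls
# lo, hi = bootstrap_ci(trade_pnls, metric=lambda x: float((x > 0).mean()))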

[Figure: Bootstrap Distribution of Win Rate Estimates (10k samples)]

4. Minimum Backtest Length

How long must your backtest be to draw reliable conclusions? Bailey & López de Prado (2014) provide a formula based on the Sharpe ratio and data frequency. Rough minimums for a strategy with Sharpe ≈ 1.5:

- Daily data: ~2 years
- Weekly data: ~3 years
- Monthly data: ~5 years

Short backtests are the number one cause of false confidence. "I tested 6 months and got great results!" means almost nothing. Six months can show excellent results by chance alone; the confidence intervals are simply too wide. A production system using 7+ years of data is comfortably above the minimum at most data frequencies. This is one of the foundations of reliable results.
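A hedged sketch of the underlying idea: choose a backtest long enough that the best of your N noise-only trials is not expected to reach your target annualized Sharpe in-sample. The helper name min_backtest_years is illustrative, and the expression follows the expected-maximum approximation used in the Bailey & López de Prado line of work.

python
import numpy as np
from scipy import stats

def min_backtest_years(n_trials, target_sharpe):
    """Rough minimum backtest length (years) so that the best of n_trials
    zero-skill strategies is not expected to reach target_sharpe in-sample."""
    emc = 0.5772156649  # Euler-Mascheroni constant
    expected_max = ((1 - emc) * stats.norm.ppf(1 - 1 / n_trials)
                    + emc * stats.norm.ppf(1 - 1 / (n_trials * np.e)))
    return (expected_max / target_sharpe) ** 2

# Example: ~2.8 years if you tried 100 strategies and target an annualized Sharpe of 1.5
# print(min_backtest_years(100, 1.5))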

Key Formulas

Deflated Sharpe Ratio

Sharpe ratio adjusted for the number of strategies tried (N_trials). If you tried many strategies, the deflation is large. A raw Sharpe of 2.0 can deflate to 0.5 with enough trials.
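For reference, one common statement of the result from the cited Bailey & López de Prado (2014) paper is shown below in LaTeX notation. Note that the lesson's simplified "raw Sharpe minus expected maximum" is a convenient shorthand; the paper's deflated Sharpe is expressed as a probability.

latex
\widehat{DSR}
  = \Phi\!\left(
      \frac{(\widehat{SR} - SR_0)\,\sqrt{T - 1}}
           {\sqrt{1 - \gamma_3\,\widehat{SR} + \tfrac{\gamma_4 - 1}{4}\,\widehat{SR}^{\,2}}}
    \right),
\qquad
SR_0 = \sqrt{V[\{\widehat{SR}_n\}]}\,
       \Big[(1-\gamma)\,\Phi^{-1}\!\big(1-\tfrac{1}{N}\big)
            + \gamma\,\Phi^{-1}\!\big(1-\tfrac{1}{N e}\big)\Big]

Here Φ is the standard normal CDF, γ_3 and γ_4 are the skewness and kurtosis of returns, T is the number of observations, N is the number of trials, V[{SR_n}] is the variance of the Sharpe ratios across trials, and γ ≈ 0.5772 is the Euler-Mascheroni constant.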

Hands-On Code

Overfitting Detection Suite

python
import numpy as np
from scipy import stats

def overfitting_battery(oos_returns, n_strategies_tested):
    """Full overfitting detection battery (assumes daily OOS returns)."""
    returns = np.asarray(oos_returns, dtype=float)

    # 1. Basic stats: annualized Sharpe ratio
    sr_daily = returns.mean() / returns.std(ddof=1)
    sharpe = sr_daily * np.sqrt(252)
    print("=== OVERFITTING DETECTION ===")
    print(f"Raw Sharpe: {sharpe:.2f}")

    # 2. Deflated Sharpe (simplified): subtract the Sharpe you would expect
    #    from the single best of n_strategies_tested pure-noise strategies
    sr_se = np.sqrt((1 + 0.5 * sr_daily**2) / len(returns))  # std error of daily Sharpe
    expected_max = sr_se * stats.norm.ppf(1 - 1 / n_strategies_tested)
    deflated = (sr_daily - expected_max) * np.sqrt(252)      # annualized
    print(f"Deflated Sharpe: {deflated:.2f} ({n_strategies_tested} tested)")
    print(f"  {'[PASS] Significant' if deflated > 0 else '[FAIL] Likely overfit'}")

    # 3. Performance trend: compare mean return in the first vs. second half
    n = len(returns)
    first_half = returns[:n // 2].mean()
    second_half = returns[n // 2:].mean()
    degradation = (second_half - first_half) / abs(first_half) * 100
    print(f"Performance trend: {degradation:+.1f}%")
    print(f"  {'[PASS] Stable' if abs(degradation) < 20 else '[WARN] Degrading'}")

    # 4. Minimum sample check (rough heuristic: lower Sharpe needs more data)
    min_years = max(2, (1 / sharpe)**2 * 2)
    actual_years = len(returns) / 252
    print(f"Min backtest: {min_years:.1f}y | Actual: {actual_years:.1f}y")
    print(f"  {'[PASS] Sufficient' if actual_years >= min_years else '[FAIL] Too short'}")
Run the full battery before deploying. If ANY test raises a flag, investigate further. Deploy only when all tests pass.
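For example, you might smoke-test the battery on synthetic data before pointing it at real OOS returns. The numbers below are made up for illustration.

python
import numpy as np

# Hypothetical example: 3 years of daily OOS returns with a small positive drift
rng = np.random.default_rng(42)
fake_oos = rng.normal(loc=0.0005, scale=0.01, size=3 * 252)

# Suppose 50 strategy configurations were tested before selecting this one
overfitting_battery(fake_oos, n_strategies_tested=50)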

Knowledge Check

Q1. Your strategy shows Sharpe 2.5 but you tested 200 configurations. The deflated Sharpe is 0.1. What does this mean?

Assignment

Run the full overfitting detection battery on your strategy: PBO, deflated Sharpe, performance degradation analysis, and minimum sample check. Document results and make a deploy/no-deploy decision.