OFD Overfitting Detection Methods
Beyond PBO — a toolkit for catching overfit strategies
Learning Objectives
- Master multiple overfitting detection techniques
- Learn to use the deflated Sharpe ratio
- Build a full anti-overfitting validation pipeline
Explain Like I'm 5
PBO is a great tool, but it's one tool. You need a whole toolkit to catch overfitting. Deflated Sharpe ratio, minimum backtest length, performance degradation analysis — each catches different types of overfitting. Use all of them. If your strategy passes every test, you can be far more confident it's real.
Think of It This Way
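One medical test might miss a disease. That's why doctors order multiple tests: each catches different things. Same with overfitting detection. PBO catches selection bias across the configurations you compared, deflated Sharpe corrects for the number of trials and the sample length, minimum backtest length flags samples that are too short, and degradation analysis catches performance decay and structural breaks. Use the full battery.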
1. Deflated Sharpe Ratio
2. Performance Degradation Analysis
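A single first-half versus second-half comparison is sensitive to where the split falls. One alternative is to compute the Sharpe ratio over several consecutive blocks and fit a line through them. The sketch below assumes daily returns; the name `sharpe_trend`, the block count, and the linear fit are illustrative choices rather than a prescribed method, and the data is synthetic.

```python
import numpy as np
from scipy import stats

def sharpe_trend(returns, n_blocks=6, periods_per_year=252):
    """Annualized Sharpe per consecutive block, plus the slope of a linear fit."""
    r = np.asarray(returns, dtype=float)
    blocks = np.array_split(r, n_blocks)  # consecutive, roughly equal-sized blocks
    sharpes = np.array([
        b.mean() / b.std(ddof=1) * np.sqrt(periods_per_year) for b in blocks
    ])
    # Slope is the change in annualized Sharpe per block; negative means decay
    fit = stats.linregress(np.arange(n_blocks), sharpes)
    return sharpes, fit.slope, fit.pvalue

# Synthetic example (assumption): an edge that fades linearly over three years
rng = np.random.default_rng(0)
edge = np.linspace(0.002, 0.0, 756)
returns = edge + rng.normal(0.0, 0.01, 756)
sharpes, slope, pvalue = sharpe_trend(returns)
print(np.round(sharpes, 2), f"slope per block: {slope:+.2f}", f"p={pvalue:.2f}")
```

With only a handful of blocks the slope estimate is noisy, so a negative slope is a prompt to investigate rather than proof of decay.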
3. Bootstrap Confidence Intervals
Figure: Bootstrap Distribution: Win Rate Estimates (10k samples)
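A bootstrap distribution of win rates like the one in the figure can be produced by resampling trades with replacement and recomputing the win rate each time. This is a minimal sketch, assuming trades are independent of each other; the function name `bootstrap_win_rate_ci` and the synthetic trade data are hypothetical.

```python
import numpy as np

def bootstrap_win_rate_ci(trade_returns, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the win rate."""
    r = np.asarray(trade_returns, dtype=float)
    wins = (r > 0).astype(float)  # 1.0 for a winning trade, 0.0 otherwise
    rng = np.random.default_rng(seed)
    # Each row of idx is one bootstrap sample of trade indices, drawn with replacement
    idx = rng.integers(0, len(wins), size=(n_boot, len(wins)))
    boot_rates = wins[idx].mean(axis=1)
    lo, hi = np.quantile(boot_rates, [alpha / 2, 1 - alpha / 2])
    return wins.mean(), lo, hi

# Synthetic example (assumption): 200 trades with a small positive mean
rng = np.random.default_rng(1)
trades = rng.normal(0.001, 0.02, 200)
rate, lo, hi = bootstrap_win_rate_ci(trades)
print(f"Win rate: {rate:.1%}  95% CI: [{lo:.1%}, {hi:.1%}]")
```

If the lower bound of the interval sits at or below the break-even win rate, the observed edge may be noise. For serially correlated results, a block bootstrap would be a more appropriate choice than this simple resampling.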
4. Minimum Backtest Length
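The more configurations you try, the longer the backtest must be before a given Sharpe ratio means anything. The sketch below follows the minimum backtest length approximation published by Bailey, Borwein, López de Prado and Zhu, which estimates how many years are needed so that the best of N skill-less trials is not expected to reach a target annualized Sharpe. The function names are hypothetical, and the trials are assumed to be independent.

```python
import numpy as np
from scipy import stats

def expected_max_z(n_trials):
    """Approximate expected maximum of n_trials independent standard normals."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    g = np.euler_gamma  # Euler-Mascheroni constant, about 0.5772
    return ((1 - g) * stats.norm.ppf(1 - 1 / n_trials)
            + g * stats.norm.ppf(1 - 1 / (n_trials * np.e)))

def min_backtest_years(n_trials, target_sharpe):
    """Years of data needed so that the expected best in-sample annualized
    Sharpe among n_trials skill-less strategies stays below target_sharpe."""
    return (expected_max_z(n_trials) / target_sharpe) ** 2

for n in (10, 45, 200, 1000):
    print(f"{n:>5} trials -> {min_backtest_years(n, 1.0):.1f} years for Sharpe 1.0")
```

Unlike a fixed rule of thumb, this estimate grows with the number of trials, so it can be used in reverse to cap how many configurations a given data history supports.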
Key Formulas
Deflated Sharpe Ratio
Sharpe ratio adjusted for the number of strategies tried (N_trials). If you tried many strategies, the deflation is large. A raw Sharpe of 2.0 can deflate to 0.5 with enough trials.
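For reference, the deflated Sharpe ratio as published by Bailey and López de Prado is a probability rather than a Sharpe-like number. The notation below is an assumption made for this sketch: \widehat{SR} is the per-period (non-annualized) Sharpe of the selected strategy, T the number of return observations, \hat{\gamma}_3 and \hat{\gamma}_4 the skewness and kurtosis of the returns (kurtosis is 3 for normal returns), \gamma the Euler-Mascheroni constant, and \mathrm{Var}[\{\widehat{SR}_n\}] the variance of the Sharpe ratios across all N_trials configurations.

```latex
\mathrm{DSR} = \Phi\!\left(
  \frac{\left(\widehat{SR} - SR_0\right)\sqrt{T - 1}}
       {\sqrt{1 - \hat{\gamma}_3\,\widehat{SR} + \dfrac{\hat{\gamma}_4 - 1}{4}\,\widehat{SR}^{2}}}
\right),
\qquad
SR_0 = \sqrt{\mathrm{Var}\!\left[\{\widehat{SR}_n\}\right]}
\left(
  (1 - \gamma)\,\Phi^{-1}\!\left[1 - \frac{1}{N_{trials}}\right]
  + \gamma\,\Phi^{-1}\!\left[1 - \frac{1}{N_{trials}\,e}\right]
\right)
```

SR_0 is the Sharpe you would expect from the best of N_trials strategies with no skill. For normal returns the denominator reduces to \sqrt{1 + \widehat{SR}^2 / 2}, which is the same standard-error term used in the simplified code in this lesson.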
Hands-On Code
Overfitting Detection Suite
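import numpy as np
from scipy import stats

def overfitting_battery(oos_returns, n_strategies_tested):
    """Full overfitting detection battery on daily out-of-sample returns."""
    returns = np.asarray(oos_returns, dtype=float)
    n = len(returns)
    ann = np.sqrt(252)  # annualization factor for daily data

    # 1. Basic stats: per-period Sharpe, then annualized
    sr = returns.mean() / returns.std(ddof=1)
    sharpe = sr * ann
    print("=== OVERFITTING DETECTION ===")
    print(f"Raw Sharpe: {sharpe:.2f}")

    # 2. Deflated Sharpe (simplified): subtract the Sharpe expected from the
    # best of n_strategies_tested skill-less trials. The standard error is
    # computed on the per-period Sharpe, then annualized to match `sharpe`.
    sr_std = np.sqrt((1 + 0.5 * sr**2) / n)
    # With a single strategy there is no selection effect, so no deflation
    z = stats.norm.ppf(1 - 1 / n_strategies_tested) if n_strategies_tested > 1 else 0.0
    deflated = sharpe - sr_std * z * ann
    print(f"Deflated Sharpe: {deflated:.2f} ({n_strategies_tested} tested)")
    print(f" {'[PASS] Significant' if deflated > 0 else '[FAIL] Likely overfit'}")

    # 3. Performance trend: second half vs. first half of the sample
    first_half = returns[:n // 2].mean()
    second_half = returns[n // 2:].mean()
    if first_half != 0:
        degradation = (second_half - first_half) / abs(first_half) * 100
    else:
        degradation = float("nan")  # undefined; flagged by the check below
    print(f"Performance trend: {degradation:+.1f}%")
    # Only a drop counts as degradation; an improvement is not flagged
    print(f" {'[PASS] Stable' if degradation > -20 else '[WARN] Degrading'}")

    # 4. Minimum sample check (rule of thumb; unbounded when Sharpe is 0)
    min_years = max(2, (1 / sharpe)**2 * 2) if sharpe != 0 else float("inf")
    actual_years = n / 252
    print(f"Min backtest: {min_years:.1f}y | Actual: {actual_years:.1f}y")
    print(f" {'[PASS] Sufficient' if actual_years >= min_years else '[FAIL] Too short'}")

Run the full battery before deploying. If ANY test raises a flag, investigate further. Deploy only when all tests pass.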
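The battery above uses a simplified deflation that subtracts a threshold from the Sharpe ratio. The deflated Sharpe ratio as published by Bailey and López de Prado is instead a probability that also accounts for skewness, kurtosis, and the spread of Sharpe ratios across trials. A minimal sketch under stated assumptions: the function name is hypothetical, all Sharpe ratios are per-period (not annualized), and `trial_sharpes` holds the Sharpe of every configuration you tried. The example data is synthetic.

```python
import numpy as np
from scipy import stats

def deflated_sharpe_ratio(returns, trial_sharpes):
    """Probability that the true Sharpe exceeds the Sharpe expected from the
    best of N skill-less trials. All Sharpe ratios are per-period."""
    r = np.asarray(returns, dtype=float)
    trials = np.asarray(trial_sharpes, dtype=float)
    n_trials, T = len(trials), len(r)
    sr = r.mean() / r.std(ddof=1)
    skew = stats.skew(r)
    kurt = stats.kurtosis(r, fisher=False)  # non-excess kurtosis (3 for normal)
    g = np.euler_gamma
    # Expected maximum Sharpe among n_trials strategies with zero true Sharpe
    sr0 = trials.std(ddof=1) * (
        (1 - g) * stats.norm.ppf(1 - 1 / n_trials)
        + g * stats.norm.ppf(1 - 1 / (n_trials * np.e))
    )
    denom = np.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr**2)
    return stats.norm.cdf((sr - sr0) * np.sqrt(T - 1) / denom)

# Synthetic example (assumption): 1000 daily returns, 200 configurations tried
rng = np.random.default_rng(42)
returns = rng.normal(0.0005, 0.01, 1000)
trial_sharpes = rng.normal(0.0, 0.03, 200)
print(f"DSR: {deflated_sharpe_ratio(returns, trial_sharpes):.3f}")
```

A common reading is to require a DSR above 0.95 before treating the result as significant, though the cutoff is a judgment call.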
Knowledge Check
Q1. Your strategy shows Sharpe 2.5 but you tested 200 configurations. The deflated Sharpe is 0.1. What does this mean?
Assignment
Run the full overfitting detection battery on your strategy: PBO, deflated Sharpe, performance degradation analysis, and minimum sample check. Document results and make a deploy/no-deploy decision.