III Advanced • Week 7 • Lesson 19 • Duration: 50 min

Model Selection & Validation

How to choose between models without fooling yourself

Learning Objectives

  • Understand walk-forward validation and why k-fold fails for time series
  • Learn proper model comparison methodology
  • See the dangers of optimizing for the wrong metric
  • Build a model selection workflow you can actually trust

Explain Like I'm 5

You've trained three models. They all look good on paper. How do you pick the best one? Not by looking at training accuracy — that's like judging a student by their open-book exam. You need a proper validation setup that simulates real deployment conditions. If the model can't perform on data it's never seen, it's useless.

Think of It This Way

Choosing a model is like hiring an employee. Their resume (training performance) tells you what they CAN do. The interview (validation) tests what they ACTUALLY do under unfamiliar conditions. Reference checks (out-of-sample testing) confirm they're not just good at interviews.

1. Why K-Fold Cross-Validation Doesn't Work for Trading

K-fold cross-validation shuffles data randomly into folds. For trading data, this is fundamentally broken. Why? Because financial data is temporal: events at time t+1 depend on events at time t. When you randomly shuffle, fold 3 might contain data from 2023 that's used to validate a model trained on fold 4 containing data from 2024. You're literally training on the future and testing on the past. The result: inflated performance estimates that don't hold in production.

Walk-forward validation fixes this:

1. Train on Jan 2019 – Dec 2021
2. Test on Jan 2022 – Jun 2022 (with a gap of 0 or N bars for purging)
3. Retrain on Jan 2019 – Jun 2022
4. Test on Jul 2022 – Dec 2022
5. Repeat

Each test period comes strictly AFTER the training period, so there is no future leakage. This simulates what actually happens in production: you train on all available past data and trade on unseen future data.

Bailey, D.H. et al. (2014). "The Probability of Backtest Overfitting." Journal of Computational Finance.
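To make the splitting logic concrete, here is a minimal sketch of a walk-forward split generator with an optional purge gap. The function name `walk_forward_splits` and the bar counts are illustrative, not part of the lesson's code.

python
import numpy as np

def walk_forward_splits(n_bars, train_size, test_size, purge_gap=0):
    """Yield (train_idx, test_idx) pairs where every test window
    starts strictly after the training data plus a purge gap."""
    start = train_size
    while start + purge_gap + test_size <= n_bars:
        train_idx = np.arange(0, start)                      # expanding window: all past data
        test_idx = np.arange(start + purge_gap,
                             start + purge_gap + test_size)  # strictly in the future
        yield train_idx, test_idx
        start += test_size                                   # roll forward, then retrain

# Example: ~5 years of daily bars, 2-year initial train, quarterly tests, 5-bar purge
for train_idx, test_idx in walk_forward_splits(1260, 504, 63, purge_gap=5):
    print(f"train [{train_idx[0]}..{train_idx[-1]}] -> test [{test_idx[0]}..{test_idx[-1]}]")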

Apparent Performance: K-Fold vs Walk-Forward

2. Choosing the Right Metric

Accuracy is the default metric, and it's the wrong one for trading. Why? Because accuracy weights all predictions equally. A model that correctly predicts 60% of trades and catches the big winners is far more valuable than one that also gets 60% right but misses all the big moves.

Better metrics for trading models:

  • AUC-ROC — Area under the receiver operating characteristic curve. Measures discrimination ability across all thresholds. Good for comparing model quality, less useful for setting production thresholds.
  • Profit Factor — Gross profits / gross losses. Directly measures whether the model makes money. Want > 1.5 for production.
  • Total R — Sum of all trade returns in risk units. The bottom line: does following this model's signals make money?
  • Sharpe Ratio — Risk-adjusted return. A model that makes 5R per month with 1R standard deviation (Sharpe ≈ 5) is far better than one that makes 8R with 4R standard deviation (Sharpe ≈ 2).
  • Maximum Drawdown — Worst peak-to-trough decline. A model with great returns but a 15% max drawdown might breach a funded account's limits. This is a constraint, not just a metric.
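A minimal sketch of how these metrics can be computed from a list of per-trade returns expressed in R units. The helper name `trade_metrics` and the annualization assumption (roughly one trade per trading day) are illustrative, not part of the lesson.

python
import numpy as np

def trade_metrics(returns_r, trades_per_year=252):
    """Summarize a series of per-trade returns expressed in R (risk units)."""
    returns_r = np.asarray(returns_r, dtype=float)

    gross_profit = returns_r[returns_r > 0].sum()
    gross_loss = -returns_r[returns_r < 0].sum()
    profit_factor = gross_profit / gross_loss if gross_loss > 0 else np.inf

    total_r = returns_r.sum()

    # Sharpe: mean / std of per-trade returns, annualized assuming
    # roughly one trade per trading day (adjust trades_per_year to your frequency)
    sharpe = np.nan
    if returns_r.std(ddof=1) > 0:
        sharpe = returns_r.mean() / returns_r.std(ddof=1) * np.sqrt(trades_per_year)

    # Max drawdown of the cumulative equity curve, measured in R
    equity = returns_r.cumsum()
    peaks = np.maximum.accumulate(equity)
    max_drawdown_r = (peaks - equity).max()

    return {"profit_factor": profit_factor, "total_r": total_r,
            "sharpe": sharpe, "max_drawdown_r": max_drawdown_r}

# Example: ten hypothetical trades in R units
print(trade_metrics([1.8, -1.0, 2.4, -1.0, -1.0, 0.9, 3.1, -1.0, -1.0, 1.2]))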

Model A vs Model B — Same Accuracy, Very Different Value

3. The Model Selection Workflow

Here's the process that actually works:

Step 1: Define your comparison metric upfront. Before you train anything, decide what "best" means. Profit factor? Sharpe? Total R? Pick one primary metric and 2-3 secondary constraints (e.g., max drawdown < 8%).

Step 2: Train all candidate models on the same data split. Same training period, same validation period, same features. The only variable should be the model itself.

Step 3: Compare on walk-forward out-of-sample data. Not validation data — genuinely held-out data that you haven't touched during development.

Step 4: Run multiple walk-forward windows. One out-of-sample period isn't enough. If Model A beats Model B in 4 out of 5 windows, that's more convincing than winning in 1 out of 1.

Step 5: Test for statistical significance. Is the difference real or noise? Run a paired t-test on the per-window differences (see the sketch below). If p > 0.05, the models aren't meaningfully different — pick the simpler one.

Step 6: Stress test the winner. Monte Carlo simulation. Spread widening. Drawdown scenarios. Does the winning model survive adverse conditions?
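A sketch of Step 5, assuming per-window profit factors for two candidate models have already been collected (the numbers below are illustrative):

python
import numpy as np
from scipy import stats

# Per-window profit factors for two candidate models (illustrative values)
pf_model_a = np.array([1.62, 1.41, 1.88, 1.17, 1.55])
pf_model_b = np.array([1.48, 1.39, 1.70, 1.21, 1.44])

# Paired t-test on the per-window differences
t_stat, p_value = stats.ttest_rel(pf_model_a, pf_model_b)
print(f"mean difference = {np.mean(pf_model_a - pf_model_b):+.3f}, "
      f"t = {t_stat:.2f}, p = {p_value:.3f}")

if p_value > 0.05:
    print("Not statistically significant -> prefer the simpler model.")
else:
    print("Significant -> prefer the better-performing model.")

With only a handful of windows the test has very little power, so treat it as a sanity check rather than proof.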

4. Occam's Razor — When Models Tie

When two models perform similarly, always pick the simpler one. This isn't aesthetics — it's pragmatic. Simpler models:

  • Are less likely to be overfit
  • Are easier to debug in production
  • Are faster to retrain
  • Have fewer failure modes
  • Are easier to explain to stakeholders

A complex stacked ensemble that beats XGBoost by 0.3% accuracy is probably not worth the additional infrastructure and maintenance cost. That 0.3% is likely noise anyway. The real world is messy: spreads change, data feeds lag, features arrive late. Simple models degrade gracefully under these conditions. Complex models break spectacularly.

Model Complexity vs Robustness

Key Formulas

Profit Factor

Gross profits divided by gross losses. PF > 1 = profitable. PF > 1.5 = solid. PF > 2.0 = excellent. This directly measures whether the model makes money, unlike accuracy.
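In formula form (standard definition):

$$\text{Profit Factor} = \frac{\sum \text{winning trade profits}}{\left|\sum \text{losing trade losses}\right|}$$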

Sharpe Ratio (Annualized)

Risk-adjusted return. Mean excess return divided by standard deviation, annualized. Higher = better risk-adjusted performance. S > 2 is considered very good for trading strategies.
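In formula form (standard definition, where $\mu$ is the mean excess return per period, $\sigma$ its standard deviation, and $N$ the number of periods per year):

$$\text{Sharpe}_{\text{annual}} = \frac{\mu}{\sigma}\sqrt{N}$$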

Hands-On Code

Walk-Forward Model Comparison

python
import numpy as np

def walk_forward_compare(models, features, labels, window_size=252,
                         step_size=63):
    """Compare models using walk-forward validation.

    models      : dict mapping name -> factory function returning a fresh model
    window_size : initial training size in bars
    step_size   : test window size in bars
    """
    n = len(features)
    results = {name: [] for name in models}

    for start in range(window_size, n - step_size, step_size):
        train_end = start
        test_end = start + step_size

        # Expanding training window: all data strictly before the test window
        X_train = features[:train_end]
        y_train = labels[:train_end]
        X_test = features[train_end:test_end]
        y_test = labels[train_end:test_end]

        for name, model_fn in models.items():
            model = model_fn()          # fresh, untrained model for each window
            model.fit(X_train, y_train)
            preds = model.predict(X_test)

            # Proxy profit factor: trades are taken only when the model
            # predicts 1, and each trade is assumed to win or lose exactly 1R
            wins = np.sum((preds == 1) & (y_test == 1))
            losses = np.sum((preds == 1) & (y_test == 0)) + 1e-10
            pf = wins / losses
            results[name].append(pf)

    # Compare across windows: average PF and consistency
    for name, pfs in results.items():
        print(f"{name}: PF = {np.mean(pfs):.2f} ± {np.std(pfs):.2f}"
              f" (wins {sum(1 for p in pfs if p > 1)}/{len(pfs)} windows)")

Each model is tested on multiple non-overlapping forward windows. Comparing average profit factor AND consistency (how many windows are profitable) gives a fair assessment.
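A hypothetical usage sketch, assuming scikit-learn is installed and `features` / `labels` are time-ordered arrays already defined elsewhere:

python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

models = {
    # Factory functions so each window gets a fresh, untrained model
    "logreg": lambda: LogisticRegression(max_iter=1000),
    "random_forest": lambda: RandomForestClassifier(n_estimators=200, random_state=42),
}

walk_forward_compare(models, features, labels, window_size=252, step_size=63)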

Knowledge Check

Q1. Why does k-fold cross-validation overestimate trading model performance?

Assignment

Take two models (e.g., XGBoost and Random Forest) and compare them using walk-forward validation over at least 4 non-overlapping test windows. Report profit factor for each window and overall. Is the difference statistically significant? Would you trust the winner?