IV • Expert • Week 14 • Lesson 42 • Duration: 45 min

Stress Testing ML Models

Breaking your models on purpose to find weaknesses

Learning Objectives

  • Learn to stress test ML models specifically (not just strategies)
  • Understand adversarial inputs and model fragility
  • Build resilient models that handle edge cases gracefully

Explain Like I'm 5

Stress testing a strategy tells you if the strategy breaks. Stress testing a model tells you if the underlying ML model breaks. Different thing. What happens if a feature has a value the model never saw in training? What if data is missing? What if the distribution shifts? This is where production systems earn their keep.

Think of It This Way

Stress testing strategies is like testing a car on a rough road. Stress testing models is like testing the engine specifically — what happens at extreme temperatures, low fuel, dirty oil? The engine can fail even if the road is smooth.

1. Model Fragility Points

ML models can break in ways that aren't obvious:

1. Out-of-distribution inputs. If RSI has always been 20-80 during training, what happens when it's 5 or 95? Some models extrapolate badly, producing extreme or nonsensical predictions.

2. Missing features. If a data feed drops one feature, does the model crash or degrade gracefully?

3. Feature distribution shift. If volatility suddenly spikes to 3x anything in training, feature values will be extreme. How does the model handle this?

4. Temporal distribution shift. Market regimes change. A model trained on trending data might fail completely in ranging markets.

5. Adversarial inputs. Inputs specifically crafted to exploit model weaknesses. Less relevant for trading (the market doesn't intentionally attack your model), but worth understanding.

Defenses include feature clipping to the training range, NaN fallback handling, regime detection that adapts strategy behavior, and regular retraining to capture distribution shifts. The simplest starting point is detecting out-of-range inputs, as in the sketch below.
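
A minimal sketch of that out-of-range check, assuming per-feature bounds were saved at training time (train_min, train_max, and flag_out_of_distribution are illustrative names, not from any library):

python
import numpy as np

def flag_out_of_distribution(x, train_min, train_max, feature_names):
    """Return names of features outside the range seen in training."""
    flagged = []
    for i, name in enumerate(feature_names):
        # NaN and out-of-range values both count as out-of-distribution
        if np.isnan(x[i]) or x[i] < train_min[i] or x[i] > train_max[i]:
            flagged.append(name)
    return flagged

In production, a non-empty result is a signal to skip the trade or fall back to a conservative default rather than trust the prediction.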

2. The Stress Test Playbook

Here's the actual playbook. Run ALL of these before deploying anything:

Test 1: Extreme feature values. Take each feature to 3x its training max. Does the model's prediction change wildly? Flag sensitive features.

Test 2: NaN injection. Set each feature to NaN one at a time. Does the model crash or handle it? If it crashes, you need defensive code.

Test 3: All-zero input. Feed the model zeros for everything. What does it predict? It should output something reasonable (near 0.5 for a binary classifier).

Test 4: Correlated feature perturbation. Shift ALL momentum features simultaneously (simulating a regime change). Does the model handle it or produce extreme outputs?

Test 5: Time-shifted data. Feed the model data from a completely different period. Does it give reasonable, lower-confidence outputs, or does it break?

Tests 1-3 appear in the Hands-On Code at the end of this lesson; Tests 4 and 5 are sketched right after this list.
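
A hedged sketch of Tests 4 and 5, assuming you know which columns hold momentum features and have held-out data from a different period (momentum_idx and X_other_period are illustrative names):

python
import numpy as np

def stress_correlated_shift(model, X_test, momentum_idx, n_sigma=3.0):
    """Test 4: shift all momentum features together, simulating a regime change."""
    X_shifted = X_test.copy()
    for i in momentum_idx:
        X_shifted[:, i] += n_sigma * X_test[:, i].std()
    preds = model.predict_proba(X_shifted)[:, 1]
    print(f"Shifted regime: predictions in [{preds.min():.3f}, {preds.max():.3f}]")

def stress_time_shift(model, X_other_period):
    """Test 5: score data from a different period; flag overconfident output."""
    preds = model.predict_proba(X_other_period)[:, 1]
    overconfident = ((preds > 0.9) | (preds < 0.1)).mean()
    print(f"Out-of-period rows with extreme confidence: {overconfident:.1%}")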

[Chart: Stress Test Results, Model Sensitivity by Feature Group]

3. Defensive Code Patterns

The stress tests will reveal fragility. Here's how to fix it:

Feature clipping. Before feeding features to the model, clip them to [min_train, max_train]. This prevents extrapolation on extreme values that the model has never seen.

NaN replacement. For each feature, define a safe default (usually the training median). If a feature is NaN, substitute the default. Never let NaN reach the model.

Confidence gating. If the model's prediction confidence is below 0.52 (barely above random), don't trade. Low confidence often signals out-of-distribution inputs.

Sanity checks on output. Model says probability = 0.999 for LONG? That's suspicious. Cap confidence at 0.85 or similar. Extreme confidence usually means the model is extrapolating badly.

These patterns add maybe 20 lines of code but prevent catastrophic failures in production. It's the highest-ROI code you'll ever write. All four patterns fit in one thin wrapper, sketched below.
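
A minimal sketch of that wrapper, assuming per-feature training bounds and medians were saved as NumPy arrays at training time (train_min, train_max, train_median, and safe_predict are illustrative names):

python
import numpy as np

def safe_predict(model, x, train_min, train_max, train_median,
                 min_conf=0.52, max_conf=0.85):
    """Defensive wrapper: impute NaN, clip to training range, gate, and cap."""
    x = np.asarray(x, dtype=float).copy()
    # NaN replacement: never let NaN reach the model
    nan_mask = np.isnan(x)
    x[nan_mask] = train_median[nan_mask]
    # Feature clipping: prevent extrapolation past the training range
    x = np.clip(x, train_min, train_max)
    p_long = model.predict_proba(x.reshape(1, -1))[:, 1][0]
    # Confidence gating: barely-above-random predictions are not tradeable
    if abs(p_long - 0.5) < min_conf - 0.5:
        return None  # signal: don't trade
    # Output sanity check: extreme confidence usually means bad extrapolation
    return float(np.clip(p_long, 1.0 - max_conf, max_conf))

Returning None for the no-trade case forces the calling strategy code to handle low confidence explicitly instead of trading on it.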

4. Production Failure Patterns

Real patterns from production model failures:

The volatility spike. A model trained on 2023-2024 data (low volatility) hit a spike that sent ATR to 4x the training maximum. Its predictions went haywire, generating 12 trades in 3 hours. Fix: feature clipping plus a max-trades-per-day limit.

The data feed hiccup. The broker's data feed went down for 30 seconds, and the model received NaN for spread-related features. The prediction pipeline crashed, leaving an open position without exit management for 45 minutes. Fix: NaN handling plus automatic flat mode on data failure.

The regime change. A model trained during a strong USD trend kept predicting trend continuation after USD flattened, losing 15 trades in a row. Fix: regime detection that flags non-trending environments and reduces position size.

The common thread: every one of these failures was something stress testing would have caught. Test before you deploy, not after you lose money.

Further reading: Taleb, N. N. (2010). The Black Swan: The Impact of the Highly Improbable. Random House.
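
A hedged sketch of two of those fixes, the max-trades-per-day limit and the automatic flat mode on data failure (TradingGuard and its methods are illustrative, not from any specific library):

python
import numpy as np

class TradingGuard:
    """Runtime guards: cap trade frequency and go flat on bad data."""

    def __init__(self, max_trades_per_day=5):
        self.max_trades_per_day = max_trades_per_day
        self.trades_today = 0
        self.flat_mode = False

    def on_new_features(self, features):
        # Any NaN in the feed triggers flat mode until clean data returns
        self.flat_mode = bool(np.isnan(features).any())
        return not self.flat_mode

    def allow_trade(self):
        # Refuse trades in flat mode or once the daily budget is spent
        if self.flat_mode or self.trades_today >= self.max_trades_per_day:
            return False
        self.trades_today += 1
        return True

Reset trades_today at the start of each session; the regime-detection fix follows the same pattern, scaling position size down when a trend filter fails.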

Hands-On Code

ML Model Stress Test Suite

python
import numpy as np

def stress_test_model(model, X_test, feature_names):
    """Full model stress test suite."""
    print("=== ML MODEL STRESS TESTS ===")
    
    # 1. Extreme values
    print("\n1. EXTREME VALUES")
    base_pred = model.predict_proba(X_test[:1])[:, 1][0]
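    # Only the first five features are perturbed here for brevity; extend to all in practice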
    for i, name in enumerate(feature_names[:5]):
        X_ext = X_test[:1].copy()
        X_ext[0, i] = X_test[:, i].max() * 3
        ext_pred = model.predict_proba(X_ext)[:, 1][0]
        change = abs(ext_pred - base_pred)
        status = '[WARN]' if change > 0.2 else '[PASS]'
        print(f"  {name} at 3x: change {change:.3f} {status}")
    
    # 2. Missing values
    print("\n2. MISSING VALUES")
    try:
        X_nan = X_test[:1].copy()
        X_nan[0, 0] = np.nan
        model.predict(X_nan)
        print("  [PASS] Handles NaN gracefully")
    except Exception:
        print("  [FAIL] Crashes on NaN input")
    
    # 3. All-zero input
    print("\n3. ALL-ZERO INPUT")
    X_zero = np.zeros_like(X_test[:1])
    pred = model.predict_proba(X_zero)[:, 1][0]
    print(f"  Prediction: {pred:.3f}")
    
    # 4. Prediction range check
    print("\n4. PREDICTION RANGE")
    preds = model.predict_proba(X_test)[:, 1]
    extremes = (preds > 0.99).sum() + (preds < 0.01).sum()
    print(f"  Range: [{preds.min():.3f}, {preds.max():.3f}]")
    print(f"  Extreme predictions: {extremes}")

Stress test your model, not just your strategy. These tests reveal fragility that only appears with unusual inputs — exactly the inputs that show up during market crises.

Knowledge Check

Q1. Your model crashes when a feature is NaN. In production, a data feed drops out, causing NaN values. What happens?

Assignment

Run the full model stress test suite on your trained model. Identify fragility points. Implement defensive code (feature clipping, NaN handling, output sanity checks) for each identified weakness.