Intermediate • Week 2 • Lesson 6 • Duration: 60 min

Hypothesis Testing for Traders

How to prove your strategy isn't just luck

Learning Objectives

  • Formulate null and alternative hypotheses for trading strategies
  • Understand Type I and Type II errors in a trading context
  • Apply t-tests and permutation tests to validate trading signals
  • Interpret p-values correctly and avoid the classic misinterpretations

Explain Like I'm 5

Hypothesis testing asks one question: "Is my strategy actually good or did I just get lucky?" It's like someone claiming they can shoot free throws at 70%. You'd say "prove it — shoot 100." If they hit 72, maybe it's real. If they hit 55, maybe they got lucky early. Hypothesis testing is the math behind "prove it."

Think of It This Way

Think of hypothesis testing like a courtroom trial. The null hypothesis is "innocent until proven guilty" — your strategy has no edge. The alternative is "guilty" — there is an edge. You need enough evidence to convict beyond reasonable doubt (p < 0.01). Weak evidence? Strategy walks free. You don't trade it.

1. The Framework: H₀ vs H₁

Every hypothesis test starts with two competing ideas:

  • H₀ (null hypothesis): "This strategy has zero edge. Any profits came from random chance."
  • H₁ (alternative hypothesis): "This strategy has a genuine statistical edge."

You assume H₀ is true, then calculate how unlikely your observed results would be under that assumption. If they're really unlikely (p < 0.01), you reject H₀ and accept that the edge is probably real.

Here's the key insight — and it trips people up constantly: you never prove H₁. You only show that H₀ is implausible given the data. It's like saying "I can't prove this coin is rigged, but after 80 heads in 100 flips, I'm not betting on it being fair."

Most retail traders skip this step entirely. They look at a backtest, see green numbers, and go live. That's how you lose money. The hypothesis test is the firewall between you and your own overconfidence.
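
To make the coin analogy concrete, here is a minimal sketch of the "how unlikely under H₀" calculation. The 80-heads-in-100-flips number comes from the paragraph above; the simulation size, the seed, and the use of SciPy's binomtest are arbitrary choices for illustration.

python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(42)

# H0: the coin is fair (p = 0.5). Observed: 80 heads in 100 flips.
# Monte Carlo estimate: how often does a fair coin do at least this well?
heads = rng.binomial(n=100, p=0.5, size=1_000_000)
mc_p = np.mean(heads >= 80)

# Exact binomial test for comparison; the Monte Carlo estimate is ~0 because
# 80/100 is extremely deep in the tail of a fair coin's distribution
exact_p = binomtest(80, 100, p=0.5, alternative="greater").pvalue

print(f"Monte Carlo p-value: {mc_p:.2e}")
print(f"Exact p-value:       {exact_p:.2e}")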

2. Type I and Type II Errors

Two ways to get it wrong:

Type I error (false positive): You conclude the strategy has edge when it doesn't. You trade it live. It bleeds money. This is the most common and expensive mistake in quant finance.

Type II error (false negative): You conclude the strategy has no edge when it actually does. You miss out on profits. Annoying, but not fatal — you still have your capital.

In trading, Type I errors are far worse. Deploying a fake edge costs real money. That's why production systems use strict significance thresholds (p < 0.01, not the usual p < 0.05) and stack multiple validation methods on top of each other:

  • t-test on trade returns
  • Walk-forward validation (true out-of-sample)
  • Probability of backtest overfitting (PBO) analysis
  • Monte Carlo simulation (thousands of runs)

If your strategy passes all of those, you've minimized Type I risk. If it fails any single one, don't trade it.
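
A quick way to feel the difference between the two errors is to simulate them. This is an illustrative sketch only: the edge size, volatility, trade count, and the use of SciPy's one-sample t-test are all made-up assumptions for the demo.

python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_strategies, n_trades, alpha = 5_000, 250, 0.01

# Type I: strategies with NO edge (true mean return = 0).
# How often do they falsely look significant at our threshold?
no_edge = rng.normal(0.0, 0.02, size=(n_strategies, n_trades))
p_no_edge = stats.ttest_1samp(no_edge, 0.0, axis=1).pvalue
type_1_rate = np.mean(p_no_edge < alpha)

# Type II: strategies with a small REAL edge. How often do we miss it?
small_edge = rng.normal(0.001, 0.02, size=(n_strategies, n_trades))
p_edge = stats.ttest_1samp(small_edge, 0.0, axis=1).pvalue
type_2_rate = np.mean(p_edge >= alpha)

print(f"Type I rate (false positives): {type_1_rate:.3f}  (~ alpha by design)")
print(f"Type II rate (missed edges):   {type_2_rate:.3f}  (small edges need many trades)")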

[Chart: Type I vs Type II Error Cost Comparison]

3. P-Values — What They Actually Mean

P-values are the most misunderstood concept in statistics. Let's clear it up.

What a p-value IS: The probability of seeing results this extreme (or more extreme) if the null hypothesis is true.

What a p-value is NOT:

  • Not the probability your strategy is bad
  • Not the probability you'll lose money
  • Not a measure of effect size or profitability

Classic mistake: "My strategy has p = 0.03, so there's a 97% chance it works." Wrong. The p-value tells you nothing about the probability of your hypothesis being true.

Threshold conventions in quant finance:

  • p < 0.05 — Barely acceptable for academic papers, too loose for trading
  • p < 0.01 — Minimum for production trading systems
  • p < 0.001 — What you actually want before risking real capital

Why so strict? Because you're typically testing many strategies. Test 100 at p < 0.05 and you expect 5 false positives by chance. That's the multiple comparisons problem, and it will eat you alive if you're not careful.
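
Here is that definition in code, for a single backtest. The return series is synthetic, and the one-sided framing plus the normal approximation for the sample mean under H₀ are simplifying assumptions made for illustration.

python
import numpy as np

rng = np.random.default_rng(7)

# One backtest: 300 trade returns with a small built-in edge (synthetic numbers)
observed = rng.normal(0.0008, 0.015, size=300)
observed_mean = observed.mean()

# The p-value by definition: assume H0 (true mean = 0, same volatility) and ask
# how often pure noise produces a mean return at least this large.
# Under H0 the sample mean of 300 trades is ~ Normal(0, s / sqrt(300)).
null_means = rng.normal(0.0, observed.std(ddof=1) / np.sqrt(300), size=100_000)
p_value = np.mean(null_means >= observed_mean)

print(f"Observed mean return: {observed_mean:.5f}")
print(f"One-sided p-value:    {p_value:.4f}")
# This is P(result this extreme | no edge), NOT P(no edge | result).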

[Chart: P-Value Distribution Under Null Hypothesis (No Real Edge)]

4. Permutation Tests — The Gold Standard

Parametric tests like t-tests assume normal distributions. Financial returns aren't normal — they have fat tails, skew, and all kinds of weirdness. Permutation tests solve this. Here's how they work:

1. Compute your strategy's actual performance metric (say, Sharpe ratio)
2. Randomly shuffle your trade signals (destroying any real pattern)
3. Compute the metric on the shuffled data
4. Repeat steps 2-3 ten thousand times
5. See where your real metric falls in the distribution of random metrics

If your real Sharpe is higher than 99% of the random Sharpes, your edge is real with 99% confidence. No distributional assumptions needed. This is how serious quant shops validate strategies. If you're not running permutation tests, you're guessing. A full implementation is in the Hands-On Code section below.

[Chart: Permutation Test — Real Strategy vs Null Distribution]

5. The Multiple Comparisons Problem

This is the #1 reason retail quants fail. Here's the trap: you test 100 different strategy ideas. One of them has p = 0.02. You celebrate. You trade it. It fails. Why? Because with 100 tests at p < 0.05, you expect 5 false positives. Your "winner" was probably just noise that looked like signal.

Solutions:

  • Bonferroni correction: Divide your p-value threshold by the number of tests. 100 tests → need p < 0.0005. Harsh but honest.
  • FDR (False Discovery Rate): Less conservative. Controls the expected proportion of false discoveries.
  • Out-of-sample validation: The best defense. No amount of in-sample testing substitutes for genuine holdout data.

Production-grade systems typically combine all three: strict p-value thresholds, Bonferroni or FDR adjustment, AND walk-forward out-of-sample validation. If that sounds like overkill, you haven't lost enough money yet.
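
The trap and two of the corrections, in a short sketch. The 100 no-edge strategies are synthetic, and the FDR step assumes statsmodels is available (its multipletests function implements Benjamini-Hochberg among other methods).

python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(3)

# 100 strategy ideas, none of which has a real edge
returns = rng.normal(0.0, 0.02, size=(100, 250))
p_values = stats.ttest_1samp(returns, 0.0, axis=1).pvalue

naive_hits = np.sum(p_values < 0.05)                       # expect ~5 false positives
bonferroni_hits = np.sum(p_values < 0.05 / len(p_values))  # threshold / number of tests

# Benjamini-Hochberg FDR control, less conservative than Bonferroni
fdr_reject, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print(f"'Significant' at raw p < 0.05: {naive_hits}")
print(f"Survive Bonferroni:            {bonferroni_hits}")
print(f"Survive BH FDR:                {fdr_reject.sum()}")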

6. The Validation Workflow

Here's how this works in practice when you're validating a strategy:

Step 1: Formulate hypotheses. H₀: strategy has no edge. H₁: strategy has a real edge.

Step 2: Run initial t-test. Is the mean return significantly different from zero? Need t > 2.58 (p < 0.01). If not, stop here.

Step 3: Permutation test. Run 10,000 shuffles. Where does your actual metric fall? Need to beat 99%+ of shuffled versions.

Step 4: Walk-forward validation. Train on data through 2020. Test on 2021-2022. Does it still work out-of-sample? If not, it's overfit.

Step 5: Monte Carlo simulation. Resample your trades thousands of times with replacement. What's the distribution of outcomes? What's the worst-case drawdown?

Step 6: PBO (Probability of Backtest Overfitting). Combinatorial purged cross-validation. PBO < 0.20 = acceptable. PBO > 0.40 = your backtest is lying to you.

Only after passing all six steps should you risk real money. This sounds excessive. It isn't. Ask anyone who's blown an account.
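
A minimal skeleton for Steps 2 and 5 (the t-test gate and the Monte Carlo resample). The trade returns are synthetic placeholders; the permutation test from the Hands-On Code section would slot in as Step 3, and walk-forward and PBO need real dated data, so they are omitted here.

python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
trade_returns = rng.normal(0.003, 0.02, size=500)   # placeholder backtest results

# Step 2: t-test gate. Stop if the mean return isn't significant at p < 0.01.
res = stats.ttest_1samp(trade_returns, 0.0)
print(f"Step 2: t = {res.statistic:.2f}, p = {res.pvalue:.4f}, pass = {res.pvalue < 0.01}")

# Step 5: Monte Carlo. Bootstrap-resample the trades to see the spread of outcomes.
n_sims = 5_000
boot = rng.choice(trade_returns, size=(n_sims, trade_returns.size), replace=True)
equity = np.cumprod(1 + boot, axis=1)
running_max = np.maximum.accumulate(equity, axis=1)
max_drawdown = ((running_max - equity) / running_max).max(axis=1)

print(f"Step 5: median final equity        {np.median(equity[:, -1]):.2f}x")
print(f"Step 5: 99th-pctl max drawdown     {np.percentile(max_drawdown, 99):.1%}")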

Key Formulas

T-Statistic for Mean Return

t = R̄ / (s / √N)

Tests whether the mean return differs significantly from zero. R̄ is the mean return, s is the standard deviation of returns, and N is the number of trades. Higher t = more significant. You want t > 2.58 for p < 0.01.

Minimum Track Record Length

How many trades you need to validate a given Sharpe ratio at significance level α. A Sharpe of 0.5 needs ~400 trades at 95% confidence. This is why serious backtests need thousands of trades, not dozens.
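
Assuming the intended definition is the Bailey-López de Prado Minimum Track Record Length, the formula is:

MinTRL = 1 + (1 − γ₃·SR̂ + ((γ₄ − 1) / 4)·SR̂²) · (z_α / (SR̂ − SR*))²

where SR̂ is the observed Sharpe ratio, SR* is the benchmark Sharpe under H₀ (usually 0), γ₃ and γ₄ are the skewness and kurtosis of the returns, and z_α is the standard normal quantile at confidence level α.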

Hands-On Code

Permutation Test for Strategy Validation

python
import numpy as np

def permutation_test(returns, signals, n_perms=10000):
    """Test whether signal-aligned returns beat randomly placed signals.

    returns and signals are equal-length NumPy arrays; signals is 1 when
    the strategy is in the market and 0 otherwise.
    """
    # Actual strategy performance (per-period Sharpe, no annualization)
    actual_sharpe = returns[signals == 1].mean() / returns[signals == 1].std()
    
    # Generate null distribution
    null_sharpes = []
    for _ in range(n_perms):
        shuffled = np.random.permutation(signals)
        r = returns[shuffled == 1]
        if len(r) > 1 and r.std() > 0:
            null_sharpes.append(r.mean() / r.std())
    
    null_sharpes = np.array(null_sharpes)
    p_value = np.mean(null_sharpes >= actual_sharpe)
    
    print(f"Actual Sharpe:  {actual_sharpe:.4f}")
    print(f"Null mean:      {np.mean(null_sharpes):.4f}")
    print(f"Null 99th pctl: {np.percentile(null_sharpes, 99):.4f}")
    print(f"p-value:        {p_value:.4f}")
    print(f"Significant?    {'YES' if p_value < 0.01 else 'NO'}")
    return p_value

# No assumptions about distribution shape needed
# This is the most honest test you can run

Permutation tests are distribution-free — no normal distribution assumption required. If your strategy beats 99%+ of random shuffles, the edge is real.
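
A quick usage example, assuming permutation_test from the block above is already defined. The data is synthetic, with a genuine edge planted on the signal days, so the test should flag it as significant.

python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic sanity check: a signal that genuinely predicts returns
signals = rng.integers(0, 2, size=2_000)
returns = rng.normal(0.0, 0.01, size=2_000) + 0.002 * signals  # real edge when signal == 1

permutation_test(returns, signals)   # p-value should land well below 0.01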

Knowledge Check

Q1. In trading hypothesis testing, which error is more costly?

Q2. Why are permutation tests preferred over t-tests for trading strategies?

Assignment

Take any backtested strategy result (or simulate one). Run a permutation test with 10,000 shuffles. What's the p-value? Now intentionally overfit a strategy to random data and run the same test — the permutation test should catch it. Document both results.
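
A starter sketch for the second half of the assignment, assuming permutation_test from the Hands-On Code section is in scope. The overfitting here is done by cherry-picking the best of 100 random signal sets on the same noise it is then tested on; all sizes and seeds are arbitrary.

python
import numpy as np

rng = np.random.default_rng(2024)

# Pure noise: these returns contain no edge by construction
returns = rng.normal(0.0, 0.01, size=1_000)

# "Overfit" by selection: try 100 random signal sets and keep the one with the
# best in-sample Sharpe, the classic multiple-comparisons trap
def in_sample_sharpe(sig):
    r = returns[sig == 1]
    return r.mean() / r.std()

candidates = [rng.integers(0, 2, size=returns.size) for _ in range(100)]
best_signals = max(candidates, key=in_sample_sharpe)

# Now run the permutation test on the cherry-picked signal and compare the
# p-value with what the in-sample Sharpe alone would have you believe.
p_value = permutation_test(returns, best_signals)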