Hypothesis Testing for Traders
How to prove your strategy isn't just luck
Learning Objectives
- Formulate null and alternative hypotheses for trading strategies
- Understand Type I and Type II errors in a trading context
- Apply t-tests and permutation tests to validate trading signals
- Interpret p-values correctly and avoid the classic misinterpretations
Explain Like I'm 5
Hypothesis testing asks one question: "Is my strategy actually good or did I just get lucky?" It's like someone claiming they can shoot free throws at 70%. You'd say "prove it — shoot 100." If they hit 72, maybe it's real. If they hit 55, maybe they got lucky early. Hypothesis testing is the math behind "prove it."
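In code, "prove it" is a one-sided binomial test. A minimal sketch using scipy, assuming a 50% "nothing special" baseline (that baseline is made up for this illustration):

from scipy.stats import binomtest

# H0: the shooter is no better than a 50% baseline; H1: they are better
print(binomtest(72, n=100, p=0.5, alternative='greater').pvalue)  # ~1e-5: hard to call luck
print(binomtest(55, n=100, p=0.5, alternative='greater').pvalue)  # ~0.18: easily luck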
Think of It This Way
Think of hypothesis testing like a courtroom trial. The null hypothesis is "innocent until proven guilty" — your strategy has no edge. The alternative is "guilty" — there is an edge. You need enough evidence to convict beyond reasonable doubt (p < 0.01). Weak evidence? Strategy walks free. You don't trade it.
1. The Framework: H₀ vs H₁
2. Type I and Type II Errors
[Chart: Type I vs Type II Error Cost Comparison]
3. P-Values: What They Actually Mean
[Chart: P-Value Distribution Under the Null Hypothesis (No Real Edge)]
4. Permutation Tests: The Gold Standard
[Chart: Permutation Test: Real Strategy vs Null Distribution]
5. The Multiple Comparisons Problem
6. The Validation Workflow
Key Formulas
T-Statistic for Mean Return
Tests whether mean return differs significantly from zero. R̄ is mean return, s is standard deviation, N is number of trades. Higher t = more significant. You want t > 2.58 for p < 0.01.
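Written out, this is the standard one-sample t-statistic against a zero mean:

t = \frac{\bar{R}}{s / \sqrt{N}} = \frac{\bar{R}\,\sqrt{N}}{s}

Because t grows like √N, a modest mean return can clear the 2.58 bar given enough trades, while a small sample almost never can.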
Minimum Track Record Length
How many trades you need to validate a given Sharpe ratio at significance level α. A Sharpe of 0.5 needs ~400 trades at 95% confidence. This is why serious backtests need thousands of trades, not dozens.
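A sketch of how that minimum can be computed, assuming the Bailey and López de Prado minimum track record length formula with a benchmark Sharpe of zero (the helper name and defaults below are illustrative, and the answer is sensitive to whether the Sharpe is quoted per trade, per day, or per year):

from scipy.stats import norm

def min_track_record_length(sr_hat, sr_benchmark=0.0, skew=0.0, kurt=3.0, alpha=0.05):
    """Smallest N at which an observed per-period Sharpe sr_hat is
    significantly above sr_benchmark at level alpha."""
    z = norm.ppf(1 - alpha)
    adj = 1 - skew * sr_hat + (kurt - 1) / 4 * sr_hat ** 2   # non-normality adjustment
    return 1 + adj * (z / (sr_hat - sr_benchmark)) ** 2

Negative skew and fat tails (kurt > 3) inflate the required sample size, which is one more reason long track records matter.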
Hands-On Code
Permutation Test for Strategy Validation
import numpy as np

def permutation_test(returns, signals, n_perms=10000):
    """Test if signal-aligned returns beat random."""
    # Actual strategy performance
    actual_sharpe = returns[signals == 1].mean() / returns[signals == 1].std()

    # Generate null distribution
    null_sharpes = []
    for _ in range(n_perms):
        shuffled = np.random.permutation(signals)
        r = returns[shuffled == 1]
        if len(r) > 1 and r.std() > 0:
            null_sharpes.append(r.mean() / r.std())
    null_sharpes = np.array(null_sharpes)

    p_value = np.mean(null_sharpes >= actual_sharpe)

    print(f"Actual Sharpe: {actual_sharpe:.4f}")
    print(f"Null mean: {np.mean(null_sharpes):.4f}")
    print(f"Null 99th pctl: {np.percentile(null_sharpes, 99):.4f}")
    print(f"p-value: {p_value:.4f}")
    print(f"Significant? {'YES' if p_value < 0.01 else 'NO'}")
    return p_value

# No assumptions about distribution shape needed
# This is the most honest test you can run

Permutation tests are distribution-free — no normal distribution assumption required. If your strategy beats 99%+ of random shuffles, the edge is real.
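A minimal usage sketch with synthetic data (the array names and random signal below are made up for illustration; substitute your own backtest arrays):

rng = np.random.default_rng(42)
fake_returns = rng.normal(0.0002, 0.01, size=2000)   # 2,000 daily returns, no edge baked in
fake_signals = rng.integers(0, 2, size=2000)         # random long/flat signal

p = permutation_test(fake_returns, fake_signals)
# With a purely random signal, p should land well above 0.01 on most runs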
Knowledge Check
Q1. In trading hypothesis testing, which error is more costly?
Q2. Why are permutation tests preferred over t-tests for trading strategies?
Assignment
Take any backtested strategy result (or simulate one). Run a permutation test with 10,000 shuffles. What's the p-value? Now intentionally overfit a strategy to random data and run the same test — the permutation test should catch it. Document both results.