Ensemble Methods
The wisdom of crowds — why committees beat individuals
Learning Objectives
- Understand why ensembles work (bias-variance decomposition)
- Learn the main ensemble techniques: bagging, boosting, stacking
- See how ensemble agreement serves as a confidence indicator
- Know when ensembles help and when they add complexity without benefit
Explain Like I'm 5
One analyst can be wrong. Five independent analysts are less likely to all be wrong in the same direction. Ensembles apply this logic to models: train multiple models, combine their predictions, and the result is more robust than any individual model. The key word is "independent" — five copies of the same model are useless.
Think of It This Way
Think of jury duty. One person might have biases or make mistakes. Twelve people, each with different perspectives, are more likely to reach a fair verdict. But only if they truly deliberate independently — if they all just follow the loudest voice, the jury is no better than one person.
1. Why Ensembles Work — The Math
[Chart: Ensemble Error Reduction as Members Increase]
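A small numeric sketch of why error falls as members are added, using the standard result that the error variance of an average of M models, each with individual error variance sigma² and pairwise error correlation rho, is rho·sigma² + (1 − rho)·sigma²/M. The sigma² and rho values below are illustrative, not taken from the chart.

def ensemble_error_variance(sigma2, rho, M):
    """Error variance of the average of M models, each with individual
    error variance sigma2 and pairwise error correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / M

sigma2 = 1.0  # single-model error variance (illustrative)
for rho in (0.0, 0.3, 0.7):
    row = ", ".join(f"M={M}: {ensemble_error_variance(sigma2, rho, M):.2f}"
                    for M in (1, 5, 25))
    print(f"rho={rho}: {row}")

With rho = 0 the variance falls toward zero as 1/M; with correlated models it flattens at rho·sigma², which is why five copies of the same model buy you nothing.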
2. Bagging, Boosting, and Stacking
[Chart: Performance by Ensemble Method (Out-of-Sample)]
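As a rough sketch of what the three families look like in code, here are scikit-learn's stock implementations scored with cross-validation. The synthetic dataset and hyperparameters are placeholders, not the out-of-sample setup behind the chart.

from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    # Bagging: independent trees on bootstrap samples, predictions averaged
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # Boosting: trees fit sequentially, each correcting the previous ensemble's errors
    "boosting": GradientBoostingClassifier(n_estimators=200, random_state=0),
    # Stacking: a meta-model learns how to weight the base models' predictions
    "stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy {scores.mean():.3f}")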
3. Ensemble Agreement as Confidence
[Chart: Win Rate by Ensemble Agreement Level]
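A minimal sketch of the grouping this chart shows, assuming you already have each model's binary votes and the realized 0/1 outcomes for a test set (the arrays at the bottom are random placeholders):

import numpy as np

def win_rate_by_agreement(votes, y_true):
    """Group test samples by ensemble agreement and compute the win rate
    (majority vote matching the realized outcome) in each bucket.
    votes: (n_models, n_samples) array of 0/1 predictions.
    y_true: (n_samples,) array of realized 0/1 outcomes."""
    frac_up = votes.mean(axis=0)                  # share of models voting 1
    agreement = np.maximum(frac_up, 1 - frac_up)  # share siding with the majority
    majority = (frac_up > 0.5).astype(int)
    correct = majority == y_true
    buckets = {}
    for level in np.unique(agreement):
        mask = agreement == level
        buckets[float(level)] = (correct[mask].mean(), int(mask.sum()))
    return buckets  # {agreement_level: (win_rate, n_samples)}

# Hypothetical example: 5 models voting on 200 samples
rng = np.random.default_rng(0)
votes = rng.integers(0, 2, size=(5, 200))
y_true = rng.integers(0, 2, size=200)
print(win_rate_by_agreement(votes, y_true))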
4. When Ensembles Don't Help
Key Formulas
Ensemble Prediction (Bagging)
Average prediction across M models. Each model f_m was trained on a different bootstrap sample. Variance decreases proportionally to 1/M for truly independent models.
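In symbols, the standard bagging average described here (f_m is the m-th model, trained on its own bootstrap sample):

$$\hat{f}_{\text{bag}}(x) = \frac{1}{M} \sum_{m=1}^{M} f_m(x)$$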
Bias-Variance Decomposition
Total error is the sum of systematic error (bias), random error (variance), and irreducible noise. Ensembles reduce the variance component without affecting bias or noise.
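In symbols, the standard decomposition of expected squared prediction error described here:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2$$

where $\sigma^2$ is the irreducible noise.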
Hands-On Code
Bootstrap Ensemble with Agreement Scoring
import numpy as np
import xgboost as xgb

class BootstrapEnsemble:
    """Bagged XGBoost ensemble with agreement-based confidence."""

    def __init__(self, n_models=5, **xgb_params):
        self.n_models = n_models
        self.params = xgb_params
        self.models = []

    def train(self, X, y):
        """Train ensemble on bootstrap samples."""
        n = len(X)
        for i in range(self.n_models):
            # Bootstrap sample (sample with replacement)
            idx = np.random.choice(n, size=n, replace=True)
            dtrain = xgb.DMatrix(X[idx], label=y[idx])
            model = xgb.train(self.params, dtrain, num_boost_round=300)
            self.models.append(model)
            print(f" Model {i+1}/{self.n_models} trained")

    def predict(self, X):
        """Predict with agreement score."""
        dtest = xgb.DMatrix(X)
        preds = np.array([m.predict(dtest) for m in self.models])
        # Binary prediction (vote) per model
        votes = (preds > 0.5).astype(int)
        agreement = votes.mean(axis=0)  # fraction of models voting positive
        # Ensemble probability = mean of individual probabilities
        ensemble_prob = preds.mean(axis=0)
        return {
            'probability': ensemble_prob,
            'agreement': agreement,
            'confidence': np.abs(agreement - 0.5) * 2,  # 0 = split vote, 1 = unanimous
        }

Five XGBoost models trained on different bootstrap samples. The agreement score acts as a natural confidence indicator: full agreement = high confidence = full position size.
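A short usage sketch for the class above; the synthetic data, the XGBoost parameters, and the 80/20 split are placeholder assumptions:

import numpy as np

# Hypothetical data: 1,000 samples, 10 features, binary labels
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

ensemble = BootstrapEnsemble(
    n_models=5,
    objective="binary:logistic",  # so each model outputs a probability
    max_depth=4,
    eta=0.05,
)
ensemble.train(X[:800], y[:800])

out = ensemble.predict(X[800:])
# With 5 models, confidence >= 0.6 means at least 4 of the 5 agree
high_conf = out['confidence'] >= 0.6
print(f"Share of high-confidence predictions: {high_conf.mean():.1%}")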
Knowledge Check
Q1. Why does averaging 5 model predictions reduce error?
Assignment
Build a 5-model bootstrap ensemble. For each test sample, record both the ensemble prediction and the agreement level. Plot win rate vs agreement level. Verify that high-agreement predictions have higher win rates.
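A minimal starting point, assuming the BootstrapEnsemble from the Hands-On Code section, a held-out test set, and matplotlib; binning by the raw agreement levels is one reasonable choice, not a requirement of the assignment.

import numpy as np
import matplotlib.pyplot as plt

def plot_win_rate_vs_agreement(ensemble, X_test, y_test):
    """Bucket test predictions by agreement level and bar-plot the win rate."""
    out = ensemble.predict(X_test)
    pred = (out['probability'] > 0.5).astype(int)
    win = pred == y_test
    levels = np.unique(out['agreement'])
    win_rates = [win[out['agreement'] == lvl].mean() for lvl in levels]
    plt.bar([f"{lvl:.0%}" for lvl in levels], win_rates)
    plt.xlabel("Agreement (share of models voting up)")
    plt.ylabel("Win rate")
    plt.title("Win rate vs. ensemble agreement")
    plt.show()

# plot_win_rate_vs_agreement(ensemble, X_test, y_test)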