III Advanced • Week 27 • Lesson 76 • Duration: 50 min

SHIP: ML Model Deployment

From .pkl files to live predictions — shipping your models is the hard part

Learning Objectives

  • Deploy ML models for real-time inference
  • Handle model versioning and rollback
  • Monitor model performance in production

Explain Like I'm 5

Here's the dirty secret of machine learning: training a model is about 20% of the work. Deploying it so it actually runs on live data, correctly, reliably, 24/7 — that's the other 80%. Most ML tutorials end at "model.fit()" and leave you completely unprepared for deployment. Deployment means: loading the right model version, feeding it correctly formatted features, getting predictions at the right time, and monitoring for degradation. It's a fundamentally different skillset from training.

Think of It This Way

Training a model is like building a prototype car in the garage. Deploying it is like mass-producing that car so it runs reliably on real roads, in all weather, with real drivers. Your prototype worked great in the garage — but does it handle rain? Potholes? 100K miles? That's the difference between training and deployment.

1. Model Deployment in Practice

Here's what deployment actually looks like in a production trading system.

Model loading (at startup):
- Load all model files (.pkl, .pt, .h5) from versioned directories
- Hash-verify every file against a frozen manifest
- If any hash doesn't match -> refuse to start. Seriously.
- This prevents you from accidentally running a wrong model version

Inference pipeline (every new bar):
1. Receive new bar data from the broker
2. Compute the feature vector (your 30-40 features)
3. Determine which model to use (cluster-based routing)
4. Run prediction (XGBoost -> probability score)
5. Apply the decision gate (enter/skip based on score)
6. If entering: size the position and set SL/TP

Critical deployment requirements:
- Model version matches training features: if you trained on 38 features, you MUST serve 38 features in the SAME ORDER. Feature drift = garbage predictions.
- Inference latency must be acceptable for your timeframe
- Graceful error handling: a model error means no trade (safe default)
- Log every prediction for an audit trail

Feature matching is where most deployment bugs hide. You add a feature, retrain, deploy, but forget to update the live feature pipeline. The model gets 37 features instead of 38. Predictions are garbage but nothing crashes: it silently gives you wrong answers.
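To make the flow concrete, here is a minimal sketch of the per-bar steps above, assuming a scikit-learn-style model with predict_proba. The helper callables (compute_features, select_model) and the 0.60 threshold are illustrative assumptions, not a prescribed design.

python
import numpy as np

def on_new_bar(bar, compute_features, select_model, threshold=0.60):
    """Hypothetical per-bar flow: features -> model routing -> prediction -> gate.

    compute_features(bar) -> 1-D feature array in the exact training order
    select_model(bar)     -> the fitted model to use for this bar (e.g., by cluster)
    threshold             -> assumed decision-gate cutoff; tune to your own strategy
    """
    features = compute_features(bar)                       # step 2: feature vector
    model = select_model(bar)                              # step 3: model routing
    prob = float(model.predict_proba(
        np.asarray(features).reshape(1, -1))[0, 1])        # step 4: probability score

    if prob < threshold:                                   # step 5: decision gate
        return None                                        # skip: no trade

    # Step 6 (position sizing, SL/TP) belongs to your execution / risk layer
    return {"action": "enter", "probability": prob}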

2. Model Versioning and Rollback

Versioning, which is non-negotiable:
- Every model file gets an MD5/SHA256 hash
- Hashes are stored in a frozen manifest file
- load_models() verifies every hash at startup
- If hashes don't match -> the system refuses to trade
- This catches accidental overwrites, corrupted files, and wrong versions

Rollback, because new models sometimes underperform:
- Keep the previous N model versions (at least 2)
- If the new model underperforms, revert to the previous hash-verified version
- Rollback should take minutes, not hours
- Production tip: test rollback BEFORE you need it

Deployment strategies:
- Blue-green: run the old and new model side by side, then switch traffic
- Canary: send 10% of signals to the new model, monitor, then ramp up
- Shadow: the new model runs alongside but doesn't trade; it just logs predictions
- For individual quants, shadow testing is the most practical approach (see the sketch below)

The urge to "just deploy the new model" is strong. Resist it. Every model update should go through shadow testing first.
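Shadow testing can be as small as a wrapper that scores every signal with both models but only ever acts on the live one. A minimal sketch, assuming both models expose a scikit-learn-style predict_proba; the class and logger names are illustrative.

python
import logging
import numpy as np

logger = logging.getLogger("shadow_test")

class ShadowRunner:
    """Run a candidate model alongside the live model; only the live model trades."""

    def __init__(self, live_model, candidate_model):
        self.live_model = live_model
        self.candidate_model = candidate_model

    def predict(self, features):
        X = np.asarray(features).reshape(1, -1)
        live_pred = self.live_model.predict_proba(X)[0, 1]
        try:
            # The candidate's prediction is logged for later comparison, never traded
            shadow_pred = self.candidate_model.predict_proba(X)[0, 1]
            logger.info("live=%.3f shadow=%.3f diff=%.3f",
                        live_pred, shadow_pred, shadow_pred - live_pred)
        except Exception as e:
            # A failing candidate must never affect live trading
            logger.error("shadow model failed: %s", e)
        return live_pred  # only the live model's output drives trading decisions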

3. Monitoring: Your Model Will Degrade

This is the part most people skip and later regret. Models WILL degrade over time. Markets change. Regime shifts happen. The question isn't IF, it's WHEN. Monitor at least these five things:

1. Prediction distribution shift: are predictions still in the expected range?
- Your model should output probabilities in [0, 1]
- If the mean prediction suddenly shifts from 0.52 to 0.65, something changed
- Could be feature drift, a data issue, or a genuine regime change

2. Feature drift: are inputs still similar to the training data?
- Compare live feature distributions to what the model was trained on
- Significant drift = the model is operating outside its training domain
- Think of it like this: you trained a model to drive in sunny weather, and now it's snowing

3. Outcome tracking: are predictions still accurate?
- Rolling IC (information coefficient) should stay positive
- Rolling win rate: monitor for sustained drops
- If IC goes negative for 30+ days, investigate immediately

4. Latency monitoring: is inference still fast enough?
- Models can slow down if input size changes or memory leaks accumulate

5. Error rates: are model errors increasing?
- NaN predictions, invalid outputs, unexpected exceptions

A minimal sketch of the drift and IC checks (items 2 and 3) appears after the chart below.

Figure: Model Signal Quality Degradation Over Time (Rolling IC)
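A lightweight sketch of the drift and outcome checks, assuming you saved per-feature means and standard deviations at training time. The z-score threshold and window size are illustrative, and plain correlation is used as a stand-in for the IC; rank (Spearman) correlation is also common.

python
import numpy as np

def feature_drift(live_features, train_mean, train_std, z_threshold=3.0):
    """Flag features whose live mean has drifted far from the training mean.

    live_features: (n_samples, n_features) array of recent live feature vectors
    train_mean, train_std: per-feature statistics saved at training time
    """
    live_mean = np.asarray(live_features).mean(axis=0)
    z = np.abs(live_mean - np.asarray(train_mean)) / (np.asarray(train_std) + 1e-12)
    return np.where(z > z_threshold)[0]  # indices of features that drifted

def rolling_ic(predictions, realized_returns, window=250):
    """Correlation between predictions and realized returns over a trailing window."""
    preds = np.asarray(predictions, dtype=float)
    rets = np.asarray(realized_returns, dtype=float)
    n = min(len(preds), len(rets), window)
    if n < 2:
        return float("nan")
    return float(np.corrcoef(preds[-n:], rets[-n:])[0, 1])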

4. The Feature-Training Mismatch Problem

This is the number one deployment bug in ML trading systems, and it's insidious.

The scenario: you trained your model on 38 features. You deploy it. Everything works. Then 3 months later, you add a new feature to your pipeline. Now your pipeline computes 39 features. Your model expects 38. What happens?

If you're lucky: it crashes with a dimension mismatch error. You fix it immediately.

If you're unlucky: the extra feature gets silently dropped, or the features shift by one position. RSI gets fed where MACD should be. The model doesn't crash; it just gives you subtly wrong predictions. You lose money for weeks before noticing.

The fix:
- Store the exact feature list and order WITH the model
- At inference time, verify: features_received == features_expected
- Fail LOUDLY if they don't match; do NOT try to "fix" mismatched features at runtime
- Version your feature pipeline alongside your model: they're a package deal

This is why experienced ML engineers are particularly careful about feature pipelines. The model is the easy part. Getting the right features to the right model in the right order at the right time is where the real engineering happens.
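One way to enforce the contract is to save the feature names next to the model at training time and rebuild the vector from that list at inference time. A minimal sketch; the file layout and function names are assumptions for illustration.

python
import json
import numpy as np

def save_feature_contract(path, feature_names):
    """At training time: store the exact feature list and order next to the model."""
    with open(path, "w") as f:
        json.dump(list(feature_names), f)

def build_feature_vector(feature_dict, contract_path):
    """At inference time: fail loudly if live features don't match the training contract."""
    with open(contract_path) as f:
        expected = json.load(f)

    received = set(feature_dict)
    missing = [name for name in expected if name not in received]
    extra = sorted(received - set(expected))
    if missing or extra:
        raise ValueError(f"Feature mismatch: missing={missing}, extra={extra}")

    # Order comes from the contract, never from the live pipeline
    return np.array([feature_dict[name] for name in expected], dtype=float)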

5. Slippage: The Silent Performance Killer

Your model says "BUY EURUSD" and your backtest shows entry at 1.08500. But in live trading, you actually get filled at 1.08517. That 1.7 pip difference is slippage.

Slippage sources:
- Execution delay: time between the signal and the order reaching the broker
- Spread: the bid-ask spread widens during volatile periods
- Market impact: your order moves the price (mostly relevant for large orders)
- Requotes: the broker rejects your price and fills at a different one

How much does it actually matter? At 0.5 pips average slippage x 600 trades/year x $10/pip = $3,000/year. That's real money, and your backtest doesn't account for it unless you explicitly model it.

Production mitigation:
- Use limit orders instead of market orders when possible
- Account for expected slippage in your backtest assumptions
- Monitor actual slippage vs. expected; if it's consistently higher, investigate
- Trade during high-liquidity hours for tighter spreads

If your strategy barely breaks even in backtest, slippage will kill it in live trading. You need enough edge to absorb these real-world costs.

Figure: Slippage Impact on Annual Returns (600 trades/year)
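The arithmetic above is straightforward to track in code. A small sketch that converts fills into slippage and projects the annual cost; the pip size and the $10-per-pip value (roughly one standard EURUSD lot) are assumptions for the example.

python
PIP = 0.0001            # EURUSD pip size
PIP_VALUE_USD = 10.0    # assumed dollar value of 1 pip for one standard lot

def slippage_pips(expected_price, fill_price, side):
    """Positive result means you were filled at a worse price than expected."""
    diff = (fill_price - expected_price) if side == "buy" else (expected_price - fill_price)
    return diff / PIP

def annualized_slippage_cost(avg_slippage_pips, trades_per_year=600):
    """E.g. 0.5 pips * 600 trades * $10/pip = $3,000/year."""
    return avg_slippage_pips * trades_per_year * PIP_VALUE_USD

# Example from the text:
# slippage_pips(1.08500, 1.08517, "buy")  -> 1.7 pips
# annualized_slippage_cost(0.5)           -> 3000.0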

Hands-On Code

Model Deployment with Monitoring

python
import hashlib
import pickle
import numpy as np
import logging

logger = logging.getLogger('model_deployment')

class ModelDeployment:
    """Deploy and monitor ML models in production."""
    
    def __init__(self, model_path, expected_hash):
        self.model = self._load_verified(model_path, expected_hash)
        self.predictions = []
        self.prediction_count = 0
    
    def _load_verified(self, path, expected_hash):
        """Load model with hash verification."""
        with open(path, 'rb') as f:
            data = f.read()
        
        actual_hash = hashlib.md5(data).hexdigest()
        if actual_hash != expected_hash:
            raise ValueError(f"Model hash mismatch! Expected {expected_hash}, got {actual_hash}")
        
        logger.info(f"Model loaded and verified: {path}")
        return pickle.loads(data)
    
    def predict(self, features):
        """Make prediction with monitoring."""
        try:
            pred = self.model.predict_proba(features.reshape(1, -1))[0, 1]
            
            # Monitor prediction distribution
            self.predictions.append(pred)
            self.prediction_count += 1
            
            if self.prediction_count % 100 == 0:
                recent = self.predictions[-100:]
                logger.info(
                    f"Prediction stats (last 100): "
                    f"mean={np.mean(recent):.3f}, "
                    f"std={np.std(recent):.3f}"
                )
            
            return pred
            
        except Exception as e:
            logger.error(f"Prediction failed: {e}")
            return None  # graceful fallback

Hash verification ensures you're running the correct model version. Prediction monitoring catches drift and degradation early. Graceful error handling prevents failures from cascading. Unglamorous code, but it's the difference between a hobby project and a production system.
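A possible way to wire this up, assuming the expected hash is read from a manifest file; the manifest format, file names, and paths below are illustrative.

python
import json

# Hypothetical manifest mapping model paths to their frozen hashes,
# e.g. {"models/xgb_v12.pkl": "0a1b2c3d..."}; path and filename are illustrative
with open("manifest.json") as f:
    manifest = json.load(f)

model_path = "models/xgb_v12.pkl"
deployment = ModelDeployment(model_path, expected_hash=manifest[model_path])

# features = ...   # 1-D numpy array, in the exact order used at training time
# prob = deployment.predict(features)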

Knowledge Check

Q1. Your model's rolling win rate dropped from 59% to 48% over the last month. What should you do?

Assignment

Implement model deployment with hash verification and prediction monitoring. Load your L1 model, make 1000 predictions on test data, and track the prediction distribution. Set up alerts for distribution shift.