Beginner • Week 2 • Lesson 4 • Duration: 50 min

Python for Quantitative Finance

The tools most quant teams actually use every day

Learning Objectives

  • Set up a Python environment for quantitative finance
  • Use pandas DataFrames for financial time series
  • Understand numpy vectorized operations and why they matter
  • Build a basic data pipeline from raw data to analysis

Explain Like I'm 5

Python is like a powerful calculator that crunches millions of numbers in seconds. Instead of staring at charts hoping to spot patterns, you write a few instructions and the computer finds them faster, more accurately, and without emotional bias. Every major quant firm uses Python as a primary research language.

Think of It This Way

If quant trading is professional cooking, Python is your industrial kitchen. You could cook over a campfire (manual analysis), but a professional kitchen — with pandas for prep, numpy for fast computation, and scikit-learn for ML — lets you operate at institutional scale.

1. Why Python Won

Python became the standard for quant finance research, and the reasons are practical.

The ecosystem is deep:

  • pandas — built for financial time series with datetime indexing and rolling windows
  • numpy — vectorized math, typically 100x faster than Python loops
  • scipy — statistics, optimization, signal processing
  • scikit-learn — clean ML API for classification, regression, clustering
  • XGBoost / LightGBM — gradient-boosted trees dominating tabular prediction
  • PyTorch — deep learning for sequence models (LSTMs, Transformers)

Speed of iteration: you can test ideas in hours instead of days. That matters when most hypotheses fail.

Community: the largest community of quant practitioners means most problems have documented solutions.

C++ is faster for HFT execution. R is still strong for pure statistics. But Python offers the best balance of capability, development speed, and ecosystem breadth.

Reference: Hilpisch, Y. (2018). Python for Finance. O'Reilly Media.

2. pandas — Working with Financial Data

pandas DataFrames are how you'll spend most of your time working with data. The key concepts:

  • DataFrame — 2D structure. Rows = timestamps, columns = OHLCV.
  • Series — a single column (like closing prices).
  • DatetimeIndex — enables resampling, rolling windows, and date slicing.
  • Vectorized operations — apply computations to entire columns at once.

Performance rule: never loop through rows with `for`. pandas uses optimized C/Cython under the hood. A loop over 1M rows takes ~30 seconds; the vectorized version takes milliseconds.

The operations you'll use constantly (demonstrated in the sketch below):

  • `.pct_change()` — period-over-period returns
  • `.rolling(n).mean()` / `.std()` — moving averages and rolling volatility
  • `.shift(n)` — lag or lead a series (critical for avoiding look-ahead bias)
  • `.resample()` — convert between timeframes (M15 → H1 → D1)
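Here is a minimal sketch of all four operations on a synthetic price series (the series is made up for illustration; any real OHLCV feed works the same way):

python
import pandas as pd
import numpy as np

# Synthetic M15 price series (illustration only; use any real OHLCV source)
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=1_000, freq="15min")
close = pd.Series(1.10 + rng.standard_normal(1_000).cumsum() * 1e-4,
                  index=idx, name="close")

returns = close.pct_change()            # period-over-period returns
sma_20  = close.rolling(20).mean()      # 20-bar moving average
vol_20  = returns.rolling(20).std()     # 20-bar rolling volatility
lagged  = returns.shift(1)              # previous bar only: no look-ahead
hourly  = close.resample("1h").last()   # M15 -> H1 downsampling

Note the `.shift(1)` pattern: any feature used to predict bar t must be built only from bars up to t−1, and shifting is the standard way to enforce that.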

3. numpy — The Fast Math Layer

numpy handles the heavy computation behind everything in Python quant work. What you'll use:

  • `np.array` — fast, memory-efficient arrays with vectorized math
  • `np.mean`, `np.std`, `np.percentile` — statistical aggregates
  • `np.random` — random number generation for Monte Carlo
  • `np.corrcoef` — correlation matrices for portfolio analysis
  • `np.linalg` — linear algebra (regression, PCA, eigendecomposition)

Monte Carlo simulations — the backbone of risk management — run on numpy. A simulation needing 20,000 iterations finishes in seconds; with Python loops, it would take minutes. The speed difference isn't marginal — it's typically 100x or more. Any computation you run in production (backtesting, live signals, risk calculations) needs numpy vectorization.
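As a concrete example, here is a minimal vectorized Monte Carlo sketch for one-day Value-at-Risk; the drift and volatility numbers are hypothetical placeholders, not calibrated estimates:

python
import numpy as np

# Hypothetical per-bar return parameters; calibrate to real data in practice
mu, sigma = 0.0004, 0.008
n_sims, horizon = 20_000, 96   # 20k paths, 96 M15 bars (~one FX trading day)

rng = np.random.default_rng(seed=42)

# One vectorized draw: a (n_sims, horizon) matrix of simulated returns
paths = rng.normal(mu, sigma, size=(n_sims, horizon))

# Compound each path into a horizon return, with no Python loop anywhere
horizon_returns = np.prod(1 + paths, axis=1) - 1

var_95 = np.percentile(horizon_returns, 5)   # 5th percentile = 95% VaR
print(f"95% one-day VaR: {var_95:.4%}")

All 20,000 paths are generated and compounded in two array operations, which is exactly the pattern that makes numpy Monte Carlo fast.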

[Chart: Python Loops vs numpy Vectorization — Execution Time]
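You can reproduce the comparison behind this chart yourself; exact numbers depend on your machine, so treat this as a sketch:

python
import time
import numpy as np

x = np.random.randn(1_000_000)

# Pure-Python loop: sum of squares one element at a time
t0 = time.perf_counter()
total = 0.0
for v in x:
    total += v * v
loop_time = time.perf_counter() - t0

# numpy vectorized: the same sum of squares in a single call
t0 = time.perf_counter()
total_vec = np.dot(x, x)
vec_time = time.perf_counter() - t0

print(f"loop: {loop_time:.3f}s  numpy: {vec_time:.5f}s  "
      f"speedup: {loop_time / vec_time:.0f}x")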

Key Formulas

Simple Returns

Arithmetic returns: r_t = (P_t − P_{t−1}) / P_{t−1} = P_t / P_{t−1} − 1, computed with df["close"].pct_change(). Intuitive for single-period analysis but not additive across time.
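A quick numeric check of the "not additive" caveat: a +10% move followed by a −10% move nets to −1%, not zero.

python
# Simple returns do not add across periods: +10% then -10% is a net loss
p0, p1, p2 = 100.0, 110.0, 99.0
r1 = p1 / p0 - 1       #  0.10
r2 = p2 / p1 - 1       # -0.10
total = p2 / p0 - 1    # -0.01, not r1 + r2 (which is ~0.0)
print(r1 + r2, total)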

Logarithmic Returns

Log returns: r_t = ln(P_t / P_{t−1}). They are additive over time: the multi-period log return equals the sum of the individual-period log returns. This is why they're preferred for statistical modeling and multi-period analysis.
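The additivity property in action, using the same prices as above:

python
import numpy as np

prices = np.array([100.0, 110.0, 99.0])
log_r = np.log(prices[1:] / prices[:-1])   # per-period log returns

# Sum of per-period log returns equals the log of the total return
print(log_r.sum())                         # -0.010050...
print(np.log(prices[-1] / prices[0]))      # same value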

Hands-On Code

Financial Data Pipeline — From Raw Data to Analysis

python
import pandas as pd
import numpy as np

# --- Load and prepare financial data ---
df = pd.read_csv('EURUSD_M15.csv', parse_dates=['time'])
df = df.set_index('time').sort_index()

# --- Compute returns and core features ---
df['returns']      = df['close'].pct_change()
df['log_returns']  = np.log(df['close'] / df['close'].shift(1))
df['volatility']   = df['returns'].rolling(20).std()
df['sma_50']       = df['close'].rolling(50).mean()
df['sma_200']      = df['close'].rolling(200).mean()

# --- Summary statistics (fully vectorized) ---
print(f"Mean per-bar return: {df['returns'].mean():.6f}")
print(f"Per-bar volatility:  {df['returns'].std():.6f}")

# Annualize from M15 bars: ~96 bars/day in 24h FX, ~252 trading days/year
bars_per_year = 252 * 96
sharpe = df['returns'].mean() / df['returns'].std() * np.sqrt(bars_per_year)
print(f"Annualized Sharpe:   {sharpe:.2f}")

# --- Identify high-volatility regimes ---
vol_95 = df['volatility'].quantile(0.95)
high_vol = df[df['volatility'] > vol_95]
print(f"High-vol bars:      {len(high_vol)} ({len(high_vol)/len(df)*100:.1f}%)")

This is the skeleton of every quant pipeline: load data, compute features with vectorized operations, generate summary stats. No for loops anywhere. This pattern scales from prototype to production.

Knowledge Check

Q1. Why should `for` loops be avoided when processing pandas DataFrames?

Q2. What is the primary advantage of logarithmic returns over simple returns?

Assignment

Download OHLCV data for any major currency pair. Compute: daily returns, 20-day rolling volatility, 50 and 200-day moving averages, and RSI(14). Create a 4-panel matplotlib figure with all indicators. Find periods where volatility exceeds its 95th percentile and research what was happening in markets at those times.
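Everything in the assignment except RSI(14) has a one-line pandas idiom shown earlier. For RSI, here is a minimal sketch using Wilder's smoothing, one common formulation (the `rsi` helper and its signature are our own, not a library function; `close` is assumed to be a pandas Series of closing prices):

python
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI via Wilder's smoothing (an EMA with alpha = 1/period)."""
    delta = close.diff()
    gain = delta.clip(lower=0.0)            # up-moves only
    loss = -delta.clip(upper=0.0)           # down-moves only, as positives
    avg_gain = gain.ewm(alpha=1 / period, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, min_periods=period).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)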