Pandas for Finance: Essential DataFrame Operations
If you do any quantitative work in Python, you will live in Pandas. For finance specifically, a handful of patterns cover 80 percent of what you will ever need. This tutorial walks through those patterns using real price data.
Reading Price Data
Most financial data starts life as a CSV. The critical thing is to parse the date column properly and use it as the index - this unlocks all of Pandas time series magic. If you skip this step, half the library stops working.
import pandas as pd
# Read a CSV of historical prices
df = pd.read_csv('AAPL.csv', parse_dates=['Date'], index_col='Date')
# Basic inspection
print(df.head())
print(df.info())
print(df.describe())
Filtering Data
Pandas boolean indexing is one of the most expressive query interfaces in any language. You build a boolean mask and index the DataFrame with it. For date ranges, use .loc with slice notation - this only works if your index is a DatetimeIndex.
# Filter by date range
recent = df.loc['2024-01-01':'2024-12-31']
# Filter by condition
big_moves = df[df['Close'].pct_change().abs() > 0.03]
# Multiple conditions
green_high_volume = df[(df['Close'] > df['Open']) & (df['Volume'] > df['Volume'].mean())]
GroupBy for Aggregation
GroupBy is how you roll data up to different granularities. The split-apply-combine pattern is deceptively powerful: split rows into groups, apply a function to each group, combine results. For time series, pd.Grouper lets you group by date frequency without needing to create derived columns.
# Monthly average close
monthly_avg = df.groupby(pd.Grouper(freq='M'))['Close'].mean()
# By year and month
df['Year'] = df.index.year
df['Month'] = df.index.month
summary = df.groupby(['Year', 'Month']).agg({
'Close': ['mean', 'std'],
'Volume': 'sum'
})
Merging Multiple Assets
To compute cross-asset statistics like correlation, you need data aligned on the same dates. pd.merge does SQL-style joins - inner joins drop dates that do not exist in both, which you usually want for correlation or regression analysis.
import pandas as pd
aapl = pd.read_csv('AAPL.csv', parse_dates=['Date'])
msft = pd.read_csv('MSFT.csv', parse_dates=['Date'])
# Merge on Date
combined = pd.merge(
aapl[['Date', 'Close']].rename(columns={'Close': 'AAPL'}),
msft[['Date', 'Close']].rename(columns={'Close': 'MSFT'}),
on='Date',
how='inner'
)
# Correlation
print(combined[['AAPL', 'MSFT']].corr())
Resampling and Rolling Windows
This is where Pandas really earns its place in finance. resample converts between frequencies (daily to weekly, monthly to yearly). rolling applies a function over a sliding window - the foundation of every moving average, volatility estimate, and momentum indicator you will ever build.
# Daily to weekly
weekly = df['Close'].resample('W').agg({
'open': 'first',
'high': 'max',
'low': 'min',
'close': 'last'
})
# Daily to monthly returns
monthly_returns = df['Close'].resample('M').last().pct_change()
# Rolling 20-day average (simple moving average)
df['SMA20'] = df['Close'].rolling(window=20).mean()
# Rolling volatility (annualised)
df['Vol20'] = df['Close'].pct_change().rolling(20).std() * (252 ** 0.5)
Common Pitfalls
- Look-ahead bias: when calculating rolling features, make sure the window only uses past data. The default rolling is left-inclusive so a rolling(20).mean() at row i uses rows i-19 through i. That is fine. Rolling operations that use centered windows leak future data into features.
- Returns vs log returns:
pct_change()gives simple returns. For compounding and most statistical analysis, usenp.log(df / df.shift(1)). - Missing dates: markets are closed on weekends and holidays. If you want a business-day index, use
pd.bdate_rangeor reindex with a business day frequency. - Timezones: mixing timezone-aware and naive timestamps throws errors. Pick one (usually UTC) and stick with it.
- SettingWithCopyWarning: if you see this, use
.loc[row, col] = valueexplicitly instead of chained assignment.
Where to Go Next
Once these patterns are second nature, the next step is NumPy for when Pandas is too slow, and libraries like pandas-ta or bt for ready-made indicators and backtesting. But you can build an entire research workflow with just Pandas and matplotlib, and that is where most quants spend most of their time.
Risk Warning
Code examples are for educational purposes. Past market data does not predict future results. Test thoroughly before using in production.