Python Data Science
Python data science is dominated by two traps: using loops where vectorized operations exist, and reaching for complex solutions when pandas already has a built-in method. The expert mindset is columnar — think about operations on entire columns at once, not row by row. Know when pandas is fast (vectorized NumPy ops), when it's slow (apply(), object dtype, Python loops), and when polars is the better choice entirely.
Core Mental Model
DataFrames are columnar: each column is a contiguous array in memory. Operations on entire columns are handed off to C/NumPy implementations and run at near-C speed. Operations that process row-by-row in Python (apply, for loops, iterrows) are 10-1000x slower. The query pipeline is: load → filter → transform → aggregate → shape output. Filter early (reduce rows), select columns you need (reduce memory), then transform. Always profile before optimizing — most "slow" pandas code has one bottleneck, not many.
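A minimal timing sketch of the gap that claim refers to, using synthetic data and hypothetical column names (absolute timings vary by machine, but the ordering does not):

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": np.random.rand(50_000),
    "quantity": np.random.randint(1, 10, 50_000),
})

# Row-by-row: every iteration crosses the Python/C boundary
t0 = time.perf_counter()
loop_total = [row["price"] * row["quantity"] for _, row in df.iterrows()]
loop_time = time.perf_counter() - t0

# Columnar: one NumPy multiply over contiguous arrays
t0 = time.perf_counter()
vec_total = df["price"] * df["quantity"]
vec_time = time.perf_counter() - t0

assert np.allclose(loop_total, vec_total)  # same result, very different cost
print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.4f}s")
```

Profiling one stage like this before rewriting a pipeline usually reveals that a single row-wise step accounts for nearly all the runtime.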
Pandas vs Polars
Use pandas when:
- Ecosystem integration needed (scikit-learn, statsmodels, seaborn)
- Small-medium datasets (< 1GB)
- Exploratory work, Jupyter notebooks
- Complex string operations, apply() with complex logic
- GroupBy with custom aggregations
Use polars when:
- Large datasets (> 1GB) where pandas is slow/OOM
- Multi-threaded computation needed (polars uses all cores)
- Lazy evaluation + query optimization (like Spark but in-process)
- Streaming data larger than RAM
- Pipeline-heavy ETL transformations
Speed comparison: polars is typically 5-20x faster than pandas on large datasets
# Polars lazy API: query optimizer rewrites the plan
import polars as pl
# Lazy scan — doesn't load data until collect()
df = (
pl.scan_parquet("large_dataset/*.parquet")
.filter(pl.col("status").eq("active"))
.filter(pl.col("created_at") >= pl.lit("2024-01-01").str.to_date())
.select([
"user_id", "amount", "category", "created_at"
])
.with_columns([
pl.col("amount").log1p().alias("log_amount"),
pl.col("created_at").dt.month().alias("month"),
(pl.col("amount") / pl.col("amount").mean().over("category")).alias("amount_vs_cat_mean")
])
.group_by(["user_id", "month"])
.agg([
pl.col("amount").sum().alias("total_spend"),
pl.col("amount").count().alias("transaction_count"),
pl.col("category").n_unique().alias("category_diversity")
])
.sort("total_spend", descending=True)
.collect() # execute the plan here
)
# Polars expression context (no UDFs needed for most ops)
df.with_columns([
# String ops
pl.col("email").str.to_lowercase().alias("email_lower"),
pl.col("name").str.split(" ").list.first().alias("first_name"),
# Conditional
pl.when(pl.col("amount") > 1000).then(pl.lit("high"))
.when(pl.col("amount") > 100).then(pl.lit("medium"))
.otherwise(pl.lit("low")).alias("amount_tier"),
# Window function
pl.col("amount").rank("dense").over("category").alias("rank_in_category")
])
Pandas Vectorized Idioms
import pandas as pd
import numpy as np
# ❌ Loop (10-100x slower)
for idx, row in df.iterrows():
    df.at[idx, 'total'] = row['price'] * row['quantity']
# ✅ Vectorized
df['total'] = df['price'] * df['quantity']
# ❌ apply() for simple operations
df['tax'] = df['total'].apply(lambda x: x * 0.08)
# ✅ Vectorized arithmetic
df['tax'] = df['total'] * 0.08
# ✅ np.where for if/else (fastest conditional)
df['tier'] = np.where(df['total'] > 1000, 'high', 'low')
# ✅ np.select for multiple conditions
conditions = [
df['total'] > 1000,
df['total'] > 100,
df['total'] > 10
]
choices = ['high', 'medium', 'low']
df['tier'] = np.select(conditions, choices, default='minimal')
# ✅ .str accessor for string operations
df['email_domain'] = df['email'].str.split('@').str[-1]
df['name_upper'] = df['name'].str.upper()
df['has_promo'] = df['code'].str.contains(r'^PROMO\d{4}$', na=False)
# ✅ .dt accessor for datetime operations
df['year'] = df['created_at'].dt.year
df['day_of_week'] = df['created_at'].dt.day_name()
df['days_since'] = (pd.Timestamp.now() - df['created_at']).dt.days
df['week_start'] = df['created_at'].dt.to_period('W').dt.start_time
Method Chaining with pipe()
# Method chaining: readable, testable pipeline
def clean_emails(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(email=df['email'].str.lower().str.strip())

def add_customer_age(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(
        customer_age_days=(pd.Timestamp.now() - df['created_at']).dt.days
    )

def filter_active(df: pd.DataFrame, min_orders: int = 1) -> pd.DataFrame:
    return df[df['order_count'] >= min_orders]

def add_spend_tier(df: pd.DataFrame) -> pd.DataFrame:
    conditions = [df['ltv'] > 1000, df['ltv'] > 100]
    choices = ['vip', 'regular']
    return df.assign(spend_tier=np.select(conditions, choices, default='new'))
# Clean pipeline: each step is pure function, easily unit-tested
result = (
raw_df
.pipe(clean_emails)
.pipe(add_customer_age)
.pipe(filter_active, min_orders=2)
.pipe(add_spend_tier)
.query("customer_age_days < 365")
.sort_values("ltv", ascending=False)
.reset_index(drop=True)
)
GroupBy Patterns
# transform: broadcast aggregated value back to original shape (for new column)
# Use when: you want group stats AS a column in the original DataFrame
df['mean_amount_in_group'] = df.groupby('category')['amount'].transform('mean')
df['amount_z_score'] = (
(df['amount'] - df.groupby('category')['amount'].transform('mean')) /
df.groupby('category')['amount'].transform('std')
)
# agg: aggregate down to one row per group
# Use when: you want summary statistics per group
summary = df.groupby('category').agg(
total_revenue=('amount', 'sum'),
order_count=('order_id', 'nunique'),
avg_amount=('amount', 'mean'),
p90_amount=('amount', lambda x: x.quantile(0.9)),
first_order=('created_at', 'min'),
last_order=('created_at', 'max')
).reset_index()
# apply: arbitrary function per group (use only when agg/transform won't do)
# Note: 5-10x slower than agg/transform — use as last resort
def custom_cohort_metric(group):
    return pd.Series({
        'cohort_ltv': group['amount'].sum() / group['customer_id'].nunique(),
        'repurchase_rate': (group.groupby('customer_id').size() > 1).mean()
    })

cohort_metrics = df.groupby('cohort_month').apply(custom_cohort_metric).reset_index()
# Named aggregation (pandas 0.25+): cleaner than dictionary agg
result = df.groupby(['region', 'category']).agg(
revenue=('amount', 'sum'),
orders=('order_id', 'count'),
customers=('customer_id', 'nunique')
)
Time Series with DatetimeIndex
# Set DatetimeIndex for time series operations
ts = df.set_index('timestamp').sort_index()
# Resample: like groupby for time intervals
daily_revenue = (
ts['amount']
.resample('D') # 'D'=day, 'W'=week, 'ME'=month-end ('M' is deprecated since pandas 2.2), 'h'=hour
.agg(['sum', 'count', 'mean'])
.rename(columns={'sum': 'revenue', 'count': 'orders', 'mean': 'avg_order'})
)
# Rolling window statistics
ts['revenue_7d_ma'] = ts['amount'].rolling('7D').mean() # 7-day moving average
ts['revenue_7d_std'] = ts['amount'].rolling('7D').std()
ts['revenue_28d_sum'] = ts['amount'].rolling('28D').sum()
# Expanding window (cumulative)
ts['cumulative_revenue'] = ts['amount'].expanding().sum()
# Lag/lead features for forecasting
ts['revenue_lag_7d'] = ts['amount'].shift(7) # value from 7 periods ago
ts['revenue_lag_28d'] = ts['amount'].shift(28)
ts['revenue_pct_change_7d'] = ts['amount'].pct_change(7)
# Period comparisons
daily_revenue['wow_growth'] = daily_revenue['revenue'].pct_change(7)   # week-over-week
daily_revenue['yoy_growth'] = daily_revenue['revenue'].pct_change(365) # year-over-year
# Localize and convert timezones
ts.index = ts.index.tz_localize('UTC').tz_convert('America/Chicago')
Memory Optimization
# dtype selection: massive impact on memory usage
print(df.memory_usage(deep=True).sum() / 1e6, "MB") # baseline
# Downcast integers
df['user_id'] = pd.to_numeric(df['user_id'], downcast='unsigned')
df['count'] = df['count'].astype('int16') # if max < 32767
df['amount'] = df['amount'].astype('float32') # if full precision not needed
# Categorical for low-cardinality strings (huge savings)
df['status'] = df['status'].astype('category') # 10 unique values → ~95% memory saving
df['country'] = df['country'].astype('category')
df['category'] = df['category'].astype('category')
# Before/after
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
# Chunked reading for large files
chunk_results = []
for chunk in pd.read_csv('huge_file.csv', chunksize=100_000,
                         dtype={'user_id': 'int32', 'amount': 'float32'},
                         parse_dates=['created_at']):
    # Process each chunk, keep only the result
    processed = chunk.pipe(your_processing_pipeline)
    chunk_results.append(processed)
result = pd.concat(chunk_results, ignore_index=True)
# Select only needed columns on load (skip reading unwanted columns)
df = pd.read_csv('data.csv', usecols=['user_id', 'amount', 'created_at', 'status'])
# Sparse arrays for data with many NaN values
df['optional_field'] = pd.arrays.SparseArray(df['optional_field'])
EDA Checklist
def eda(df: pd.DataFrame, target_col: str | None = None) -> None:
    """Systematic exploratory data analysis."""
    print("=== SHAPE ===")
    print(f"Rows: {len(df):,}, Cols: {df.shape[1]}")
    print("\n=== DTYPES & NULLS ===")
    null_report = pd.DataFrame({
        'dtype': df.dtypes,
        'nulls': df.isnull().sum(),
        'null_pct': (df.isnull().sum() / len(df) * 100).round(1),
        'nunique': df.nunique()
    })
    print(null_report[null_report['nulls'] > 0].sort_values('null_pct', ascending=False))
    print("\n=== NUMERIC DISTRIBUTIONS ===")
    print(df.describe(percentiles=[.01, .05, .25, .5, .75, .95, .99]).round(2))
    print("\n=== CATEGORICAL DISTRIBUTIONS ===")
    for col in df.select_dtypes(include=['object', 'category']).columns:
        vc = df[col].value_counts()
        print(f"\n{col}: {df[col].nunique()} unique")
        print(vc.head(10).to_string())
    if target_col and target_col in df.columns:
        print(f"\n=== TARGET: {target_col} ===")
        print(df[target_col].describe())
        print("\n=== CORRELATIONS WITH TARGET ===")
        correlations = df.select_dtypes('number').corr()[target_col].abs()
        print(correlations.sort_values(ascending=False).head(20))
    print("\n=== DUPLICATE ROWS ===")
    print(f"Exact duplicates: {df.duplicated().sum():,}")
    print("\n=== DATE RANGE (if datetime cols) ===")
    for col in df.select_dtypes(include=['datetime64']).columns:
        print(f"{col}: {df[col].min()} → {df[col].max()}")
Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Style
sns.set_theme(style="darkgrid", palette="muted")
plt.rcParams['figure.dpi'] = 120
# Correlation heatmap
def plot_correlation_heatmap(df: pd.DataFrame, figsize=(12, 10)):
    numeric_df = df.select_dtypes('number')
    corr = numeric_df.corr()
    mask = np.triu(np.ones_like(corr, dtype=bool))  # hide upper triangle
    fig, ax = plt.subplots(figsize=figsize)
    sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',
                center=0, vmin=-1, vmax=1, ax=ax,
                cbar_kws={'shrink': 0.8})
    ax.set_title('Feature Correlations', fontsize=14)
    plt.tight_layout()
    return fig
# Distribution comparison
def plot_distributions(df: pd.DataFrame, cols: list, hue: str | None = None):
    n = len(cols)
    fig, axes = plt.subplots(1, n, figsize=(5 * n, 4))
    if n == 1:
        axes = [axes]
    for ax, col in zip(axes, cols):
        sns.histplot(data=df, x=col, hue=hue, kde=True, ax=ax)
        ax.set_title(col)
    plt.tight_layout()
    return fig
# Time series with matplotlib (better than seaborn for complex time charts)
def plot_time_series(ts: pd.Series, title: str, ma_window: int = 7):
    fig, ax = plt.subplots(figsize=(14, 5))
    ax.plot(ts.index, ts.values, alpha=0.4, linewidth=1, label='Daily')
    ax.plot(ts.index, ts.rolling(ma_window).mean(), linewidth=2,
            label=f'{ma_window}-day MA')
    ax.set_title(title)
    ax.legend()
    ax.xaxis.set_major_formatter(plt.matplotlib.dates.DateFormatter('%Y-%m'))
    plt.xticks(rotation=45)
    plt.tight_layout()
    return fig
# Interactive: plotly for dashboards / notebooks
import plotly.express as px
fig = px.scatter(df, x='amount', y='frequency', color='category',
size='ltv', hover_data=['user_id'],
title='RFM Scatter Plot')
fig.show()
Anti-Patterns
# ❌ iterrows() — Python loop, 100-1000x slower than vectorized
for idx, row in df.iterrows():
    result.append(compute(row))
# ✅ df.apply(compute, axis=1) is cleaner but still a row-wise Python loop; best is vectorized column ops
# ❌ Concatenating in a loop (O(n²) memory copying)
result = pd.DataFrame()
for chunk in chunks:
    result = pd.concat([result, chunk])
# ✅ Collect then concat once
result = pd.concat(chunks, ignore_index=True)
# ❌ Chained indexing (SettingWithCopyWarning, silent bugs)
df[df['status'] == 'active']['amount'] = 0
# ✅ Use .loc with boolean mask
df.loc[df['status'] == 'active', 'amount'] = 0
# ❌ object dtype for numeric columns (slow, memory-hungry)
df['amount'] = df['amount'].astype(object) # accidentally or from CSV
# ✅ Always enforce dtypes on load
df = pd.read_csv('data.csv', dtype={'amount': 'float32', 'user_id': 'int32'})
# ❌ apply() for simple math
df['total'] = df.apply(lambda r: r['price'] * r['qty'], axis=1)
# ✅ Vectorized arithmetic
df['total'] = df['price'] * df['qty']
Quick Reference
Vectorization Speed Order (fastest → slowest):
1. Built-in pandas/NumPy operations (sum, mean, std)
2. NumPy ufuncs on .values array
3. .str / .dt accessors
4. np.where / np.select for conditionals
5. .apply(lambda, axis=0) — column-wise
6. .apply(lambda, axis=1) — row-wise (use sparingly)
7. iterrows() / itertuples() ← NEVER for large datasets
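A quick sketch to see two of these tiers for yourself, comparing tier 4 (np.where) against tier 6 (row-wise apply) on synthetic data; absolute numbers vary by machine, the ordering should not:

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({"total": np.random.rand(100_000) * 2000})

def bench(fn):
    """Time a callable once and return (seconds, result)."""
    t0 = time.perf_counter()
    out = fn()
    return time.perf_counter() - t0, out

# Tier 4: vectorized conditional
t_where, tier_where = bench(lambda: np.where(df["total"] > 1000, "high", "low"))

# Tier 6: row-wise apply (a Python-level loop under the hood)
t_apply, tier_apply = bench(
    lambda: df.apply(lambda r: "high" if r["total"] > 1000 else "low", axis=1)
)

assert (tier_where == tier_apply.to_numpy()).all()  # identical answers
print(f"np.where: {t_where:.4f}s  apply(axis=1): {t_apply:.3f}s")
```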
GroupBy Guide:
New column with group stat → transform
Summary table → agg
Complex custom logic → apply (last resort)
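The three choices above differ first in output shape, which a toy DataFrame makes concrete:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b"],
    "amount":   [10,  20,  30,  40,  50],
})

# transform: same length as the input -> safe to assign as a new column
group_mean = df.groupby("category")["amount"].transform("mean")
assert len(group_mean) == len(df)

# agg: one row per group -> a summary table
summary = df.groupby("category")["amount"].agg(["sum", "mean"])
assert len(summary) == df["category"].nunique()

# Typical transform use: deviation from the group statistic
df["vs_group_mean"] = df["amount"] - group_mean
print(df)
```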
Memory Hierarchy:
bool 1 byte
int8/uint8 1 byte
int16/uint16 2 bytes
int32/float32 4 bytes
int64/float64 8 bytes (default)
object variable (worst)
category ~4 bytes/value (for repeated strings)
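These sizes can be checked from NumPy dtype metadata; the categorical figure depends on cardinality, so it is best measured rather than assumed (synthetic two-value column below):

```python
import numpy as np
import pandas as pd

# Fixed-width dtypes: itemsize is the per-value cost in bytes
assert np.dtype("bool").itemsize == 1
assert np.dtype("int8").itemsize == 1
assert np.dtype("int16").itemsize == 2
assert np.dtype("float32").itemsize == 4
assert np.dtype("float64").itemsize == 8

# Categorical: per-value cost approaches the size of the integer codes
s = pd.Series(["active", "inactive"] * 50_000)
obj_bytes = s.memory_usage(deep=True)
cat_bytes = s.astype("category").memory_usage(deep=True)
print(f"object: {obj_bytes / 1e6:.1f} MB  category: {cat_bytes / 1e6:.2f} MB")
assert cat_bytes < obj_bytes / 10  # large savings at low cardinality
```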
Polars vs Pandas Decision:
> 1GB dataset → polars
Multi-core needed → polars
scikit-learn integration → pandas
Complex apply() logic → pandas (or polars expr)
Skill Information
- Source: MoltbotDen
- Category: Data & Analytics
- Repository: View on GitHub