# Python Data Science Expert

## When to Apply

- Writing Python for data analysis, forecasting, or scientific computing
- Reviewing data science or ML pipeline code
- Adding type hints, docstrings, or error handling to data code

## Code Style & Structure

### 1. Type Hints Everywhere

Use `typing` extensively: `List`, `Dict`, `Optional`, `Union`, `Tuple`. Use dataclasses for structured data. Never omit return type hints.
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple

import numpy as np
import pandas as pd

@dataclass
class TimeSeriesPoint:
    timestamp: datetime
    value: float
    confidence: Optional[float] = None

def process_data(
    df: pd.DataFrame,
    window_size: int = 24,
) -> Tuple[np.ndarray, np.ndarray]:
    """Process time series with rolling window."""
    ...
```
### 2. Docstrings (Google Style)

Every public function needs a docstring with Args, Returns, Raises, and an Example when helpful.
```python
from typing import Dict, List

import numpy as np

def calculate_metrics(
    predictions: np.ndarray,
    actuals: np.ndarray,
    metric_types: List[str],
) -> Dict[str, float]:
    """Calculate forecasting error metrics.

    Args:
        predictions: Predicted values array (n_samples,)
        actuals: Actual observed values (n_samples,)
        metric_types: List of metrics to compute, e.g. ['mae', 'mape', 'rmse']

    Returns:
        Dictionary mapping metric names to computed values.

    Raises:
        ValueError: If predictions and actuals have different lengths.

    Example:
        >>> pred = np.array([100, 105, 110])
        >>> actual = np.array([102, 103, 112])
        >>> metrics = calculate_metrics(pred, actual, ['mae', 'rmse'])
    """
    if len(predictions) != len(actuals):
        raise ValueError(f"Length mismatch: {len(predictions)} vs {len(actuals)}")
    ...
```
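For completeness, here is one way the elided body might be filled in — a minimal sketch using the standard definitions of MAE, RMSE, and MAPE (the metric formulas are an assumption, not taken from this guide):

```python
from typing import Dict, List

import numpy as np

def calculate_metrics(
    predictions: np.ndarray,
    actuals: np.ndarray,
    metric_types: List[str],
) -> Dict[str, float]:
    """Minimal sketch: compute the requested forecasting error metrics."""
    if len(predictions) != len(actuals):
        raise ValueError(f"Length mismatch: {len(predictions)} vs {len(actuals)}")
    errors = predictions - actuals
    # Standard definitions; adapt if your project defines these differently
    available = {
        'mae': float(np.mean(np.abs(errors))),
        'rmse': float(np.sqrt(np.mean(errors ** 2))),
        'mape': float(np.mean(np.abs(errors / actuals)) * 100),
    }
    unknown = set(metric_types) - set(available)
    if unknown:
        raise ValueError(f"Unknown metrics: {unknown}")
    return {name: available[name] for name in metric_types}

print(calculate_metrics(np.array([100., 105., 110.]),
                        np.array([102., 103., 112.]),
                        ['mae', 'rmse']))  # {'mae': 2.0, 'rmse': 2.0}
```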
### 3. Error Handling

Use specific exception types, not a bare `except:`. Log errors with context. Provide helpful messages.
```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

try:
    data = load_csv(filepath)  # project-specific loading helper
except FileNotFoundError:
    logger.error(f"Data file not found: {filepath}")
    raise
except pd.errors.EmptyDataError:
    logger.error(f"CSV file is empty: {filepath}")
    raise ValueError(f"Cannot process empty file: {filepath}")
except Exception as e:
    logger.exception(f"Unexpected error loading {filepath}")
    raise RuntimeError(f"Failed to load data: {e}") from e
```
### 4. Defensive Programming

Validate inputs early. Use assertions for invariants. Check edge cases (empty data, constant columns, nulls). Use defensive copies when modifying.
```python
import logging
from typing import List

import pandas as pd

logger = logging.getLogger(__name__)

def normalize_timeseries(
    data: pd.DataFrame,
    columns: List[str],
) -> pd.DataFrame:
    """Normalize specified columns to the [0, 1] range."""
    assert len(data) > 0, "Cannot normalize empty dataframe"
    assert all(col in data.columns for col in columns), \
        f"Missing columns: {set(columns) - set(data.columns)}"

    null_counts = data[columns].isnull().sum()
    if null_counts.any():
        logger.warning(f"Null values detected: {null_counts[null_counts > 0]}")

    result = data.copy()  # defensive copy: never mutate the caller's frame
    for col in columns:
        col_min, col_max = result[col].min(), result[col].max()
        if col_max == col_min:
            logger.warning(f"Column {col} is constant, setting to 0.5")
            result[col] = 0.5
        else:
            result[col] = (result[col] - col_min) / (col_max - col_min)
    return result
```
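A quick usage sketch of the defensive-copy behavior — the function is condensed here (no logging, no null checks) so the snippet runs standalone:

```python
import pandas as pd

def normalize_columns(data: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Condensed version of normalize_timeseries above."""
    result = data.copy()  # defensive copy: caller's frame stays untouched
    for col in columns:
        lo, hi = result[col].min(), result[col].max()
        result[col] = 0.5 if hi == lo else (result[col] - lo) / (hi - lo)
    return result

df = pd.DataFrame({'load': [10.0, 20.0, 30.0], 'flat': [7.0, 7.0, 7.0]})
out = normalize_columns(df, ['load', 'flat'])
print(out['load'].tolist())  # [0.0, 0.5, 1.0]
print(df['load'].tolist())   # original unchanged: [10.0, 20.0, 30.0]
```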
### 5. Pandas Best Practices

Use vectorized operations; avoid row loops. Chain operations when it stays clear. Use `.loc` / `.iloc` explicitly. Be explicit about copies vs. views.
```python
import pandas as pd

# GOOD: vectorized, explicit
df = (
    pd.read_csv('data.csv')
    .assign(
        timestamp=lambda x: pd.to_datetime(x['timestamp']),
        hour=lambda x: x['timestamp'].dt.hour,
        is_weekend=lambda x: x['timestamp'].dt.dayofweek >= 5,
    )
    .sort_values('timestamp')
    .reset_index(drop=True)
)
weekend_data = df.loc[df['is_weekend'], ['timestamp', 'load']]

# BAD: row loop with chained indexing
# for i in range(len(df)):
#     df['hour'][i] = ...  # SettingWithCopyWarning!
```
### 6. NumPy Efficiency

Preallocate when the size is known. Use appropriate dtypes (`float32` vs `float64`). Leverage broadcasting.
```python
import numpy as np

# Preallocate when the output size is known
n_samples = 1000
results = np.zeros(n_samples, dtype=np.float32)

# Broadcasting: subtract per-column means without a loop
data = np.random.rand(100, 10)
means = data.mean(axis=0)
normalized = data - means

# Axis reductions instead of Python loops
row_sums = large_array.sum(axis=1)
```
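To make the dtype and broadcasting points concrete, a small self-contained sketch:

```python
import numpy as np

# float32 halves memory relative to the default float64
a64 = np.zeros(1_000_000)                    # dtype defaults to float64
a32 = np.zeros(1_000_000, dtype=np.float32)
print(a64.nbytes, a32.nbytes)                # 8000000 4000000

# Broadcasting: (3, 4) minus (4,) subtracts per-column means from every row
data = np.arange(12, dtype=np.float64).reshape(3, 4)
centered = data - data.mean(axis=0)
print(centered.mean(axis=0))                 # every column mean is now 0.0
```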
### 7. Configuration Management

Use dataclasses or Pydantic for config. Load secrets from environment variables. Provide sensible defaults and validate in `__post_init__`.
```python
import os
from dataclasses import dataclass, field
from typing import List

@dataclass
class ForecastConfig:
    forecast_horizon: int = 24
    context_window_hours: int = 168
    models: List[str] = field(default_factory=lambda: ['gemini', 'modal'])
    ensemble_weights: List[float] = field(default_factory=lambda: [0.6, 0.4])
    gemini_api_key: str = field(default_factory=lambda: os.getenv('GEMINI_API_KEY', ''))

    def __post_init__(self):
        assert self.forecast_horizon > 0
        assert len(self.models) == len(self.ensemble_weights)
        assert abs(sum(self.ensemble_weights) - 1.0) < 1e-6
```
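Hypothetical usage showing that `__post_init__` rejects inconsistent configs at construction time — the class is restated in condensed form (without the API-key field) so the snippet runs standalone:

```python
from dataclasses import dataclass, field
from typing import List

# Condensed restatement of ForecastConfig; field names mirror the section above
@dataclass
class ForecastConfig:
    forecast_horizon: int = 24
    models: List[str] = field(default_factory=lambda: ['gemini', 'modal'])
    ensemble_weights: List[float] = field(default_factory=lambda: [0.6, 0.4])

    def __post_init__(self):
        assert self.forecast_horizon > 0
        assert len(self.models) == len(self.ensemble_weights)
        assert abs(sum(self.ensemble_weights) - 1.0) < 1e-6

cfg = ForecastConfig()                      # defaults pass validation
try:
    ForecastConfig(ensemble_weights=[0.9])  # length mismatch with models
except AssertionError:
    print("invalid config rejected at construction time")
```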
### 8. Code Organization

Order imports: standard library → third-party → local. Define constants at module level. No magic numbers; use named constants.
```python
# Standard imports first
import os
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# Third-party
import numpy as np
import pandas as pd
from loguru import logger

# Local
from .config import ForecastConfig

# Constants
DEFAULT_FORECAST_HORIZON = 24
MAX_CONTEXT_WINDOW = 720
```
## Checklist When Writing Code

- Always: type hints on function signatures, docstrings for public functions, explicit error handling with helpful messages, logging of important events and errors, input validation before processing
- Prefer: vectorized operations over loops; explicit over implicit (e.g. `.copy()` when modifying)
- Avoid: bare `except:`, magic numbers