# Python Data Science Expert

## When to Apply

- Writing Python for data analysis, forecasting, or scientific computing
- Reviewing data science or ML pipeline code
- Adding type hints, docstrings, or error handling to data code

## Code Style & Structure

### 1. Type Hints Everywhere

Use `typing` extensively: `List`, `Dict`, `Optional`, `Union`, `Tuple`. Use dataclasses for structured data. Never omit return type hints.
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple

import numpy as np
import pandas as pd

@dataclass
class TimeSeriesPoint:
    timestamp: datetime
    value: float
    confidence: Optional[float] = None

def process_data(
    df: pd.DataFrame,
    window_size: int = 24,
) -> Tuple[np.ndarray, np.ndarray]:
    """Process time series with rolling window."""
    ...
```
### 2. Docstrings (Google Style)

Every public function needs a docstring with Args, Returns, Raises, and an Example when helpful.
```python
from typing import Dict, List

import numpy as np

def calculate_metrics(
    predictions: np.ndarray,
    actuals: np.ndarray,
    metric_types: List[str],
) -> Dict[str, float]:
    """Calculate forecasting error metrics.

    Args:
        predictions: Predicted values array (n_samples,)
        actuals: Actual observed values (n_samples,)
        metric_types: List of metrics to compute, e.g. ['mae', 'mape', 'rmse']

    Returns:
        Dictionary mapping metric names to computed values.

    Raises:
        ValueError: If predictions and actuals have different lengths.

    Example:
        >>> pred = np.array([100, 105, 110])
        >>> actual = np.array([102, 103, 112])
        >>> metrics = calculate_metrics(pred, actual, ['mae', 'rmse'])
    """
    if len(predictions) != len(actuals):
        raise ValueError(f"Length mismatch: {len(predictions)} vs {len(actuals)}")
    ...
```
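For completeness, here is one way the elided body might be filled in — a minimal sketch using the standard definitions of MAE, RMSE, and MAPE (the metric formulas are an assumption, not taken from this guide):

```python
from typing import Dict, List

import numpy as np

def calculate_metrics(
    predictions: np.ndarray,
    actuals: np.ndarray,
    metric_types: List[str],
) -> Dict[str, float]:
    """Minimal sketch: compute the requested forecasting error metrics."""
    if len(predictions) != len(actuals):
        raise ValueError(f"Length mismatch: {len(predictions)} vs {len(actuals)}")
    errors = predictions - actuals
    # Standard definitions; adapt if your project defines these differently
    available = {
        'mae': float(np.mean(np.abs(errors))),
        'rmse': float(np.sqrt(np.mean(errors ** 2))),
        'mape': float(np.mean(np.abs(errors / actuals)) * 100),
    }
    unknown = set(metric_types) - set(available)
    if unknown:
        raise ValueError(f"Unknown metrics: {unknown}")
    return {name: available[name] for name in metric_types}

print(calculate_metrics(np.array([100., 105., 110.]),
                        np.array([102., 103., 112.]),
                        ['mae', 'rmse']))  # {'mae': 2.0, 'rmse': 2.0}
```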
### 3. Error Handling

Use specific exception types, not a bare `except:`. Log errors with context. Provide helpful messages.
```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

try:
    data = load_csv(filepath)  # project-specific loading helper
except FileNotFoundError:
    logger.error(f"Data file not found: {filepath}")
    raise
except pd.errors.EmptyDataError:
    logger.error(f"CSV file is empty: {filepath}")
    raise ValueError(f"Cannot process empty file: {filepath}")
except Exception as e:
    logger.exception(f"Unexpected error loading {filepath}")
    raise RuntimeError(f"Failed to load data: {e}") from e
```
### 4. Defensive Programming

Validate inputs early. Use assertions for invariants. Check edge cases (empty data, constant columns, nulls). Use defensive copies when modifying.
```python
import logging
from typing import List

import pandas as pd

logger = logging.getLogger(__name__)

def normalize_timeseries(
    data: pd.DataFrame,
    columns: List[str],
) -> pd.DataFrame:
    """Normalize specified columns to the [0, 1] range."""
    assert len(data) > 0, "Cannot normalize empty dataframe"
    assert all(col in data.columns for col in columns), \
        f"Missing columns: {set(columns) - set(data.columns)}"

    null_counts = data[columns].isnull().sum()
    if null_counts.any():
        logger.warning(f"Null values detected: {null_counts[null_counts > 0]}")

    result = data.copy()  # defensive copy: never mutate the caller's frame
    for col in columns:
        col_min, col_max = result[col].min(), result[col].max()
        if col_max == col_min:
            logger.warning(f"Column {col} is constant, setting to 0.5")
            result[col] = 0.5
        else:
            result[col] = (result[col] - col_min) / (col_max - col_min)
    return result
```
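A quick usage sketch of the defensive-copy behavior — the function is condensed here (no logging, no null checks) so the snippet runs standalone:

```python
import pandas as pd

def normalize_columns(data: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Condensed version of normalize_timeseries above."""
    result = data.copy()  # defensive copy: caller's frame stays untouched
    for col in columns:
        lo, hi = result[col].min(), result[col].max()
        result[col] = 0.5 if hi == lo else (result[col] - lo) / (hi - lo)
    return result

df = pd.DataFrame({'load': [10.0, 20.0, 30.0], 'flat': [7.0, 7.0, 7.0]})
out = normalize_columns(df, ['load', 'flat'])
print(out['load'].tolist())  # [0.0, 0.5, 1.0]
print(df['load'].tolist())   # original unchanged: [10.0, 20.0, 30.0]
```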
### 5. Pandas Best Practices

Use vectorized operations; avoid row loops. Chain operations when it stays clear. Use `.loc` / `.iloc` explicitly. Be explicit about copies vs. views.
```python
import pandas as pd

# GOOD: vectorized, explicit
df = (
    pd.read_csv('data.csv')
    .assign(
        timestamp=lambda x: pd.to_datetime(x['timestamp']),
        hour=lambda x: x['timestamp'].dt.hour,
        is_weekend=lambda x: x['timestamp'].dt.dayofweek >= 5,
    )
    .sort_values('timestamp')
    .reset_index(drop=True)
)
weekend_data = df.loc[df['is_weekend'], ['timestamp', 'load']]

# BAD: row loop with chained indexing
# for i in range(len(df)):
#     df['hour'][i] = ...  # SettingWithCopyWarning!
```
### 6. NumPy Efficiency

Preallocate when the size is known. Use appropriate dtypes (`float32` vs `float64`). Leverage broadcasting.
```python
import numpy as np

# Preallocate when the output size is known
n_samples = 1000
results = np.zeros(n_samples, dtype=np.float32)

# Broadcasting: subtract per-column means without a loop
data = np.random.rand(100, 10)
means = data.mean(axis=0)
normalized = data - means

# Axis reductions instead of Python loops
row_sums = large_array.sum(axis=1)
```
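To make the dtype and broadcasting points concrete, a small self-contained sketch:

```python
import numpy as np

# float32 halves memory relative to the default float64
a64 = np.zeros(1_000_000)                    # dtype defaults to float64
a32 = np.zeros(1_000_000, dtype=np.float32)
print(a64.nbytes, a32.nbytes)                # 8000000 4000000

# Broadcasting: (3, 4) minus (4,) subtracts per-column means from every row
data = np.arange(12, dtype=np.float64).reshape(3, 4)
centered = data - data.mean(axis=0)
print(centered.mean(axis=0))                 # every column mean is now 0.0
```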
### 7. Configuration Management

Use dataclasses or Pydantic for config. Load secrets from environment variables. Provide sensible defaults and validate in `__post_init__`.
```python
import os
from dataclasses import dataclass, field
from typing import List

@dataclass
class ForecastConfig:
    forecast_horizon: int = 24
    context_window_hours: int = 168
    models: List[str] = field(default_factory=lambda: ['gemini', 'modal'])
    ensemble_weights: List[float] = field(default_factory=lambda: [0.6, 0.4])
    gemini_api_key: str = field(default_factory=lambda: os.getenv('GEMINI_API_KEY', ''))

    def __post_init__(self):
        assert self.forecast_horizon > 0
        assert len(self.models) == len(self.ensemble_weights)
        assert abs(sum(self.ensemble_weights) - 1.0) < 1e-6
```
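Hypothetical usage showing that `__post_init__` rejects inconsistent configs at construction time — the class is restated in condensed form (without the API-key field) so the snippet runs standalone:

```python
from dataclasses import dataclass, field
from typing import List

# Condensed restatement of ForecastConfig; field names mirror the section above
@dataclass
class ForecastConfig:
    forecast_horizon: int = 24
    models: List[str] = field(default_factory=lambda: ['gemini', 'modal'])
    ensemble_weights: List[float] = field(default_factory=lambda: [0.6, 0.4])

    def __post_init__(self):
        assert self.forecast_horizon > 0
        assert len(self.models) == len(self.ensemble_weights)
        assert abs(sum(self.ensemble_weights) - 1.0) < 1e-6

cfg = ForecastConfig()                      # defaults pass validation
try:
    ForecastConfig(ensemble_weights=[0.9])  # length mismatch with models
except AssertionError:
    print("invalid config rejected at construction time")
```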
### 8. Code Organization

Order imports: standard library → third-party → local. Define constants at module level. No magic numbers; use named constants.
```python
# Standard imports first
import os
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# Third-party
import numpy as np
import pandas as pd
from loguru import logger

# Local
from .config import ForecastConfig

# Constants
DEFAULT_FORECAST_HORIZON = 24
MAX_CONTEXT_WINDOW = 720
```
## Checklist When Writing Code

- Always: type hints on function signatures, docstrings for public functions, explicit error handling with helpful messages, logging of important events and errors, input validation before processing
- Prefer: vectorized operations over loops; explicit over implicit (e.g. `.copy()` when modifying)
- Avoid: bare `except:`, magic numbers