Code Review

Review Python code in *.py files and Jupyter notebook code cells for quality and correctness.

Exclude: _external/ directory.

Check For

•Bugs: Logic errors, off-by-one, null/None handling, edge cases
•Clarity: Unclear variable names, overly complex logic, missing context
•Style: Inconsistent formatting, dead code, unused imports, imports not at top of module, premature line breaks (use 120 char limit)
•Performance: Unnecessary loops, repeated computations, inefficient patterns
•Security: Hardcoded secrets, injection risks, unsafe operations
•Docstrings: Missing or incomplete NumPy-style docstrings

Import Organization

Structure:

python

# Standard Library
import inspect
from pathlib import Path

# Third-party packages
from IPython.display import Code
import pandas as pd
import numpy as np

# Local imports
from online_retail_simulator import simulate, load_job_results
from support import plot_revenue_by_category

Rules:

•Group imports by category with comments
•Alphabetize within groups
•Single import: no parentheses (keep on one line)
•Multiple imports: use parentheses with one per line
•Never use import *

Single Import:

python

# Good
from online_retail_simulator.simulate.product_details_mock import simulate_product_details_mock

# Bad - unnecessary parentheses
from online_retail_simulator.simulate.product_details_mock import (
    simulate_product_details_mock,
)

Multiple Imports:

python

# Good
from support import (
    plot_revenue_by_category,
    plot_daily_metrics_trend,
    plot_conversion_funnel,
)

# Bad - too long, hard to read
from support import plot_revenue_by_category, plot_daily_metrics_trend, plot_conversion_funnel

NumPy Docstring Requirements

Module Level

Every Python file should have a module-level docstring explaining its purpose.

Functions

All public functions must have docstrings with these sections (as applicable):

python

def function_name(param1, param2):
    """
    Short one-line summary.

    Longer description if needed.

    Parameters
    ----------
    param1 : type
        Description of param1.
    param2 : type, optional
        Description of param2. Default is X.

    Returns
    -------
    type
        Description of return value.

    Raises
    ------
    ExceptionType
        When and why this exception is raised.
    """

Classes

All classes must have docstrings with Parameters and Attributes sections.

Style Notes

•Use imperative mood ("Calculate X" not "Calculates X")
•Keep summary line under 79 characters
•Use optional for parameters with defaults

Variable Naming

Use Descriptive Names

python

# Good
daily_sales = sales.groupby("date").agg({"ordered_units": "sum"})
category_revenue = sales.groupby("category")["revenue"].sum()
enriched_products = products[products["enriched"] == True]

# Bad
ds = sales.groupby("date").agg({"ordered_units": "sum"})
cr = sales.groupby("category")["revenue"].sum()

DataFrame Naming Convention

•Raw data: products, sales, customers
•Aggregated: daily_sales, category_revenue, product_summary
•Filtered: enriched_products, high_value_customers
•Temporary: Use descriptive names, not df, temp, data

Avoid Reusing Variable Names

python

# Bad
products = results["products"]
products = products[products["price"] > 100]  # Overwrites original

# Good
products = results["products"]
high_price_products = products[products["price"] > 100]

Comments and Documentation

Comment the WHY, not the WHAT

python

# Good
# Set seed for reproducible random product selection
random.seed(42)

# Bad
# Set random seed to 42
random.seed(42)

Data Operations

Chain Operations Thoughtfully

python

# Good (readable chain)
top_products = (
    sales
    .groupby("product_identifier")["revenue"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)

# Bad (too long)
result = sales.groupby("product_identifier")["revenue"].sum().sort_values(ascending=False).head(10).reset_index().merge(products, on="product_identifier")

Explicit Column Selection

python

# Good
product_cols = ["product_identifier", "category", "price"]
sample_product = products[product_cols].head(1)

# Bad
sample_product = products.head(1)  # What columns are we using?

Print Statements and Output

Formatted Output

python

# Good
print("=" * 40)
print("DATA SUMMARY")
print("=" * 40)
print(f"Date range:      {sales['date'].min()} to {sales['date'].max()}")
print(f"Categories:      {sales['category'].nunique()}")
print(f"Total revenue:   ${sales['revenue'].sum():,.2f}")
print("=" * 40)

# Bad
print("Date range:", sales['date'].min(), "to", sales['date'].max())
print("Categories:", sales['category'].nunique())

Use f-strings

•f"Revenue: ${revenue:,.2f}"
•Not "Revenue: $" + str(revenue)
•Not "Revenue: ${}".format(revenue)

Anti-Patterns to Avoid

Mutation Without Clarity

python

# Bad: Modifying original DataFrame
sales["new_column"] = sales["revenue"] * 1.1

# Good: Make intent clear
sales = sales.copy()  # If mutation intended
# OR
sales_with_adjustment = sales.assign(new_column=lambda x: x["revenue"] * 1.1)

Hardcoded Values

python

# Bad
high_price = products[products["price"] > 500]

# Good
PRICE_THRESHOLD = 500  # Define at top of cell or notebook
high_price = products[products["price"] > PRICE_THRESHOLD]

Unused Variables

python

# Bad
results = load_job_results(job_info)
products = results["products"]
sales = results["sales"]
metadata = results["metadata"]  # Never used

Print Debugging Left In

python

# Bad
print("DEBUG: products shape:", products.shape)  # Remove before committing

Reproducibility

Always Set Seeds

python

import random
random.seed(42)

import numpy as np
np.random.seed(42)

# For pandas sampling
sample = df.sample(n=100, random_state=42)

Notebook-Specific Standards

Import Placement in Lecture Notebooks

For Measure Impact lectures that follow the Theory → Application structure, imports should not appear at the beginning of the notebook. Instead, place all imports at the start of Part II (Application) to keep Part I (Theory) completely clean of code.

Exception: The Directed Acyclic Graphs lecture includes simulation code in Part I (Theory) to demonstrate collider bias with the police force example. This is intentional—the simulation reinforces a key theoretical point that benefits from immediate hands-on demonstration.

python

# Good - imports at start of Application section
## Part II: Application

# First code cell of Part II
# Standard Library
import inspect

# Third-party packages
import pandas as pd

# Local imports
from online_retail_simulator import simulate

python

# Bad - imports at notebook start pollute Theory section
# Cell 1: Imports (before any theory content)
import pandas as pd
from online_retail_simulator import simulate

# ... Theory section with no code ...

Rationale: The Theory section should be pure exposition—definitions, notation, and intuition—without any code distractions. Code only enters when we begin the hands-on Application.

One Logical Operation Per Cell

python

# Cell 1: Run simulation
job_info = simulate("config_simulation.yaml")

python

# Cell 2: Load results
results = load_job_results(job_info)
products = results["products"]
sales = results["sales"]

Cell Complexity Limits

•Analysis cells: 5-10 lines
•Data preparation: 10-15 lines
•Complex operations: 15-20 lines (break into helper function if longer)

Cell Execution Order

•Cells should be executable top-to-bottom in sequence
•No "skip this cell" or "run cell 5 before cell 3" instructions

Inline Shell Commands

Use !cat for displaying file contents (config files, prompts, etc.):

python

# Good - clear and concise for displaying files
! cat "config_simulation.yaml"
! cat prompt_budget.txt

Avoid shell commands when Python equivalents exist:

python

# Bad - use Python for file operations
files = ! ls output/

# Good
from pathlib import Path
output_files = list(Path("output").glob("*"))

Output Format

For each issue found:

•File: path and line number
•Issue: brief description
•Suggestion: how to fix it

Checklist

Before finalizing code:

Verification

After making changes, always run:

bash

git status
hatch run ruff format .
hatch run ruff check .
hatch run build

This checks for untracked files, formats code, checks for linting issues, builds the documentation and executes all notebooks, confirming that code changes don't break anything.

Success criteria:

• All new files are under version control (git status shows no untracked files that should be committed)
• hatch run ruff format . completes without changes
• hatch run ruff check . passes with no errors
• hatch run build completes successfully