Code Review
Review Python code in *.py files and Jupyter notebook code cells for quality and correctness.
Exclude: _external/ directory.
Check For
- •Bugs: Logic errors, off-by-one, null/None handling, edge cases
- •Clarity: Unclear variable names, overly complex logic, missing context
- •Style: Inconsistent formatting, dead code, unused imports, imports not at top of module, premature line breaks (use 120 char limit)
- •Performance: Unnecessary loops, repeated computations, inefficient patterns
- •Security: Hardcoded secrets, injection risks, unsafe operations
- •Docstrings: Missing or incomplete NumPy-style docstrings
Import Organization
Structure:
# Standard Library import inspect from pathlib import Path # Third-party packages from IPython.display import Code import pandas as pd import numpy as np # Local imports from online_retail_simulator import simulate, load_job_results from support import plot_revenue_by_category
Rules:
- •Group imports by category with comments
- •Alphabetize within groups
- •Single import: no parentheses (keep on one line)
- •Multiple imports: use parentheses with one per line
- •Never use
import *
Single Import:
# Good
from online_retail_simulator.simulate.product_details_mock import simulate_product_details_mock
# Bad - unnecessary parentheses
from online_retail_simulator.simulate.product_details_mock import (
simulate_product_details_mock,
)
Multiple Imports:
# Good
from support import (
plot_revenue_by_category,
plot_daily_metrics_trend,
plot_conversion_funnel,
)
# Bad - too long, hard to read
from support import plot_revenue_by_category, plot_daily_metrics_trend, plot_conversion_funnel
NumPy Docstring Requirements
Module Level
Every Python file should have a module-level docstring explaining its purpose.
Functions
All public functions must have docstrings with these sections (as applicable):
def function_name(param1, param2):
"""
Short one-line summary.
Longer description if needed.
Parameters
----------
param1 : type
Description of param1.
param2 : type, optional
Description of param2. Default is X.
Returns
-------
type
Description of return value.
Raises
------
ExceptionType
When and why this exception is raised.
"""
Classes
All classes must have docstrings with Parameters and Attributes sections.
Style Notes
- •Use imperative mood ("Calculate X" not "Calculates X")
- •Keep summary line under 79 characters
- •Use
optionalfor parameters with defaults
Variable Naming
Use Descriptive Names
# Good
daily_sales = sales.groupby("date").agg({"ordered_units": "sum"})
category_revenue = sales.groupby("category")["revenue"].sum()
enriched_products = products[products["enriched"] == True]
# Bad
ds = sales.groupby("date").agg({"ordered_units": "sum"})
cr = sales.groupby("category")["revenue"].sum()
DataFrame Naming Convention
- •Raw data:
products,sales,customers - •Aggregated:
daily_sales,category_revenue,product_summary - •Filtered:
enriched_products,high_value_customers - •Temporary: Use descriptive names, not
df,temp,data
Avoid Reusing Variable Names
# Bad products = results["products"] products = products[products["price"] > 100] # Overwrites original # Good products = results["products"] high_price_products = products[products["price"] > 100]
Comments and Documentation
Comment the WHY, not the WHAT
# Good # Set seed for reproducible random product selection random.seed(42) # Bad # Set random seed to 42 random.seed(42)
Data Operations
Chain Operations Thoughtfully
# Good (readable chain)
top_products = (
sales
.groupby("product_identifier")["revenue"]
.sum()
.sort_values(ascending=False)
.head(10)
)
# Bad (too long)
result = sales.groupby("product_identifier")["revenue"].sum().sort_values(ascending=False).head(10).reset_index().merge(products, on="product_identifier")
Explicit Column Selection
# Good product_cols = ["product_identifier", "category", "price"] sample_product = products[product_cols].head(1) # Bad sample_product = products.head(1) # What columns are we using?
Print Statements and Output
Formatted Output
# Good
print("=" * 40)
print("DATA SUMMARY")
print("=" * 40)
print(f"Date range: {sales['date'].min()} to {sales['date'].max()}")
print(f"Categories: {sales['category'].nunique()}")
print(f"Total revenue: ${sales['revenue'].sum():,.2f}")
print("=" * 40)
# Bad
print("Date range:", sales['date'].min(), "to", sales['date'].max())
print("Categories:", sales['category'].nunique())
Use f-strings
- •
f"Revenue: ${revenue:,.2f}" - •Not
"Revenue: $" + str(revenue) - •Not
"Revenue: ${}".format(revenue)
Anti-Patterns to Avoid
Mutation Without Clarity
# Bad: Modifying original DataFrame sales["new_column"] = sales["revenue"] * 1.1 # Good: Make intent clear sales = sales.copy() # If mutation intended # OR sales_with_adjustment = sales.assign(new_column=lambda x: x["revenue"] * 1.1)
Hardcoded Values
# Bad high_price = products[products["price"] > 500] # Good PRICE_THRESHOLD = 500 # Define at top of cell or notebook high_price = products[products["price"] > PRICE_THRESHOLD]
Unused Variables
# Bad results = load_job_results(job_info) products = results["products"] sales = results["sales"] metadata = results["metadata"] # Never used
Print Debugging Left In
# Bad
print("DEBUG: products shape:", products.shape) # Remove before committing
Reproducibility
Always Set Seeds
import random random.seed(42) import numpy as np np.random.seed(42) # For pandas sampling sample = df.sample(n=100, random_state=42)
Notebook-Specific Standards
Import Placement in Lecture Notebooks
For Measure Impact lectures that follow the Theory → Application structure, imports should not appear at the beginning of the notebook. Instead, place all imports at the start of Part II (Application) to keep Part I (Theory) completely clean of code.
Exception: The Directed Acyclic Graphs lecture includes simulation code in Part I (Theory) to demonstrate collider bias with the police force example. This is intentional—the simulation reinforces a key theoretical point that benefits from immediate hands-on demonstration.
# Good - imports at start of Application section ## Part II: Application # First code cell of Part II # Standard Library import inspect # Third-party packages import pandas as pd # Local imports from online_retail_simulator import simulate
# Bad - imports at notebook start pollute Theory section # Cell 1: Imports (before any theory content) import pandas as pd from online_retail_simulator import simulate # ... Theory section with no code ...
Rationale: The Theory section should be pure exposition—definitions, notation, and intuition—without any code distractions. Code only enters when we begin the hands-on Application.
One Logical Operation Per Cell
# Cell 1: Run simulation
job_info = simulate("config_simulation.yaml")
# Cell 2: Load results results = load_job_results(job_info) products = results["products"] sales = results["sales"]
Cell Complexity Limits
- •Analysis cells: 5-10 lines
- •Data preparation: 10-15 lines
- •Complex operations: 15-20 lines (break into helper function if longer)
Cell Execution Order
- •Cells should be executable top-to-bottom in sequence
- •No "skip this cell" or "run cell 5 before cell 3" instructions
Inline Shell Commands
Use !cat for displaying file contents (config files, prompts, etc.):
# Good - clear and concise for displaying files ! cat "config_simulation.yaml" ! cat prompt_budget.txt
Avoid shell commands when Python equivalents exist:
# Bad - use Python for file operations
files = ! ls output/
# Good
from pathlib import Path
output_files = list(Path("output").glob("*"))
Output Format
For each issue found:
- •File: path and line number
- •Issue: brief description
- •Suggestion: how to fix it
Checklist
Before finalizing code:
- • Imports organized and grouped by category
- • Single imports: no parentheses; Multiple imports: use parentheses
- • Variable names are descriptive
- • Comments explain WHY, not WHAT
- • Print output is formatted for readability
- • No hardcoded magic numbers
- • Seeds set for reproducibility
- • No unused variables or imports
- • No debug print statements left in
- • DataFrame operations are clear and explicit
- • NumPy-style docstrings on all public functions
Verification
After making changes, always run:
git status hatch run ruff format . hatch run ruff check . hatch run build
This checks for untracked files, formats code, checks for linting issues, builds the documentation and executes all notebooks, confirming that code changes don't break anything.
Success criteria:
- • All new files are under version control (
git statusshows no untracked files that should be committed) - •
hatch run ruff format .completes without changes - •
hatch run ruff check .passes with no errors - •
hatch run buildcompletes successfully