Data Analysis Workflow

Run an end-to-end data analysis in R: load, explore, analyze, and produce publication-ready output.

Input: A dataset path (e.g., data/county_panel.csv) or a description of the analysis goal.

Constraints

•Follow R code conventions in code/AGENTS.md (Codex) or .claude/rules/r-code-conventions.md (Claude)
•Save all scripts to the appropriate code/[task_group]/ directory
•Save all outputs to output/ (figures, tables, numbers subdirs)
•Use saveRDS() for every computed object
•Use project theme for all figures
•Run /review-r on the generated script before presenting results

Workflow Phases

Phase 1: Setup and Data Loading

•Read code conventions for project standards
•Create R script with proper header (title, author, purpose, inputs, outputs)
•Load required packages at top (library(), never require())
•Set seed once at top in YYYYMMDD format: set.seed(20260211)
•Load and inspect the dataset

Phase 2: Exploratory Data Analysis

Generate diagnostic outputs:

•Summary statistics: summary(), missingness rates, variable types
•Distributions: Histograms for key continuous variables
•Relationships: Scatter plots, correlation matrices
•Time patterns: If panel data, plot trends over time
•Group comparisons: If treatment/control, compare pre-treatment means

Save all diagnostic figures to output/figures/.

Phase 3: Main Analysis

Based on the research question:

•Regression analysis: Use fixest for panel data, lm/glm for cross-section
•Standard errors: Cluster at the appropriate level (document why)
•Multiple specifications: Start simple, progressively add controls
•Effect sizes: Report standardized effects alongside raw coefficients

Phase 4: Publication-Ready Output

Tables:

•Use modelsummary for regression tables (preferred) or stargazer
•Include all standard elements: coefficients, SEs, significance stars, N, R-squared
•Export as .tex for LaTeX inclusion and .html for quick viewing

Figures:

•Use ggplot2 with project theme
•Set bg = "transparent" for LaTeX compatibility
•Include proper axis labels (sentence case, units)
•Export with explicit dimensions: ggsave(width = X, height = Y)
•Save as both .pdf and .png

Phase 5: Save and Review

•saveRDS() for all key objects
•Rely on the Makefile for directory creation (do not call dir.create() in scripts)
•Run /review-r on the generated script
•Address any Critical or High issues from the review

Script Structure

# ============================================================
# [Descriptive Title]
# Author: [from project context]
# Purpose: [What this script does]
# Inputs: [Data files]
# Outputs: [Figures, tables, RDS files]
# ============================================================

# 0. Setup ----
library(tidyverse)
library(fixest)
library(modelsummary)

set.seed(20260211)  # YYYYMMDD format

# Note: output directories are created by the Makefile, not the script

# 1. Data Loading ----
# 2. Exploratory Analysis ----
# 3. Main Analysis ----
# 4. Tables and Figures ----
# 5. Export ----

Important

•Reproduce, don't guess. If the user specifies a regression, run exactly that.
•Show your work. Print summary statistics before jumping to regression.
•Check for issues. Look for multicollinearity, outliers, perfect prediction.
•Use relative paths. All paths relative to repository root.
•No hardcoded values. Use variables for sample restrictions, date ranges, etc.