Data Analysis Workflow
Run an end-to-end data analysis in R: load, explore, analyze, and produce publication-ready output.
Input: $ARGUMENTS — a dataset path (e.g., data/county_panel.csv) or a description of the analysis goal (e.g., "regress wages on education with state fixed effects using CPS data").
Constraints
- •Follow R code conventions in .claude/rules/r-code-conventions.md
- •Save all scripts to the appropriate code/[task_group]/ directory
- •Save all outputs to output/ (figures, tables, numbers subdirs)
- •Use saveRDS() for every computed object
- •Use project theme for all figures (check for custom theme in .claude/rules/)
- •Run r-reviewer on the generated script before presenting results
Workflow Phases
Phase 1: Setup and Data Loading
- •Read .claude/rules/r-code-conventions.md for project standards
- •Create R script with proper header (title, author, purpose, inputs, outputs)
- •Load required packages at top (library(), never require())
- •Set seed once at top in YYYYMMDD format: set.seed(20260211)
- •Load and inspect the dataset
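The setup and loading steps above could be sketched roughly as follows. The toy data and the temporary file are placeholders so the example runs on its own; a real script would read the project dataset (for example data/county_panel.csv) through a path relative to the repository root.

```r
# A real script would list library(tidyverse), library(fixest), etc. here
# (library(), never require()); this sketch needs only base R.

# Set the seed once, at the top, in YYYYMMDD format.
set.seed(20260211)

# Hypothetical stand-in for the real dataset: a toy CSV in a temp file.
data_path <- tempfile(fileext = ".csv")
toy <- data.frame(
  county = rep(c("A", "B", "C"), each = 4),
  year   = rep(2001:2004, times = 3),
  wage   = round(rnorm(12, mean = 20, sd = 3), 2)
)
write.csv(toy, data_path, row.names = FALSE)

# Load and inspect: structure, first rows, dimensions.
county_panel <- read.csv(data_path, stringsAsFactors = FALSE)
str(county_panel)
head(county_panel)
dim(county_panel)
```

Inspecting structure and dimensions immediately after loading makes type or parsing problems visible before any analysis depends on them.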
Phase 2: Exploratory Data Analysis
Generate diagnostic outputs:
- •Summary statistics: summary(), missingness rates, variable types
- •Distributions: Histograms for key continuous variables
- •Relationships: Scatter plots, correlation matrices
- •Time patterns: If panel data, plot trends over time
- •Group comparisons: If treatment/control, compare pre-treatment means
Save all diagnostic figures to output/figures/.
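A minimal base-R sketch of these diagnostics is shown below. The variable names (treated, wage_pre, educ) and the simulated data are assumptions for illustration only, and the histogram is written to a temporary directory rather than output/figures/ so the example runs standalone.

```r
set.seed(20260211)

# Toy data (hypothetical): treatment indicator, pre-period outcome, one covariate.
n <- 200
df <- data.frame(
  treated  = rep(c(0, 1), each = n / 2),
  wage_pre = rnorm(n, mean = 20, sd = 3),
  educ     = sample(8:18, n, replace = TRUE)
)
df$wage_pre[c(3, 50, 120)] <- NA  # inject a few missing values

# Summary statistics, variable types, and missingness rates.
summary(df)
var_types    <- vapply(df, function(x) class(x)[1], character(1))
missing_rate <- colMeans(is.na(df))

# Correlation matrix using pairwise-complete observations.
cor_mat <- cor(df, use = "pairwise.complete.obs")

# Pre-treatment means of the outcome by treatment group.
pre_means <- tapply(df$wage_pre, df$treated, mean, na.rm = TRUE)

# Histogram of a key continuous variable, written to a temp file here;
# a project script would save under output/figures/.
hist_path <- file.path(tempdir(), "hist_wage_pre.pdf")
pdf(hist_path, width = 6, height = 4)
hist(df$wage_pre, main = "Pre-period wage", xlab = "Wage (dollars per hour)")
dev.off()
```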
Phase 3: Main Analysis
Based on the research question:
- •Regression analysis: Use fixest for panel data, lm/glm for cross-section
- •Standard errors: Cluster at the appropriate level (document why)
- •Multiple specifications: Start simple, progressively add controls
- •Effect sizes: Report standardized effects alongside raw coefficients
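As one possible illustration of the points above, the sketch below fits three progressively richer specifications with fixest on a simulated state-year panel. The data, the variable names (wage, educ, exper), and the choice to cluster by state are all assumptions made for the example, not project requirements.

```r
library(fixest)

set.seed(20260211)

# Toy state-year panel (hypothetical variable names and data).
panel <- expand.grid(state = paste0("S", 1:20), year = 2001:2010,
                     stringsAsFactors = FALSE)
panel$educ  <- rnorm(nrow(panel), mean = 13, sd = 2)
panel$exper <- rnorm(nrow(panel), mean = 10, sd = 4)
panel$wage  <- 5 + 1.2 * panel$educ + 0.3 * panel$exper + rnorm(nrow(panel))

# Start simple, then add controls and fixed effects.
# Standard errors are clustered by state here on the assumption that
# observations within a state are correlated over time.
m1 <- feols(wage ~ educ, data = panel, cluster = ~state)
m2 <- feols(wage ~ educ + exper, data = panel, cluster = ~state)
m3 <- feols(wage ~ educ + exper | state + year, data = panel, cluster = ~state)

# Side-by-side comparison of the three specifications.
etable(m1, m2, m3)

# Standardized effect: change in wage (in SDs) per one-SD change in educ.
beta_raw <- coef(m3)[["educ"]]
beta_std <- beta_raw * sd(panel$educ) / sd(panel$wage)
```

Reporting beta_std next to beta_raw lets readers compare effect magnitudes across variables measured in different units.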
Phase 4: Publication-Ready Output
Tables:
- •Use modelsummary for regression tables (preferred) or stargazer
- •Include all standard elements: coefficients, SEs, significance stars, N, R-squared
- •Export as .tex for LaTeX inclusion and .html for quick viewing
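A rough sketch of the table export, assuming modelsummary is installed with its default table backend. The models use the built-in mtcars data as stand-ins for the project's regression objects, and out_dir is a hypothetical variable pointing at a temporary directory; a project script would write to output/tables/ instead.

```r
library(modelsummary)

# Two toy specifications on a built-in dataset, standing in for
# the project's regression objects.
models <- list(
  "Baseline" = lm(mpg ~ wt, data = mtcars),
  "Controls" = lm(mpg ~ wt + hp, data = mtcars)
)

# Hypothetical output directory; a project script would use output/tables/.
out_dir  <- tempdir()
tex_path  <- file.path(out_dir, "main_results.tex")
html_path <- file.path(out_dir, "main_results.html")

# The output format is inferred from the file extension.
# Coefficients with standard errors, stars, N, and R-squared are included.
modelsummary(models, stars = TRUE, gof_map = c("nobs", "r.squared"),
             output = tex_path)
modelsummary(models, stars = TRUE, gof_map = c("nobs", "r.squared"),
             output = html_path)
```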
Figures:
- •Use ggplot2 with project theme
- •Set bg = "transparent" for LaTeX compatibility
- •Include proper axis labels (sentence case, units)
- •Export with explicit dimensions: ggsave(width = X, height = Y)
- •Save as both .pdf and .png
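The figure conventions above might be applied as in this sketch. theme_minimal() stands in for the project theme, which is not shown in this document, and the 6 x 4 inch dimensions, file name, and temporary out_dir are illustrative assumptions.

```r
library(ggplot2)

# theme_minimal() is a placeholder for the project theme.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1,000 lbs)", y = "Fuel economy (miles per gallon)") +
  theme_minimal()

# Hypothetical output directory; a project script would use output/figures/.
out_dir <- tempdir()

# Same plot in both formats, explicit dimensions (inches), transparent background.
fig_paths <- file.path(out_dir, paste0("mpg_by_weight.", c("pdf", "png")))
for (path in fig_paths) {
  ggsave(path, plot = p, width = 6, height = 4, bg = "transparent")
}
```

Passing the plot object explicitly, rather than relying on the last plot drawn, keeps the export step independent of whatever was printed earlier in the script.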
Phase 5: Save and Review
- •saveRDS() for all key objects (regression results, summary tables, processed data)
- •Rely on the Makefile for directory creation (do not call dir.create() in scripts)
- •Run the r-reviewer agent on the generated script:
```
Delegate to the r-reviewer agent: "Review the script at code/[task_group]/[script_name].R"
```
- •Address any Critical or High issues from the review.
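The saveRDS() step in this phase could look like the following sketch. The objects and file names are hypothetical, and a temporary directory replaces output/ so the example runs without the Makefile having created any directories.

```r
# Toy objects standing in for regression results and a summary table.
fit <- lm(mpg ~ wt, data = mtcars)
summary_tbl <- data.frame(
  variable = c("mpg", "wt"),
  mean     = c(mean(mtcars$mpg), mean(mtcars$wt))
)

# Hypothetical output directory; a project script would write under output/,
# which the Makefile is expected to have created already.
out_dir <- tempdir()

# Save each key object under its own name.
objects_to_save <- list(fit_main = fit, summary_tbl = summary_tbl)
for (nm in names(objects_to_save)) {
  saveRDS(objects_to_save[[nm]], file.path(out_dir, paste0(nm, ".rds")))
}

# Reload one object to confirm the round trip.
fit_reloaded <- readRDS(file.path(out_dir, "fit_main.rds"))
all.equal(coef(fit), coef(fit_reloaded))
```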
Script Structure
Follow this template:
```r
# ============================================================
# [Descriptive Title]
# Author: [from project context]
# Purpose: [What this script does]
# Inputs: [Data files]
# Outputs: [Figures, tables, RDS files]
# ============================================================

# 0. Setup ----
library(tidyverse)
library(fixest)
library(modelsummary)

set.seed(20260211) # YYYYMMDD format per r-code-conventions

# Note: output directories are created by the Makefile, not the script

# 1. Data Loading ----
# [Load and clean data]

# 2. Exploratory Analysis ----
# [Summary stats, diagnostic plots]

# 3. Main Analysis ----
# [Regressions, estimation]

# 4. Tables and Figures ----
# [Publication-ready output]

# 5. Export ----
# [saveRDS for all objects, ggsave for all figures]
```
Important
- •Reproduce, don't guess. If the user specifies a regression, run exactly that.
- •Show your work. Print summary statistics before jumping to regression.
- •Check for issues. Look for multicollinearity, outliers, perfect prediction.
- •Use relative paths. All paths relative to repository root.
- •No hardcoded values. Use variables for sample restrictions, date ranges, etc.
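Two of the points above, named variables for sample restrictions and basic checks for multicollinearity and outliers, can be sketched in base R as follows. The data are simulated, and the year window and the 3-SD outlier cutoff are hypothetical values chosen only for the example.

```r
set.seed(20260211)

# Sample restrictions live in named variables, not inline literals.
year_min <- 2005
year_max <- 2015
outlier_sd_cutoff <- 3  # hypothetical threshold for flagging outliers

# Toy data (hypothetical); exper is deliberately collinear with educ.
df <- data.frame(
  year = sample(2000:2020, 300, replace = TRUE),
  educ = rnorm(300, mean = 13, sd = 2)
)
df$exper <- 0.9 * df$educ + rnorm(300, sd = 0.5)
df$wage  <- 5 + df$educ + rnorm(300)

analysis_df <- subset(df, year >= year_min & year <= year_max)

# Variance inflation factor for educ: 1 / (1 - R^2), where R^2 comes from
# regressing educ on the other regressors.
r2_educ  <- summary(lm(educ ~ exper, data = analysis_df))$r.squared
vif_educ <- 1 / (1 - r2_educ)

# Flag observations more than outlier_sd_cutoff SDs from the mean wage.
z_wage   <- as.numeric(scale(analysis_df$wage))
outliers <- analysis_df[abs(z_wage) > outlier_sd_cutoff, ]
```

A large vif_educ signals that educ and exper carry nearly the same information, which is worth noting before interpreting either coefficient.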