Skill: Vega-Lite Analysis Pipeline
| Attribute | Value |
|---|---|
| Date | 2026-01-30 |
| Project | ProjectScylla |
| Objective | Build a complete analysis pipeline for experiment results with text-based figures and statistical tables |
| Outcome | ✅ Success - 15 figures + 7 tables, all outputs text-based and version-controllable |
| PR | #213 (merged) |
| LOC Added | 6,229 lines |
When to Use This Skill
Use this approach when you need to:
- •
Generate publication-quality figures for research papers
- •Figures must be version-controllable (text, not binary)
- •Need both interactive viewing and paper rendering (PNG/PDF/SVG)
- •Want consistent styling across all visualizations
- •
Create statistical analysis pipelines for experiment data
- •Multiple data formats (runs, judges, criteria, subtests)
- •Complex hierarchical data structures (experiments → tiers → subtests → runs)
- •Need comprehensive statistical tests (bootstrap CI, Mann-Whitney U, etc.)
- •
Avoid gitignore conflicts with generated outputs
- •Figures and tables are generated at runtime, not committed
- •Raw data stays portable (CSV format)
- •Specs are text (JSON), not images
- •
Support multiple output consumers
- •Paper authors (need LaTeX tables, PDF figures)
- •Data analysts (need CSV exports)
- •Web viewers (need interactive Vega-Lite specs)
- •Automation (need consistent APIs)
Verified Workflow
1. Architecture: Text-Based Outputs Strategy
Key Decision: Use Vega-Lite JSON specifications as the primary figure format.
Rationale:
- •✅ Text-based (JSON) → version-controllable
- •✅ Portable → works in Vega Editor, Python, R, web browsers
- •✅ Self-contained → data can be inlined or separate CSV
- •✅ Multi-format rendering → PNG/PDF/SVG via
vl-convert - •✅ Interactive → tooltips, zoom, pan in web viewers
Implementation:
# Use altair (Python Vega-Lite API) to build specs programmatically
import altair as alt
chart = alt.Chart(data).mark_bar().encode(
x='tier:O',
y='pass_rate:Q',
color='model:N'
)
# Save as JSON spec
chart.save("figure.vl.json")
# Also save data separately
data.to_csv("figure.csv")
# Optional: render to PNG/PDF
chart.save("figure.png", scale_factor=2.0) # 300 DPI
chart.save("figure.pdf")
Result: All 15 figures saved as .vl.json + .csv, with optional PNG/PDF rendering.
2. Data Loading: Handle Corrupted/Incomplete Runs Gracefully
Challenge: Real experiment data has failures (missing files, corrupted JSON, incomplete judges).
Solution: Try-except with informative warnings, skip bad runs, continue processing.
def load_run(run_dir: Path, experiment: str, tier: str, subtest: str) -> RunData:
try:
# Load run_result.json
with (run_dir / "run_result.json").open() as f:
result = json.load(f)
# Load per-judge evaluations
judges = []
judge_dir = run_dir / "judge"
if judge_dir.exists():
for judge_num in [1, 2, 3]:
judge_path = judge_dir / f"judge_0{judge_num}" / "judgment.json"
if judge_path.exists():
judges.append(load_judgment(judge_path, judge_num))
return RunData(...)
except Exception as e:
print(f"Warning: Failed to load {run_dir}: {e}")
# Return None or raise to skip this run
Result: 97.4% load success rate (2,238 of 2,298 runs loaded), graceful degradation.
3. DataFrame Architecture: Four Core DataFrames
Pattern: Build hierarchical views of the same data for different analysis levels.
# 1. Runs DataFrame (one row per run) runs_df = build_runs_df(experiments) # 2,238 rows # Columns: experiment, agent_model, tier, subtest, run_number, score, passed, cost, tokens, etc. # 2. Judges DataFrame (one row per judge evaluation) judges_df = build_judges_df(experiments) # 6,216 rows (3 judges × 2,238 runs) # Columns: (all run columns) + judge_model, judge_number, judge_score, judge_passed, etc. # 3. Criteria DataFrame (one row per criterion score) criteria_df = build_criteria_df(experiments) # 30,929 rows (5 criteria × 6,216 judges) # Columns: (all judge columns) + criterion, criterion_score, criterion_achieved, etc. # 4. Subtests DataFrame (pre-aggregated) subtests_df = build_subtests_df(runs_df) # 226 rows # Columns: experiment, tier, subtest, pass_rate, mean_score, consistency, cop, etc.
Why This Works:
- •✅ Different levels of granularity for different analyses
- •✅ Easy filtering:
runs_df[runs_df['tier'] == 'T0'] - •✅ Aggregation helpers reuse the same base data
- •✅ CSV export gives analysts full flexibility
4. Statistical Functions: Non-Parametric by Default
Observation: Scores are bounded [0, 1] and often bimodal → parametric tests inappropriate.
Solution: Use non-parametric tests throughout.
# Bootstrap 95% CI (no distribution assumptions) mean, ci_low, ci_high = bootstrap_ci(data, n_resamples=10000) # Mann-Whitney U (non-parametric significance test) u_stat, pvalue = mann_whitney_u(group1, group2) # Cliff's delta (non-parametric effect size) delta = cliffs_delta(group1, group2) # Interpretation: |δ| < 0.147 negligible, < 0.33 small, < 0.474 medium, ≥0.474 large # Krippendorff's alpha (inter-rater reliability, ordinal scale) alpha = krippendorff_alpha(ratings_matrix, level="ordinal")
Result: Robust statistics appropriate for bounded, potentially non-normal data.
5. Table Generation: Markdown + LaTeX Dual Output
Pattern: Generate both formats from the same data, single source of truth.
def table01_tier_summary(runs_df: pd.DataFrame) -> tuple[str, str]:
# Compute statistics
stats = compute_tier_stats(runs_df)
# Generate Markdown
md_lines = ["# Table 1: Tier Summary", ""]
md_lines.append("| Model | Tier | Pass Rate | ... |")
md_lines.append("|-------|------|-----------|-----|")
for _, row in stats.iterrows():
md_lines.append(f"| {row['Model']} | {row['Tier']} | {row['Pass Rate']:.3f} | ... |")
markdown = "\n".join(md_lines)
# Generate LaTeX
latex_lines = [
r"\begin{table}[htbp]",
r"\centering",
r"\caption{Tier Summary Statistics}",
r"\begin{tabular}{llrrr}",
r"\toprule",
r"Model & Tier & Pass Rate & ... \\",
r"\midrule",
]
for _, row in stats.iterrows():
latex_lines.append(f"{row['Model']} & {row['Tier']} & {row['Pass Rate']:.3f} & ... \\\\")
latex_lines.extend([r"\bottomrule", r"\end{tabular}", r"\end{table}"])
latex = "\n".join(latex_lines)
return markdown, latex
Result: 7 tables × 2 formats = 14 files, all generated from same data.
6. Master Orchestrator Script Pattern
Structure: Separate scripts for each step, master script calls them all.
scripts/ ├── export_data.py # Step 1: Export CSVs ├── generate_figures.py # Step 2: Generate 15 figures ├── generate_tables.py # Step 3: Generate 7 tables └── generate_all_results.py # Master: Runs 1→2→3 in sequence
Master script pattern:
def run_script(script_name: str, args: list[str], description: str) -> bool:
cmd = ["pixi", "run", "-e", "analysis", "python", script_name, *args]
result = subprocess.run(cmd, check=True)
return result.returncode == 0
# Run in sequence
success = True
if not args.skip_data:
success = run_script("scripts/export_data.py", [...], "Step 1/3: Export data")
if not args.skip_figures and success:
success = run_script("scripts/generate_figures.py", [...], "Step 2/3: Figures")
if not args.skip_tables and success:
success = run_script("scripts/generate_tables.py", [...], "Step 3/3: Tables")
Benefits:
- •✅ Can run individual steps for debugging
- •✅ Can skip steps (e.g.,
--skip-dataif CSVs already exported) - •✅ Fail-fast: stops on first error
- •✅ Clear progress reporting
7. Gitignore Strategy: Exclude All Generated Outputs
Pattern: Generated files (CSV, JSON, images, tables) go in .gitignore.
# Analysis pipeline outputs (generated at runtime) docs/data/*.csv docs/data/*.json docs/figures/*.vl.json docs/figures/*.csv docs/figures/*.png docs/figures/*.pdf docs/figures/*.svg docs/tables/*.md docs/tables/*.tex
Rationale:
- •✅ Generated files are runtime artifacts, not source code
- •✅ Avoids git bloat (2,238 rows of CSV would be huge)
- •✅ Forces reproducibility (can't commit stale outputs)
- •✅ Users generate fresh outputs from latest data
Important: Remove any previously committed generated files:
git rm --cached -r docs/data/ docs/figures/ docs/tables/
8. Color Palette: Consistent Across All Figures
Pattern: Define color palette once, reuse everywhere.
# figures/__init__.py
COLORS = {
"models": {
"Sonnet 4.5": "#4C78A8", # Blue
"Haiku 4.5": "#E45756", # Red
},
"tiers": {
"T0": "#66c2a5", "T1": "#fc8d62", "T2": "#8da0cb",
"T3": "#e78ac3", "T4": "#a6d854", "T5": "#ffd92f", "T6": "#e5c494"
},
"grades": {
"S": "#FFD700", "A": "#2ecc71", "B": "#3498db",
"C": "#f39c12", "D": "#e67e22", "F": "#e74c3c"
},
}
# Usage in figures
alt.Color("model:N", scale=alt.Scale(
domain=list(COLORS["models"].keys()),
range=list(COLORS["models"].values())
))
Result: All 15 figures use the same colors for the same entities → visual consistency.
9. Publication Theme: Serif Fonts, Clean Axes
Pattern: Register a custom Altair theme once, applies to all charts.
def apply_publication_theme() -> None:
theme = {
"config": {
"font": "serif",
"axis": {
"labelFontSize": 11,
"titleFontSize": 13,
"gridColor": "#e0e0e0",
},
"legend": {
"labelFontSize": 11,
"titleFontSize": 12,
},
"title": {
"fontSize": 14,
"anchor": "start",
},
"view": {"stroke": None},
}
}
alt.themes.register("publication", lambda: theme)
alt.themes.enable("publication")
# Call once in spec_builder.py
apply_publication_theme()
Result: All figures have serif fonts (academic papers), clean axes, no border.
Failed Attempts
❌ Attempt 1: Matplotlib as Primary Figure Format
What We Tried: Generate PNG images directly using matplotlib, commit to git.
Why It Failed:
- •❌ Binary PNG files in git → huge repository bloat
- •❌ Not version-controllable (can't see diffs)
- •❌ Hard to iterate (need to regenerate all images after small change)
- •❌ Not interactive (static images only)
Lesson: Use text-based formats (Vega-Lite JSON) as primary, render to images optionally.
❌ Attempt 2: Inline Data in Vega-Lite Specs (Too Large)
What We Tried: Put all data directly inside .vl.json files.
Why It Failed:
- •❌ Some figures have 2,238 rows → 500KB+ JSON files
- •❌ Hard to inspect data separately from spec
- •❌ Can't easily use data in other tools (R, Excel)
Solution: Save data as separate .csv files, reference from spec OR inline for small datasets.
# For small data: inline
chart = alt.Chart(data) # Data goes inside .vl.json
# For large data: separate CSV
chart = alt.Chart("figure.csv") # Reference external file
data.to_csv(output_dir / f"{name}.csv")
Result: Each figure has both .vl.json (spec) and .csv (data) for flexibility.
❌ Attempt 3: f-string Escaping for Special Characters
What We Tried: Use f"... {row['Cliff\\'s δ']} ..." in f-strings.
Why It Failed:
# SyntaxError: invalid syntax
f"{row['Cliff\\'s δ']:+.3f}"
Solution: Extract value to variable first.
cliffs_delta_val = row["Cliff's δ"]
f"{cliffs_delta_val:+.3f}"
Lesson: Don't escape quotes in f-string dictionary keys → extract to variable.
❌ Attempt 4: Ruff Linting - Docstring Imperative Mood
What We Tried: Use "Main entry point." for main() docstrings.
Why It Failed:
D401 First line of docstring should be in imperative mood: "Main entry point."
Solution: Use imperative mood.
# Bad
def main() -> None:
"""Main entry point."""
# Good
def main() -> None:
"""Run the script to generate results."""
Lesson: Ruff enforces imperative mood for docstrings (command, not statement).
❌ Attempt 5: Variable Names - Uppercase Grade Variables
What We Tried: Use grade_S, grade_A, etc. for clarity.
Why It Failed:
N806 Variable `grade_S` in function should be lowercase
Solution: Use lowercase with underscores.
# Bad
grade_S = grade_counts.get("S", 0)
# Good
grade_s = grade_counts.get("S", 0)
Lesson: Ruff enforces snake_case for all variables, even when uppercase seems clearer.
❌ Attempt 6: Long LaTeX Lines in Tables
What We Tried: Single-line LaTeX table headers for readability.
Why It Failed:
E501 Line too long (121 > 100)
Solution: Split across multiple lines.
# Bad (121 chars) r"Model & Tier & Subtests & Pass Rate (95\% CI) & Mean Score ($\pm\sigma$) & Median & Consistency & CoP (\$) \\" # Good (split) r"Model & Tier & Subtests & Pass Rate (95\% CI) & " r"Mean Score ($\pm\sigma$) & Median & Consistency & CoP (\$) \\"
Lesson: Ruff enforces 100-char line limit even for LaTeX strings.
❌ Attempt 7: Layer + Facet in Altair (Not Allowed)
What We Tried: Combine layered charts (scatter + line) with faceting.
# Failed chart = (scatter + line).facet(row="judge_y:N", column="judge_x:N") # Error: Facet charts require data to be specified at the top level
Why It Failed: Altair doesn't allow faceting on layered charts directly.
Solution: Use faceting without layers, or use repeat instead of facet.
# Good chart = scatter.facet(row="judge_y:N", column="judge_x:N") # Skip the diagonal line, or add it in post-processing
Lesson: Altair has restrictions on combining layers with facets.
Results & Parameters
Final Output Counts
| Category | Count | Formats |
|---|---|---|
| Data Files | 5 | CSV, JSON |
| Figures | 15 | .vl.json + .csv per figure |
| Tables | 7 | .md + .tex per table |
| Scripts | 4 | Python executable |
| Modules | 13 | Python source |
Data Statistics
From 2,238 loaded runs:
| Metric | Overall | Sonnet 4.5 | Haiku 4.5 |
|---|---|---|---|
| Pass Rate | 83.9% | 94.2% | 73.4% |
| Mean Score | 0.786 | 0.908 | 0.662 |
| Mean Cost/Run | $0.060 | $0.077 | $0.043 |
| Load Success | 97.4% | — | — |
Dependencies (pyproject.toml)
[project.optional-dependencies]
analysis = [
"matplotlib>=3.8",
"numpy>=1.24",
"pandas>=2.0",
"seaborn>=0.13",
"scipy>=1.11",
"altair>=5.0",
"vl-convert-python>=1.0",
]
Pixi Environment
[feature.analysis.pypi-dependencies]
matplotlib = ">=3.8"
numpy = ">=1.24"
pandas = ">=2.0"
seaborn = ">=0.13"
scipy = ">=1.11"
altair = ">=5.0"
vl-convert-python = ">=1.0"
[environments]
analysis = { features = ["dev", "analysis"], solve-group = "default" }
Usage Commands
# Install environment
pixi install -e analysis
# Generate everything
pixi run -e analysis python scripts/generate_all_results.py
# Generate with PNG/PDF rendering
pixi run -e analysis python scripts/generate_all_results.py --no-render=false
# Generate specific components
pixi run -e analysis python scripts/export_data.py # Data only
pixi run -e analysis python scripts/generate_figures.py # Figures only
pixi run -e analysis python scripts/generate_tables.py # Tables only
# View outputs
# - Data: docs/data/*.csv (Excel, Python, R)
# - Figures: docs/figures/*.vl.json (Vega Editor: https://vega.github.io/editor/)
# - Tables: docs/tables/*.{md,tex}
Key Takeaways
Architecture Decisions
- •Text-Based Outputs First: Use Vega-Lite JSON as primary format, render to images optionally
- •Four DataFrame Hierarchy: runs → judges → criteria, plus pre-aggregated subtests
- •Non-Parametric Statistics: Bootstrap CI, Mann-Whitney U, Cliff's delta for bounded data
- •Dual Table Formats: Generate Markdown + LaTeX from same data
- •Modular Scripts: Separate scripts per step, master orchestrator
Code Quality
- •Graceful Degradation: Try-except on data loading, 97.4% success rate
- •Consistent Styling: Single color palette, publication theme applied once
- •Version Control: Exclude generated files, commit only source code
- •Linting: Imperative docstrings, lowercase variables, 100-char lines
Performance
- •Load Time: ~30s to load 2,238 runs with 13,560 JSON files
- •Figure Generation: ~60s for all 15 figures (Vega-Lite is fast)
- •Table Generation: ~10s for all 7 tables
- •Total Pipeline: ~2 minutes for complete analysis
Future Improvements
- •Parallel Figure Generation: Use multiprocessing to generate figures in parallel
- •Incremental Updates: Only regenerate changed figures/tables
- •Interactive Dashboard: Streamlit/Plotly Dash web interface
- •Automated Narrative: Generate text snippets for paper sections
- •Cross-Experiment Comparison: Load multiple experiment sets, compare
Related Skills
- •
experiment-recovery-tools: Handles failed runs and judge retries - •
parallel-io-executor: Optimizes parallel I/O operations - •
statistical-analysis: Comprehensive statistical test library (if exists)
Questions This Skill Answers
- •How do I create version-controllable figures for papers? → Use Vega-Lite JSON
- •How do I handle corrupted experiment data? → Try-except with graceful degradation
- •How do I generate both Markdown and LaTeX tables? → Single function, dual output
- •How do I avoid git bloat from generated files? → Gitignore + separate data/specs
- •How do I ensure visual consistency across figures? → Single color palette + theme
- •How do I support multiple output consumers? → CSV data + Vega-Lite specs + LaTeX tables
- •How do I structure complex analysis pipelines? → Modular scripts + master orchestrator
- •What statistical tests for bounded [0,1] data? → Non-parametric (Mann-Whitney U, bootstrap CI)
Created: 2026-01-30 Project: ProjectScylla PR: #213 (merged) Success Rate: 100% (all outputs generated, linting passed, PR merged)