Analyze all VIF experiment logs and produce a structured cross-run comparison report. This skill is read-only — it interprets existing data and outputs findings to the conversation.
Data Collection
Step 1: Read the index
Read logs/experiments/index.md for the high-level summary table.
Step 2: Read all run files
Use Glob to find logs/experiments/runs/*.yaml, then read every file. Extract:
- •
metadata(run_id, model_name) - •
config(encoder, state_encoder, model, training) - •
data(n_train, pct_truncated, state_dim) - •
capacity(n_parameters, param_sample_ratio) - •
training_dynamics(best_epoch, gap_at_best) - •
evaluation(all aggregate metrics) - •
per_dimension(all 10 Schwartz dimensions)
Step 3: Identify axes of variation
Group runs by what changed between them:
- •Encoder: model name, embedding dimension, truncation
- •Capacity: hidden_dim, param count, param/sample ratio
- •Loss function: within a run, compare across loss heads
- •State encoder: window_size, state_dim
Anything identical across all runs is a constant — mention once, don't repeat.
Metric Interpretation Thresholds
Use these thresholds when characterizing results. Always cite the actual number alongside the qualitative label.
| Metric | Poor | Fair | Moderate | Good/Substantial |
|---|---|---|---|---|
| QWK | < 0.2 | 0.2 – 0.4 | 0.4 – 0.6 | > 0.6 |
| Calibration | < 0 (dangerous) | 0 – 0.1 (useless) | 0.1 – 0.3 (weak) | 0.3 – 0.6 (moderate), > 0.6 (good) |
| Hedging % | — | — | 60 – 80% (moderate) | < 60% (decisive) |
| Hedging % (excessive) | > 80% | — | — | — |
| Minority Recall | < 10% (ignores rare) | 10 – 30% (poor) | > 30% (reasonable) | — |
| Param/Sample Ratio | > 100 (severe) | 10 – 100 (high) | 1 – 10 (moderate) | < 1 (efficient) |
| Training Gap | > 0.5 (overfitting) | 0.1 – 0.5 (some) | < 0.1 (good) | — |
| Spearman | < 0.3 (weak) | 0.3 – 0.5 (moderate) | > 0.5 (strong) | — |
Report Structure
Produce the report in exactly these 8 sections. Cap the report at ~800 words excluding tables. Cite specific numbers and use run IDs (e.g., run_001, run_002). When two runs differ by < 5% on a metric, say "comparable" rather than declaring a winner.
1. Experiment Overview
- •What varied across runs (encoder, capacity, loss, etc.)
- •What stayed constant (training hyperparameters, data splits, seed)
- •Dataset size and any data notes (e.g., truncation %)
2. Head-to-Head Comparison
For each axis of variation, produce a compact comparison table. Flag the winner per metric. Use bold for the better value. Example:
| Metric | run_001 (MiniLM-384d) | run_002 (nomic-256d) | Delta |
|---|
Cover: MAE, Accuracy, QWK, Spearman, Calibration, Minority Recall, Hedging.
If there are multiple loss functions, also compare within each run across losses.
3. Per-Dimension Analysis
Identify:
- •Easy dimensions: consistently high QWK (> 0.4) across runs
- •Hard dimensions: consistently low QWK (< 0.3) across runs
- •Volatile dimensions: large QWK variance across runs
Present as a compact table sorted by mean QWK across all runs.
4. Calibration Deep-Dive
- •How many dimensions have positive calibration per run?
- •Global calibration comparison
- •Flag any dimension with calibration < -0.4 as a deployment risk
- •Note if negative calibration is systematic (model always over/under-confident)
5. Hedging vs Minority Recall Trade-off
- •For each loss function, plot the hedging % against minority recall
- •Identify which losses achieve low hedging AND reasonable minority recall
- •Flag any loss where hedging > 60% (over-predicting the majority class)
6. Capacity & Overfitting
- •Compare param/sample ratios and characterize using thresholds
- •Compare training gaps (train_loss - val_loss at best epoch)
- •Note best_epoch vs total_epochs (did early stopping trigger appropriately?)
- •Flag if a larger model overfits more than a smaller one
7. Actionable Recommendations
Provide 3–5 concrete, motivated, testable next steps. Each should:
- •Reference the specific evidence from the analysis
- •Be scoped to a single experiment or change
- •Include what metric improvement to watch for
Examples of good recommendations:
- •"Try nomic-embed with hidden_dim=128 (between 32 and 256) to find the capacity sweet spot — watch param/sample ratio and training gap"
- •"Investigate why power dimension has near-zero QWK across all runs — check label distribution"
8. Summary Verdict
- •Best config: which run_id + loss combination looks most promising and why
- •Key weakness: the single biggest limitation across all experiments
- •Highest-leverage next experiment: one thing to try that would most improve results
Style Constraints
- •Use the threshold table above for all qualitative characterizations
- •Always cite the actual number: "QWK 0.42 (moderate)" not just "moderate QWK"
- •Context: this is a capstone POC with ~637 training samples — focus on relative comparisons, not absolute benchmarks
- •Do not editorialize about the project or its goals — stick to the data
- •If a metric is missing or an observations field says
<fill in>, note it but don't speculate