<Use_When>
- User says "log experiment", "record results", "track results"
- User asks "what did we try?", "what worked?", "what failed?"
- User wants to compare experiment runs or parameter configurations
- User says "reproduce", "reproducibility", "replicate"
- User is about to run an experiment and wants to capture the setup
- User finishes an experiment and wants to record outcomes
- User says "experiment", "run", "trial", "parameter sweep"
</Use_When>
<Do_Not_Use_When>
- Data analysis or statistical testing on results -- use `research-analysis` instead
- Reading or reviewing academic papers -- use `paper-review` instead
- General research or literature search -- use `lit-review` or `research` instead
- One-time script execution without tracking -- just use Bash directly
- Logging non-experiment events (deploys, incidents) -- use `dev-workflow` instead
</Do_Not_Use_When>
<Why_This_Exists> Experiments without systematic logging lead to three critical failures: (1) lost insights when you cannot remember what parameters produced good results, (2) irreproducible results when the environment, commit, or exact command is not captured, and (3) repeated failures when you retry configurations that already failed. Every experiment needs parameters, results, environment context, and a reproducibility trail stored persistently. </Why_This_Exists>
<Execution_Policy>
- Every experiment gets a unique ID in the format `exp-YYYYMMDD-HHMMSS-<short-hash>`
- Always capture the git commit hash and dirty status before logging
- Always capture relevant environment variables and package versions
- Automatically compare against previous experiments on the same topic
- Store all experiment data in both the knowledge graph and a local JSON log
- Default model routing: experiment-tracker agent at sonnet tier
- Never overwrite existing experiment entries -- append only
</Execution_Policy>
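The ID format above can be produced with a few lines of Python. This is a minimal sketch, not the tracker's actual implementation: the policy does not say what `<short-hash>` is derived from, so the code *assumes* it is the first 4 hex characters of a SHA-1 over the parameters serialized as sorted-key JSON. `make_experiment_id` is a hypothetical helper name.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def make_experiment_id(parameters: dict, now: Optional[datetime] = None) -> str:
    """Return an ID shaped like exp-YYYYMMDD-HHMMSS-<short-hash>.

    Assumption: <short-hash> is the first 4 hex characters of a SHA-1 digest
    of the parameters serialized as canonical (sorted-key) JSON.
    """
    now = now or datetime.now(timezone.utc)
    # Sorting keys makes the hash independent of dict insertion order.
    canonical = json.dumps(parameters, sort_keys=True, separators=(",", ":"))
    short_hash = hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:4]
    return f"exp-{now:%Y%m%d-%H%M%S}-{short_hash}"
```

Because this hash covers only the parameters, rerunning an identical configuration within the same second yields the same ID; the counter-suffix rule under Troubleshooting covers that case.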
- Capture Environment: Snapshot the execution context
  - Git commit and dirty status: `git rev-parse HEAD`, plus `git status --porcelain` (any output means dirty) and `git diff --stat`
  - Branch: `git branch --show-current`
  - Package versions: language-specific (`pip freeze`, `npm list`, etc.)
  - System info: OS, CPU, memory, GPU if applicable
  - Environment variables: relevant env vars (filtered for secrets)
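
  ```bash
  git rev-parse HEAD          # commit hash
  git branch --show-current   # branch name
  git status --porcelain      # any output means the working tree is dirty
  git diff --stat             # summary of uncommitted changes
  python --version 2>/dev/null || node --version 2>/dev/null
  ```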
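The same snapshot can be gathered programmatically. The sketch below is an illustration under stated assumptions: the secret-name markers and the allowlisted env-var prefixes (`CUDA_`, `PYTHON`, `OMP_`) are hypothetical choices, memory and GPU detection are left out to stay stdlib-only, and outside a git repository it warns and records `None`, as the escalation rules require.

```python
import os
import platform
import subprocess
import sys
from typing import Optional

# Hypothetical filter: env vars whose names contain any of these are dropped.
SECRET_MARKERS = ("KEY", "TOKEN", "SECRET", "PASSWORD", "CREDENTIAL")


def _git(*args: str) -> Optional[str]:
    """Run a git command; return stripped stdout, or None if git is missing or fails."""
    try:
        proc = subprocess.run(["git", *args], capture_output=True, text=True, check=False)
    except OSError:  # git executable not installed
        return None
    return proc.stdout.strip() if proc.returncode == 0 else None


def capture_environment(env_prefixes=("CUDA_", "PYTHON", "OMP_")) -> dict:
    commit = _git("rev-parse", "HEAD")
    if commit is None:
        print("warning: git state unavailable; reproducibility is reduced", file=sys.stderr)
    status = _git("status", "--porcelain")
    env_vars = {
        name: value
        for name, value in os.environ.items()
        if name.startswith(env_prefixes)  # hypothetical allowlist of relevant prefixes
        and not any(marker in name.upper() for marker in SECRET_MARKERS)
    }
    return {
        "git_commit": commit,
        "git_branch": _git("branch", "--show-current"),
        # Non-empty porcelain output means dirty; None when git state is unknown.
        "git_dirty": bool(status) if status is not None else None,
        "python_version": platform.python_version(),
        "system": {
            "os": f"{platform.system()} {platform.release()}",
            "cpu": platform.processor() or platform.machine(),
        },
        "env_vars": env_vars,
    }
```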
- Execute (optional): Run the experiment command if provided
  - Capture stdout/stderr
  - Record wall-clock time
  - Record the exit code
  - If long-running, use `run_in_background: true`
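For a short, foreground run, the capture step can look roughly like this. It is a sketch only: the output field names (`stdout`, `stderr`, `exit_code`, `wall_clock_s`) are hypothetical, and a non-zero exit is returned as `"failed"` rather than raised so the run can still be logged.

```python
import shlex
import subprocess
import time


def run_experiment(command: list) -> dict:
    """Run a command and return its output, exit code, and wall-clock time."""
    start = time.monotonic()  # monotonic clock is unaffected by system clock changes
    proc = subprocess.run(command, capture_output=True, text=True, check=False)
    elapsed_s = time.monotonic() - start
    return {
        "command": shlex.join(command),  # exact command string for reproducibility
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "exit_code": proc.returncode,
        "wall_clock_s": round(elapsed_s, 3),
        "status": "completed" if proc.returncode == 0 else "failed",
    }
```

This call blocks until the process exits; long-running jobs would use `run_in_background: true` as described above.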
- Record Results: Log outcomes and observations
  - Primary metric(s) with exact values
  - Secondary metrics
  - Observations and qualitative notes
  - Error messages if the experiment failed
  - Artifacts produced (model files, plots, logs)

  ```
  sc_memory_add_relation(
      from="exp-20240115-143022-a1b2",
      to="result-accuracy-0.847",
      type="produced_result"
  )
  ```
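The results section of a log entry and its relation payloads can be assembled together. In this sketch the `result-<metric>-<value>` node naming is *inferred* from the single example above, not a documented convention, and `record_results` is a hypothetical helper; the relations are plain dicts because `from` is a reserved word in Python.

```python
def record_results(exp_id, primary, secondary=None, observations="", artifacts=(), error=None):
    """Build the results part of a log entry plus knowledge-graph relation payloads."""
    entry = {
        "results": {
            "primary": dict(primary),
            "secondary": dict(secondary or {}),
            "artifacts": list(artifacts),
        },
        "observations": observations,
        # Failed runs are still logged; they just carry an error message.
        "status": "failed" if error else "completed",
    }
    if error:
        entry["results"]["error"] = error
    # Hypothetical naming scheme inferred from the node "result-accuracy-0.847".
    relations = [
        {"from": exp_id, "to": f"result-{name}-{value}", "type": "produced_result"}
        for name, value in primary.items()
    ]
    return entry, relations
```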
- Compare with Previous: Query memory for related experiments

  ```
  sc_memory_search(query="experiment <topic>", category="experiment")
  sc_memory_graph_query(query="experiments with method=<method>")
  ```

  - Generate a comparison table: parameters vs. results across runs
  - Highlight improvements, regressions, and anomalies
  - Note which parameter changes correlated with result changes
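Once previous records are loaded, classifying each comparison and listing the parameters that changed is straightforward. This sketch assumes records shaped like the JSON log shown later in this document; it covers improvements, regressions, and parameter diffs only (anomaly detection is left out), and the function names are hypothetical.

```python
def changed_params(old: dict, new: dict) -> dict:
    """Return {name: (old_value, new_value)} for every parameter that differs."""
    return {k: (old.get(k), new.get(k)) for k in sorted(set(old) | set(new)) if old.get(k) != new.get(k)}


def compare_with_previous(current: dict, previous: list, metric: str, higher_is_better: bool = True) -> list:
    current_value = current["results"]["primary"][metric]
    rows = []
    for prev in previous:
        prev_value = prev.get("results", {}).get("primary", {}).get(metric)
        if prev_value is None:
            continue  # earlier run did not report this metric (e.g. it failed)
        delta = current_value - prev_value
        if delta == 0:
            verdict = "no change"
        elif (delta > 0) == higher_is_better:
            verdict = "improvement"
        else:
            verdict = "regression"
        rows.append({
            "vs": prev["id"],
            "delta": delta,
            "verdict": verdict,
            "changed_params": changed_params(prev.get("parameters", {}), current.get("parameters", {})),
        })
    return rows
```

Pass `higher_is_better=False` for metrics such as loss, where a decrease is the improvement.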
- Store Persistently: Save the complete experiment record
  - Knowledge graph: entity + relations
  - Local JSON log: `~/superclaw/data/experiments/<exp-id>.json`
  - Memory store: searchable text record

  ```
  sc_memory_store(
      content="Experiment <id>: <summary>",
      category="experiment",
      confidence=1.0
  )
  ```
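The "never overwrite" rule maps naturally onto exclusive file creation. A minimal sketch, assuming one JSON file per experiment in the directory named above: opening with mode `"x"` fails if the file already exists, and on a collision the code appends the `-2`, `-3`, ... counter suffix described under Troubleshooting. `save_experiment` is a hypothetical helper.

```python
import json
from pathlib import Path

# Default location taken from the steps above.
DEFAULT_LOG_DIR = Path.home() / "superclaw" / "data" / "experiments"


def save_experiment(record: dict, log_dir: Path = DEFAULT_LOG_DIR) -> Path:
    """Write record to <log_dir>/<id>.json, never overwriting an existing file.

    On an ID collision, a counter suffix (-2, -3, ...) is appended and the
    written copy's "id" field uses the suffixed ID.
    """
    log_dir.mkdir(parents=True, exist_ok=True)
    base_id = record["id"]
    counter = 1
    while True:
        exp_id = base_id if counter == 1 else f"{base_id}-{counter}"
        path = log_dir / f"{exp_id}.json"
        try:
            # Mode "x" raises FileExistsError instead of truncating an existing log.
            with path.open("x", encoding="utf-8") as fh:
                json.dump({**record, "id": exp_id}, fh, indent=2)
            return path
        except FileExistsError:
            counter += 1
```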
<Tool_Usage>
- `sc_memory_store` -- Save searchable experiment summary
- `sc_memory_search` -- Find previous experiments on the same topic
- `sc_memory_add_entity` -- Create experiment entity in knowledge graph
- `sc_memory_add_relation` -- Link experiments to results, methods, and papers
- `sc_memory_graph_query` -- Query experiment history for comparisons
- `Bash` -- Capture git state, environment, run experiments, measure timing
- `Write` -- Save experiment JSON logs to `~/superclaw/data/experiments/`
- `Read` -- Load previous experiment logs for comparison
- `Grep` -- Search experiment logs for specific parameters or results
- `Glob` -- Find experiment log files matching patterns
</Tool_Usage>
<Escalation_And_Stop_Conditions>
- If git state cannot be captured (not a git repo), log without it but warn the user about reduced reproducibility
- If the experiment command fails, still log the failure with error details (failed experiments are valuable data)
- If no previous experiments exist for comparison, skip the comparison step and note that this is the first run
- If memory storage fails, save the JSON log locally and retry memory storage later
- If the user provides incomplete parameters, ask for the missing critical ones before logging
</Escalation_And_Stop_Conditions>
<Final_Checklist>
- [ ] Unique experiment ID generated
- [ ] Hypothesis or objective recorded
- [ ] All parameters captured with exact values
- [ ] Git commit hash and dirty status recorded
- [ ] Environment snapshot taken (versions, system info)
- [ ] Results recorded with primary and secondary metrics
- [ ] Comparison with previous experiments generated (if any exist)
- [ ] Experiment entity stored in knowledge graph
- [ ] JSON log saved to `~/superclaw/data/experiments/`
- [ ] Searchable summary stored in memory
</Final_Checklist>
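The record-level items of this checklist can be checked mechanically before logging. This is a sketch assuming the JSON log layout shown below; the storage items (knowledge graph, JSON file, memory) are side effects that cannot be verified from the record alone and are omitted. Per the escalation rules, a missing git item should produce a warning rather than block logging. `missing_checklist_items` is a hypothetical name.

```python
def _env(r: dict) -> dict:
    return r.get("environment", {})


# Checklist item -> predicate over a single experiment record.
CHECKS = {
    "Unique experiment ID generated": lambda r: bool(r.get("id")),
    "Hypothesis or objective recorded": lambda r: bool(r.get("hypothesis")),
    "All parameters captured with exact values": lambda r: bool(r.get("parameters")),
    "Git commit hash and dirty status recorded": lambda r: bool(_env(r).get("git_commit")) and "git_dirty" in _env(r),
    "Environment snapshot taken (versions, system info)": lambda r: bool(_env(r).get("system")),
    "Results recorded with primary and secondary metrics": (
        lambda r: bool(r.get("results", {}).get("primary")) and "secondary" in r.get("results", {})
    ),
}


def missing_checklist_items(record: dict) -> list:
    """Return the checklist items that the record does not yet satisfy."""
    return [item for item, check in CHECKS.items() if not check(record)]
```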
```json
{
  "id": "exp-20240115-143022-a1b2",
  "timestamp": "2024-01-15T14:30:22Z",
  "hypothesis": "Lower learning rate will improve convergence",
  "parameters": {
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 100,
    "optimizer": "adam"
  },
  "environment": {
    "git_commit": "abc123def456",
    "git_branch": "feature/experiment",
    "git_dirty": false,
    "python_version": "3.11.5",
    "packages": {"torch": "2.1.0", "numpy": "1.24.0"},
    "system": {"os": "Darwin 24.6.0", "cpu": "Apple M2", "memory": "16GB"}
  },
  "command": "python train.py --lr 0.001 --batch 32",
  "results": {
    "primary": {"accuracy": 0.847, "loss": 0.312},
    "secondary": {"training_time_s": 3600, "peak_memory_mb": 4096},
    "artifacts": ["models/exp-a1b2.pt", "plots/loss-curve-a1b2.png"]
  },
  "observations": "Model converged faster than lr=0.01 run. No overfitting observed.",
  "status": "completed",
  "linked_paper": "attention-is-all-you-need-2017",
  "tags": ["transformer", "learning-rate-search"]
}
```
Comparison Table Format

| Exp ID   | LR    | Batch | Accuracy | Loss  | Time  | Status    |
|----------|-------|-------|----------|-------|-------|-----------|
| exp-a1b2 | 0.001 | 32    | 0.847    | 0.312 | 60min | completed |
| exp-c3d4 | 0.01  | 32    | 0.823    | 0.389 | 45min | completed |
| exp-e5f6 | 0.001 | 64    | 0.831    | 0.341 | 50min | completed |
| exp-g7h8 | 0.1   | 32    | 0.790    | 0.567 | 30min | completed |
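A table in this format can be rendered from JSON log records. In this sketch the column mappings are passed in by the caller, the short `exp-<hash>` form is an *assumption* inferred from the table (the last segment of the full ID), and "Time" is derived from `results.secondary.training_time_s`. `render_comparison_table` is a hypothetical helper.

```python
def short_id(exp_id: str) -> str:
    # Assumption: the table abbreviates "exp-YYYYMMDD-HHMMSS-<hash>" to "exp-<hash>".
    return "exp-" + exp_id.rsplit("-", 1)[-1]


def render_comparison_table(records, param_cols, metric_cols, sort_by=None):
    """Render records as a Markdown table.

    param_cols / metric_cols map column labels to keys in record["parameters"] /
    record["results"]["primary"]; sort_by optionally sorts by a primary metric, descending.
    """
    if sort_by is not None:
        records = sorted(
            records,
            key=lambda r: r["results"]["primary"].get(sort_by, float("-inf")),
            reverse=True,
        )
    header = ["Exp ID", *param_cols, *metric_cols, "Time", "Status"]
    lines = ["| " + " | ".join(header) + " |", "|" + "|".join("---" for _ in header) + "|"]
    for r in records:
        seconds = r["results"].get("secondary", {}).get("training_time_s")
        row = [
            short_id(r["id"]),
            *(str(r["parameters"].get(key, "")) for key in param_cols.values()),
            *(str(r["results"]["primary"].get(key, "")) for key in metric_cols.values()),
            f"{round(seconds / 60)}min" if seconds is not None else "",
            r.get("status", ""),
        ]
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```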
Reproducibility Checklist Template
- [ ] Exact command recorded
- [ ] Git commit pinned (clean state preferred)
- [ ] Random seeds fixed and recorded
- [ ] Package versions frozen
- [ ] Data version or hash recorded
- [ ] Hardware specs noted (GPU model if used)
- [ ] Environment variables captured
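Two of these items, the seed and the data hash, are easy to automate. A stdlib-only sketch: the field names `seed` and `data_sha256` are hypothetical (they do not appear in the JSON log example), and only Python's `random` module is seeded here, since NumPy, PyTorch, and similar frameworks keep their own RNG state and need their own seeding calls.

```python
import hashlib
import random
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a data file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fix_seed(seed: int) -> int:
    """Seed Python's `random` module and return the seed so it can be logged."""
    random.seed(seed)
    return seed


def reproducibility_fields(data_path: Path, seed: int) -> dict:
    # Hypothetical field names to merge into the experiment record.
    return {"seed": fix_seed(seed), "data_sha256": file_sha256(data_path)}
```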
Automated Parameter Sweeps
For systematic parameter exploration:
- Define a parameter grid, e.g. `{lr: [0.001, 0.01, 0.1], batch: [16, 32, 64]}`
- Generate experiment entries for each combination
- Execute sequentially or in parallel (if resources allow)
- Auto-generate a comparison table on completion
- Highlight Pareto-optimal configurations
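Grid expansion and Pareto filtering can be sketched in a few lines. The grid is expanded as a Cartesian product, so the 3 x 3 example grid yields 9 configurations. A configuration is Pareto-optimal when no other run is at least as good on every objective and strictly better on at least one. The objective choice (e.g. maximize accuracy, minimize training time) is up to the caller; both function names are hypothetical.

```python
import itertools


def expand_grid(grid: dict) -> list:
    """Turn {param: [values, ...]} into one dict per combination (Cartesian product)."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in itertools.product(*(grid[k] for k in keys))]


def pareto_front(points, maximize):
    """Return labels of non-dominated points.

    points: list of (label, objectives_tuple); maximize: one bool per objective.
    """
    def dominates(a, b):
        at_least_as_good = all((x >= y) if m else (x <= y) for x, y, m in zip(a, b, maximize))
        strictly_better = any((x > y) if m else (x < y) for x, y, m in zip(a, b, maximize))
        return at_least_as_good and strictly_better

    return [label for label, obj in points if not any(dominates(other, obj) for _, other in points)]
```

A point never dominates itself (it is not strictly better than itself), so the self-comparison inside `pareto_front` is harmless.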
Linking to Papers
When an experiment is inspired by or replicates a paper:
```
sc_memory_add_relation(
    from="exp-20240115-143022-a1b2",
    to="attention-is-all-you-need-2017",
    type="replicates"
)
```

Relation types: `replicates`, `inspired_by`, `extends`, `contradicts`
Exporting to CSV
Generate CSV from experiment history for external analysis:
```
# Query all experiments, format as CSV
sc_memory_search(query="experiment", category="experiment")
# Parse results into CSV format
```

Output to `~/superclaw/data/experiments/export-YYYYMMDD.csv`
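As an alternative to parsing memory search results, the CSV can be built directly from the local JSON logs. This sketch assumes one `exp-*.json` file per experiment and flattens parameters and primary metrics into hypothetical `param.*` and `metric.*` columns; runs missing a column get an empty cell.

```python
import csv
import json
from datetime import date
from pathlib import Path
from typing import Optional


def export_csv(log_dir: Path, out_path: Optional[Path] = None) -> Path:
    """Flatten every exp-*.json log in log_dir into one CSV file."""
    rows = []
    for path in sorted(log_dir.glob("exp-*.json")):
        rec = json.loads(path.read_text(encoding="utf-8"))
        row = {"id": rec["id"], "timestamp": rec.get("timestamp", ""), "status": rec.get("status", "")}
        row.update({f"param.{k}": v for k, v in rec.get("parameters", {}).items()})
        row.update({f"metric.{k}": v for k, v in rec.get("results", {}).get("primary", {}).items()})
        rows.append(row)
    # Union of all columns, in first-seen order.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    out_path = out_path or log_dir / f"export-{date.today():%Y%m%d}.csv"
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)  # missing keys are written as ""
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```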
Troubleshooting
Experiment ID collision?
- IDs combine a timestamp with a hash, so collisions are extremely unlikely
- If one does happen, append a counter suffix, e.g. `-a1b2-2`
Git state capture failing?
- Verify that the working directory is a git repository
- For non-git projects, skip git capture but log a warning
Comparison table too large?
- Filter by date range, tags, or specific parameters
- Show only the top N experiments by primary metric
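Both remedies can be combined in one filtering pass over loaded records before rendering the table. A sketch assuming the JSON log fields `timestamp`, `tags`, `parameters`, and `results.primary`; `since` must be a timezone-aware `datetime` because the logged timestamps are UTC. `filter_experiments` is a hypothetical helper.

```python
from datetime import datetime


def filter_experiments(records, metric, top_n=10, tags=None, since=None, params=None, higher_is_better=True):
    """Keep records matching every filter, then return the top_n by a primary metric."""
    selected = []
    for r in records:
        if tags and not set(tags) <= set(r.get("tags", [])):
            continue  # must carry all requested tags
        if since and datetime.fromisoformat(r["timestamp"].replace("Z", "+00:00")) < since:
            continue  # older than the date-range cutoff
        if params and any(r.get("parameters", {}).get(k) != v for k, v in params.items()):
            continue  # parameter values must match exactly
        if metric not in r.get("results", {}).get("primary", {}):
            continue  # cannot rank runs that lack the metric
        selected.append(r)
    selected.sort(key=lambda r: r["results"]["primary"][metric], reverse=higher_is_better)
    return selected[:top_n]
```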