AgentSkillsCN

ML Experiment Tracking

Record and track machine learning experiments with reproducible parameters and metrics.

SKILL.md
---
name: ML Experiment Tracking
description: Track machine learning experiments with reproducible parameters and metrics
category: ml
version: 1.0.0
triggers:
  - model-training
  - experiment-run
  - hyperparameter-search
globs: "**/models/**,**/experiments/**,**/ml/**"
---

ML Experiment Tracking Skill

Track machine learning experiments with reproducible parameters and metrics.

Trigger Conditions

  • Model configuration changes or hyperparameter updates
  • New experiment run initiated
  • User invokes the skill with "track experiment" or "compare models"

Input Contract

  • Required: Experiment parameters (model, hyperparameters, data)
  • Required: Evaluation metrics
  • Optional: Baseline comparison, hypothesis
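
The input contract above could be represented as a small validated record. The following is a minimal sketch, not part of the skill itself: the class name `ExperimentInput`, its field names, and the `validate` method are all hypothetical, chosen only to mirror the required and optional items listed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExperimentInput:
    """Hypothetical container mirroring the input contract (names are assumptions)."""
    model: str
    hyperparameters: dict
    data: str                        # e.g. a dataset name or version tag
    metrics: dict                    # evaluation metrics, name -> value
    baseline: Optional[str] = None   # optional: run to compare against
    hypothesis: Optional[str] = None # optional: what the run is meant to test

    def validate(self) -> list:
        """Return a list of problems; an empty list means the input is usable."""
        problems = []
        if not self.model:
            problems.append("model is required")
        if not self.hyperparameters:
            problems.append("hyperparameters are required")
        if not self.data:
            problems.append("data is required")
        if not self.metrics:
            problems.append("evaluation metrics are required")
        return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every missing field in a single pass.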

Output Contract

  • Experiment log entry with full reproducibility info
  • Comparison table against baseline/prior runs
  • Recommendation on whether to promote or iterate
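
One way to produce the comparison table is to compute per-metric deltas for the metrics both runs share. This is an illustrative sketch only: the function name `comparison_table` and the column layout are assumptions, and the skill does not prescribe a table format.

```python
def comparison_table(baseline: dict, candidate: dict) -> str:
    """Render a plain-text table of metric deltas (candidate minus baseline).

    Only metrics present in both runs are listed, in alphabetical order.
    """
    rows = [f"{'metric':<12}{'baseline':>10}{'candidate':>11}{'delta':>9}"]
    for name in sorted(baseline.keys() & candidate.keys()):
        b, c = baseline[name], candidate[name]
        rows.append(f"{name:<12}{b:>10.3f}{c:>11.3f}{c - b:>+9.3f}")
    return "\n".join(rows)


if __name__ == "__main__":
    print(comparison_table(
        {"f1": 0.92, "latency_ms": 45},
        {"f1": 0.89, "latency_ms": 12},
    ))
```

Whether a negative delta is good or bad depends on the metric (lower latency is better, lower F1 is worse), so the table reports signed differences and leaves the interpretation to the recommendation step.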

Tool Permissions

  • Read: Model configs, training data metadata, metric logs
  • Write: Experiment logs, comparison reports
  • Execute: Metric collection commands

Execution Steps

  1. Record experiment hypothesis and parameters
  2. Capture environment (dependencies, data version, code commit)
  3. Execute or observe training run
  4. Collect metrics and artifacts
  5. Compare against baseline and prior experiments
  6. Recommend: promote, iterate, or abandon
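
Steps 1, 2, and 4 above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the function names (`current_commit`, `build_log_entry`, `append_entry`), the record layout, and the JSON Lines log file are hypothetical choices, and the data version is assumed to be supplied by the caller. A full capture of dependencies (for example, a pinned package list) is left out for brevity.

```python
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone


def current_commit() -> str:
    """Return the current git commit hash, or "unknown" if it cannot be read."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or the working directory is not a repository
        return "unknown"


def build_log_entry(hypothesis, params, data_version, metrics):
    """Assemble one experiment record: hypothesis, parameters, environment, metrics."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "hypothesis": hypothesis,
        "params": params,
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
            "code_commit": current_commit(),
            "data_version": data_version,
        },
        "metrics": metrics,
    }


def append_entry(path, entry):
    """Append the record as one JSON line so earlier runs are never overwritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, sort_keys=True) + "\n")
```

An append-only log keeps prior runs available for the comparison in step 5 without needing a database.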

Success Criteria

  • Experiment is fully reproducible from logged parameters
  • Metrics compared against baseline
  • Clear recommendation with rationale
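
A lightweight aid for the first criterion is a deterministic fingerprint over the logged parameters, code commit, and data version, so two runs can be checked for identical inputs. The sketch below is an assumption about how this might be done; `run_fingerprint` is a hypothetical name. A matching fingerprint shows only that the logged inputs agree; it does not by itself guarantee identical results, since random seeds and nondeterministic operations also matter.

```python
import hashlib
import json


def run_fingerprint(params: dict, code_commit: str, data_version: str) -> str:
    """Hash the logged inputs of a run into a stable hex digest.

    Keys are sorted before hashing so dictionary ordering does not
    change the result.
    """
    payload = json.dumps(
        {"params": params, "code_commit": code_commit, "data_version": data_version},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```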

Escalation Rules

  • Escalate if model performance degrades vs. baseline
  • Escalate if data drift detected in training set
  • Escalate if experiment requires new infrastructure
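
The three escalation rules can be expressed as a single check that returns the reasons that apply. This is a hedged sketch: `escalation_reasons` and its parameters are hypothetical, the score is assumed to be a higher-is-better metric, and drift detection and infrastructure needs are assumed to be determined elsewhere and passed in as flags.

```python
def escalation_reasons(baseline_score, candidate_score,
                       drift_detected=False, needs_new_infra=False,
                       tolerance=0.0):
    """Return the escalation rules triggered by a run (empty list if none).

    Assumes a higher-is-better score; `tolerance` is the largest drop
    that is still treated as noise rather than degradation.
    """
    reasons = []
    if candidate_score < baseline_score - tolerance:
        reasons.append("performance degraded vs. baseline")
    if drift_detected:
        reasons.append("data drift detected in training set")
    if needs_new_infra:
        reasons.append("experiment requires new infrastructure")
    return reasons
```

The `tolerance` parameter is an addition not mentioned in the rules; with its default of zero, any drop below the baseline triggers escalation.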

Example Invocations

Input: "Compare the BERT-base and DistilBERT models for our classification task"

Output: Experiment log: BERT-base (F1: 0.92, latency: 45ms, size: 440MB) vs DistilBERT (F1: 0.89, latency: 12ms, size: 260MB). Recommendation: DistilBERT for production (a 0.03 drop in F1 in exchange for a 73% latency reduction). Promote to staging for A/B test.
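
The trade-off figures in the example output can be checked with a few lines of arithmetic. The helper name `relative_change` is hypothetical; the input numbers are taken from the example above.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Fractional change from baseline to candidate (negative means a decrease)."""
    return (candidate - baseline) / baseline


f1_absolute = 0.89 - 0.92                   # about -0.03 (three F1 points)
f1_relative = relative_change(0.92, 0.89)   # about -3.3% relative to BERT-base
latency_relative = relative_change(45, 12)  # about -73%
size_relative = relative_change(440, 260)   # about -41%

if __name__ == "__main__":
    print(f"F1: {f1_absolute:+.2f} absolute, {f1_relative:+.1%} relative")
    print(f"Latency: {latency_relative:+.0%}")
    print(f"Size: {size_relative:+.0%}")
```

The F1 difference is 0.03 in absolute terms and about 3.3% in relative terms, so a report should state which of the two it means.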