model
Use this skill for algorithm work, model development, and model fine-tuning: from data through training and evaluation to deployment.
Defaults / assumptions to confirm
- Goal: improve quality, reduce cost/latency, add domain knowledge, safety alignment?
- Base model and license constraints
- Hardware: local GPU / multi-GPU / cloud
- Target inference stack (vLLM, TGI, llama.cpp, etc.)
Workflow
- Define the objective and success metrics
  - Task definition and input/output format.
  - Primary metrics (task-specific) + guardrails (safety, latency, cost).
  - Failure analysis categories (hallucination, format errors, refusal, toxicity).
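
One way to make "primary metric plus guardrails" concrete is a small acceptance gate, sketched below. The metric names (`task_accuracy`, `unsafe_rate`, `p95_latency_s`) and the limits are hypothetical placeholders, and the sketch assumes every guardrail is an upper bound.

```python
def passes_gate(candidate, baseline, primary="task_accuracy", guardrails=None):
    """Accept a candidate only if the primary metric improves and no guardrail is exceeded.

    `candidate` and `baseline` are dicts of metric name -> value.
    `guardrails` maps a metric name to its maximum allowed value (assumed upper bounds).
    Returns (ok, reasons), where `reasons` lists every failed check.
    """
    guardrails = guardrails or {}
    reasons = []
    if candidate[primary] <= baseline[primary]:
        reasons.append(f"{primary} did not improve")
    for name, limit in guardrails.items():
        if candidate[name] > limit:
            reasons.append(f"{name}={candidate[name]} exceeds limit {limit}")
    return (not reasons, reasons)


# Hypothetical numbers, for illustration only.
baseline = {"task_accuracy": 0.71, "unsafe_rate": 0.010, "p95_latency_s": 1.2}
candidate = {"task_accuracy": 0.78, "unsafe_rate": 0.030, "p95_latency_s": 1.1}
limits = {"unsafe_rate": 0.02, "p95_latency_s": 1.5}
ok, reasons = passes_gate(candidate, baseline, guardrails=limits)
```

Here the candidate improves accuracy but is rejected because its unsafe rate exceeds the limit, which is the behavior a guardrail is meant to enforce.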
- Data strategy (most important)
  - Collect/curate the dataset; define labeling guidelines.
  - Remove exact duplicates, near-duplicates, PII, and anything that leaks eval data into training.
  - Balance by scenario; ensure coverage of edge cases.
  - Split train/val/test with strict leakage prevention.
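
The deduplication and leakage-safe split can be sketched as follows. This is a minimal illustration, assuming each record is a dict with a `prompt` field (a hypothetical schema). It only catches duplicates that differ in case or whitespace; real near-duplicate detection (for example MinHash or embedding similarity) would need more than this.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def bucket(text: str) -> float:
    """Map normalized text to a stable value in [0, 1) using a hash, not a random draw."""
    digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 16**8


def dedupe_and_split(records, val_frac=0.1, test_frac=0.1):
    """Drop repeated prompts, then assign each unique prompt to exactly one split.

    Because the split depends only on the prompt's hash, the same prompt lands
    in the same split on every run and can never appear in two splits.
    """
    seen = set()
    splits = {"train": [], "val": [], "test": []}
    for rec in records:
        key = normalize(rec["prompt"])
        if key in seen:
            continue  # keep only the first occurrence
        seen.add(key)
        b = bucket(rec["prompt"])
        if b < test_frac:
            name = "test"
        elif b < test_frac + val_frac:
            name = "val"
        else:
            name = "train"
        splits[name].append(rec)
    return splits
```

Hash-based assignment keeps the split stable when new data is appended later, at the cost of split sizes that only approximate the requested fractions.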
- Choose training approach
  - SFT for instruction following and domain formatting.
  - LoRA/QLoRA for efficient fine-tuning (default for most cases).
  - DPO/Preference tuning when “style/quality preference” is the target.
  - Avoid fine-tuning when RAG or prompting solves it cheaper.
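
A quick bit of arithmetic shows why LoRA is the efficient default. LoRA freezes a weight matrix of shape `d_out x d_in` and trains two low-rank factors of shapes `d_out x r` and `r x d_in`, so the trainable parameter count per adapted matrix is `r * (d_in + d_out)`. The layer size and rank below are illustrative assumptions, not a recommendation.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in one dense weight matrix (bias ignored)."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one matrix: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_in + d_out)


# Illustrative sizes: one 4096 x 4096 projection adapted at rank 8.
full = full_param_count(4096, 4096)      # 16777216
lora = lora_param_count(4096, 4096, 8)   # 65536
ratio = lora / full                      # 0.00390625, i.e. about 0.39%
```

At these sizes the adapter trains well under 1% of the parameters in that matrix, which is where the memory and storage savings come from.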
- Training setup
  - Confirm tokenizer and model-family compatibility.
  - Hyperparameters: LR, batch size, sequence length, warmup, weight decay.
  - Checkpoints and resume strategy; deterministic seeds.
  - Track experiments (configs, metrics, artifacts).
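
The setup items above can be pinned down in a single config object that is easy to log with each run. This is a sketch using only the standard library. The model name is hypothetical, every default value is a placeholder to tune rather than a recommendation, and the seeding covers only Python's `random` module.

```python
import json
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainConfig:
    base_model: str = "example-org/base-model"  # hypothetical name
    learning_rate: float = 2e-4                 # placeholder value
    per_device_batch_size: int = 4
    grad_accum_steps: int = 8
    num_devices: int = 1
    max_seq_len: int = 2048
    warmup_ratio: float = 0.03
    weight_decay: float = 0.0
    seed: int = 42

    @property
    def effective_batch_size(self) -> int:
        """Examples contributing to each optimizer step."""
        return self.per_device_batch_size * self.grad_accum_steps * self.num_devices


def warmup_steps(cfg: TrainConfig, total_steps: int) -> int:
    """Convert the warmup ratio into a step count."""
    return round(cfg.warmup_ratio * total_steps)


def seed_everything(seed: int) -> None:
    """Seed Python's RNG. If NumPy or PyTorch is in use, seed those as well."""
    random.seed(seed)


def config_record(cfg: TrainConfig) -> str:
    """Serialize the config with sorted keys so runs can be diffed and tracked."""
    return json.dumps(asdict(cfg), sort_keys=True)


cfg = TrainConfig()
seed_everything(cfg.seed)
record = config_record(cfg)  # store next to checkpoints and metrics
```

Freezing the dataclass prevents a config from being mutated mid-run, so the logged record always matches what was actually trained.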
- Evaluation
  - Offline eval set: small but representative; include hard negatives.
  - Automatic metrics where meaningful; human eval for subjective qualities.
  - Regression tests: keep a fixed “golden set” across iterations.
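
A golden-set regression check might look like the sketch below. It assumes each golden example has an `id` and an `expected` string, that predictions are dicts keyed by `id`, and that exact match is a meaningful metric for the task. All three are assumptions, and many tasks will need a softer comparison.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Strict string equality after trimming surrounding whitespace."""
    return prediction.strip() == reference.strip()


def score(golden_set, predictions) -> float:
    """Fraction of golden examples answered exactly right."""
    hits = sum(exact_match(predictions[ex["id"]], ex["expected"]) for ex in golden_set)
    return hits / len(golden_set)


def regressions(golden_set, baseline_preds, candidate_preds):
    """IDs the baseline got right but the candidate now gets wrong."""
    return [
        ex["id"]
        for ex in golden_set
        if exact_match(baseline_preds[ex["id"]], ex["expected"])
        and not exact_match(candidate_preds[ex["id"]], ex["expected"])
    ]


# Tiny illustrative golden set.
golden = [
    {"id": "g1", "expected": "4"},
    {"id": "g2", "expected": "Paris"},
    {"id": "g3", "expected": "blue"},
]
baseline_preds = {"g1": "4", "g2": "Paris", "g3": "green"}
candidate_preds = {"g1": "4", "g2": "Lyon", "g3": "blue"}
broken = regressions(golden, baseline_preds, candidate_preds)
```

Both models score 2 out of 3 here, yet the candidate breaks `g2`. Listing per-example regressions surfaces that kind of change, which an aggregate score hides.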
- Safety & compliance
  - Filter sensitive data; define refusal policy and tests.
  - Measure unsafe outputs; create adversarial eval prompts.
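
Refusal behavior can be tracked with two numbers: how often the model refuses adversarial prompts (higher is better) and how often it refuses benign ones (lower is better). The sketch below uses a crude keyword heuristic with hypothetical marker phrases. A real pipeline would more likely use a trained classifier or human review.

```python
# Hypothetical marker phrases; a keyword list like this is only a rough proxy.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't assist")


def is_refusal(output: str) -> bool:
    """Heuristic: treat the output as a refusal if it contains a marker phrase."""
    text = output.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(outputs) -> float:
    """Share of outputs flagged as refusals; 0.0 for an empty list."""
    if not outputs:
        return 0.0
    return sum(is_refusal(o) for o in outputs) / len(outputs)


def safety_report(adversarial_outputs, benign_outputs) -> dict:
    """Summarize refusals on adversarial prompts and over-refusals on benign ones."""
    return {
        "adversarial_refusal_rate": refusal_rate(adversarial_outputs),
        "benign_over_refusal_rate": refusal_rate(benign_outputs),
    }


report = safety_report(
    adversarial_outputs=["I can't help with that.", "Sure, here is how to do it."],
    benign_outputs=["Paris is the capital of France.", "I cannot help with this."],
)
```

Reporting both rates together guards against a model that looks safe only because it refuses everything.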
- Deployment
  - Export adapters/merged weights; document inference requirements.
  - Quantization plan if needed; benchmark latency and throughput.
  - Monitor in production: quality signals, drift, safety incidents.
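
Latency and throughput benchmarking can start from a harness like the one below. It is a sketch that assumes a `generate(prompt)` callable wrapping whichever inference stack is in use (the callable is hypothetical). It sends requests one at a time, so it does not reflect the batched or concurrent throughput that servers such as vLLM or TGI can reach.

```python
import math
import time


def percentile(sorted_values, q: float):
    """Nearest-rank percentile for q in (0, 100]; expects a sorted, non-empty list."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def benchmark(generate, prompts, warmup: int = 2) -> dict:
    """Time sequential calls to `generate` and report p50/p95 latency and requests/s."""
    for prompt in prompts[:warmup]:
        generate(prompt)  # warm caches before timing

    latencies = []
    start = time.perf_counter()
    for prompt in prompts:
        t0 = time.perf_counter()
        generate(prompt)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    latencies.sort()
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "requests_per_s": len(prompts) / elapsed,
    }


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the harness."""
    return str(sum(range(1000))) + prompt


result = benchmark(fake_generate, [f"prompt {i}" for i in range(10)])
```

Run the same harness before and after quantization, with identical prompts, so that any latency change can be attributed to the quantization step.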
Outputs
- Data spec: sources, schema, labeling rules, splits.
- Training plan: method (SFT/LoRA/DPO), configs, compute estimate.
- Eval plan: datasets, metrics, sampling, acceptance thresholds.
- Deployment plan: packaging, quantization, benchmarks, monitoring.