review-environments

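Reviews verifiers environments for correctness, robustness, and ecosystem compatibility. Use this skill when asked to review environment code, run a quality audit, validate a migration, or check release readiness for local environments or environments pulled from the Hub.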

SKILL.md
---
name: review-environments
description: Review verifiers environments for correctness, robustness, and ecosystem compatibility. Use when asked for environment code review, quality audit, migration validation, or release readiness checks for local environments or environments pulled from the Hub.
---

Review Environments

Goal

Find correctness risks and regressions first, then assess maintainability and ecosystem compliance.

Review Input Modes

  1. Local environment module in ./environments/<env_name>.
  2. Pulled Hub environment via prime env pull owner/name.
  3. Installed package under active workspace.

Review Workflow

  1. Identify environment contract:
  • load_environment(...)
  • base class and rollout behavior
  • rubric and metrics
  2. Verify installability and runtime entrypoint:
bash
prime env install <env>
prime eval run <env> -m gpt-4.1-mini -n 5
  3. Trace reward pipeline and validate scoring semantics.
  4. Run targeted checks for tool/stateful behavior where applicable.
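
The first workflow step can be partly automated. The sketch below is a minimal, hypothetical helper (the name `find_load_environment` is an assumption, not part of verifiers or the prime CLI) that statically parses an environment module and reports whether it defines a top-level `load_environment` function and which parameters it accepts. It only inspects source text, so it complements the install and eval run in step 2 rather than replacing them.

```python
import ast


def find_load_environment(source: str):
    """Return the parameter names of a top-level load_environment, or None.

    Hypothetical review helper: it parses source text and never imports it.
    """
    tree = ast.parse(source)
    for node in tree.body:
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_func and node.name == "load_environment":
            args = node.args
            names = [a.arg for a in args.posonlyargs + args.args + args.kwonlyargs]
            if args.vararg:
                names.append("*" + args.vararg.arg)
            if args.kwarg:
                names.append("**" + args.kwarg.arg)
            return names
    return None


# Illustrative module text; a real review would read the file under
# ./environments/<env_name> instead.
example = '''
def load_environment(num_examples=100, **kwargs):
    return None
'''
print(find_load_environment(example))
```

A `None` result means the module has no top-level `load_environment`, which is worth reporting before any deeper review.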

Endpoint And Model Selection Nudge

  1. Encourage endpoint alias setup in configs/endpoints.toml for reproducible review runs.
  2. Ask whether review coverage should prioritize instruct or reasoning behavior.
  3. Instruct go-tos: gpt-4.1 series, qwen3 instruct series.
  4. Reasoning go-tos: gpt-5 series, qwen3 thinking series, glm series.

Critical Review Criteria

  1. Reward correctness:
  • Prefer deterministic, explicit checks or LLM judges.
  • Flag best-effort keyword or style heuristics unless explicitly approved.
  2. Environment self-containment:
  • Flag any requirement for user-managed background services before load_environment().
  • Require environment-managed lifecycle for sandboxes/sessions.
  3. Migration fidelity:
  • For ports, verify one-to-one equivalence of prompts, tool traces, and scoring logic.
  • Flag any assumptions made without user decision.
  4. Secrets handling:
  • Ensure required keys are validated in load_environment() with vf.ensure_keys(...).
  5. Performance and scaling:
  • Identify obvious bottlenecks in dataset loading, rubric calls, or tool execution.
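
To illustrate the reward-correctness criterion, the sketch below contrasts a deterministic exact-match check with the kind of keyword heuristic a review should flag. All three functions are hypothetical plain-Python examples, not verifiers APIs, and the `\boxed{...}` answer format is an assumption; a real rubric function would take whatever arguments the environment's rubric passes.

```python
import re


def extract_boxed(text: str):
    """Return the content of the last \\boxed{...} in text, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def exact_answer_reward(completion: str, answer: str) -> float:
    """Deterministic: 1.0 only when the extracted answer equals the reference."""
    parsed = extract_boxed(completion)
    return 1.0 if parsed is not None and parsed == answer.strip() else 0.0


def keyword_reward(completion: str, answer: str) -> float:
    """Heuristic to flag: rewards any mention of the answer string."""
    return 1.0 if answer in completion else 0.0


# A wrong completion that still contains the reference answer as a substring.
wrong = r"It is not 4; the result is \boxed{14}."
print(exact_answer_reward(wrong, "4"), keyword_reward(wrong, "4"))
```

The keyword version gives full reward to a wrong completion, which is the failure mode this criterion is meant to catch.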

Findings Format

Return findings first, sorted by severity:

  1. P0/P1 bugs and behavioral mismatches.
  2. P2 quality risks and maintainability issues.
  3. Test gaps and missing eval coverage.

For each finding, include the file path, exact lines, impact, and a concrete fix direction.
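
One way to keep findings in this order is to record them as structured data and sort before rendering. The sketch below is a minimal illustration: the `Finding` class, the `"gap"` label for coverage gaps, and the example paths and messages are all hypothetical and not part of any tool named in this skill.

```python
from dataclasses import dataclass

# Lower rank sorts first; "gap" stands for test gaps and missing eval coverage.
SEVERITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "gap": 3}


@dataclass
class Finding:
    severity: str  # one of the keys in SEVERITY_ORDER
    path: str
    line: int
    impact: str
    fix: str


def format_findings(findings):
    """Sort by severity, then path and line, and render one line per finding."""
    ordered = sorted(
        findings, key=lambda f: (SEVERITY_ORDER[f.severity], f.path, f.line)
    )
    return [
        f"[{f.severity}] {f.path}:{f.line} {f.impact} Fix: {f.fix}" for f in ordered
    ]


# Illustrative findings only; paths and messages are made up for the example.
report = format_findings([
    Finding("gap", "environments/my_env/my_env.py", 1,
            "No eval run covers tool errors.", "Add a failing-tool case."),
    Finding("P2", "environments/my_env/my_env.py", 40,
            "Dataset is reloaded on every call.", "Load it once."),
    Finding("P0", "environments/my_env/my_env.py", 88,
            "Reward matches substrings.", "Compare parsed answers exactly."),
])
for line in report:
    print(line)
```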

If No Findings

State explicitly that no defects were found, then list residual risk and untested areas.