Rubric Evaluation

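A complete walkthrough of the 400-point Invisible coding test rubric across four dimensions: code quality, functionality, test coverage, and documentation. It generates a structured scorecard that lists each sub-criterion's score with supporting evidence and proposes prioritized improvements.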

SKILL.md
--- frontmatter
name: Rubric Evaluation
description: Walk through the full 400-point Invisible coding test rubric evaluation across Code Quality, Functionality, Testing, and Documentation, producing a structured scorecard with per-sub-criterion scores, evidence, and prioritized improvements.
category: evaluation
version: 1.0.0
triggers:
  - rubric-evaluate-command
  - rubric-category-command
  - manual-invocation
globs: "**/*.go,**/*_test.go,**/README.md,**/INSTRUCTIONS.md,**/Makefile"

Rubric Evaluation Skill

Evaluate a Go project against the Invisible coding test rubric (4 categories, 400 total points).

Trigger Conditions

  • User runs /rubric-evaluate (full evaluation)
  • User runs /rubric-code-quality, /rubric-functionality, /rubric-testing, or /rubric-documentation (single category)
  • User runs /rubric-quick-score (automated-only quick score)
  • User asks to "evaluate", "score", or "check rubric"

Input Contract

  • Required: Access to the project root directory
  • Optional: Specific category to evaluate (defaults to all 4)
  • Optional: INSTRUCTIONS.md for requirements tracking
  • Optional: Previous scorecard for comparison

Output Contract

  • Structured scorecard with all 13 sub-criteria scored
  • Per-sub-criterion evidence (specific files, line numbers, examples)
  • Total score out of 400 with percentage
  • Prioritized improvement list ordered by (points gained / effort)
  • Comparison against previous score if available

Tool Permissions

  • Read: All source files, test files, README.md, INSTRUCTIONS.md, coverage reports, lint output
  • Execute: go test -coverprofile, golangci-lint run, gocyclo, go vet, wc, grep, find
  • Write: Scorecard output to stdout or .cursor/evaluations/
  • Search: File patterns, directory structures, code patterns

Execution Steps

Step 0: Setup

  1. Identify the project root (look for go.mod)
  2. Check for INSTRUCTIONS.md -- if present, parse requirements into a checklist
  3. Check for existing coverage reports (coverage.out)
  4. Check for existing lint configuration (.golangci.yml)
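
The first setup step can be sketched as a small shell helper. This is a minimal sketch, not part of the skill itself: the function name `find_project_root` is hypothetical, and it assumes the project root is the nearest ancestor directory that contains go.mod.

```bash
# find_project_root [start_dir]  (hypothetical helper)
# Walks up from start_dir (default: current directory) until a go.mod is found.
# Prints the directory containing go.mod, or returns 1 if none exists.
find_project_root() {
  local dir
  dir="$(cd "${1:-.}" && pwd)" || return 1
  while :; do
    if [ -f "$dir/go.mod" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    # Stop at the filesystem root instead of looping forever.
    [ "$dir" = "/" ] && return 1
    dir="$(dirname "$dir")"
  done
}
```

Running `cd "$(find_project_root)"` before the later steps keeps the relative paths in their commands (coverage.out, README.md, internal/) pointing at the right place.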

Step 1: Code Quality Assessment (100 points)

1a. Organization (25 points)

Run these checks:

bash
# Check for layered architecture
ls -d cmd/ internal/ 2>/dev/null
ls -d internal/handler/ internal/service/ internal/repository/ internal/models/ 2>/dev/null
# Count distinct package names (grep -h drops file names so identical declarations collapse)
find . -name "*.go" -not -path "./vendor/*" -exec grep -h "^package " {} + | sort -u | wc -l

Scoring:

  • 25: cmd/, internal/ with handler, service, repository, models, middleware, config subdirectories
  • 20: Layered but missing 1-2 expected directories
  • 15: Some separation but mixed concerns (e.g., handlers and services in same package)
  • 10: Minimal organization, most code in 2-3 files
  • 5: Single file or flat structure
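
The top two Organization bands depend on how many expected directories are missing, which is easy to count mechanically. The helper below is a sketch under stated assumptions: the name `missing_layers` is hypothetical, and the directory list is taken from the 25-point band above, with every subdirectory assumed to live under internal/.

```bash
# missing_layers [root]  (hypothetical helper)
# Prints each expected directory that is absent under root, one per line.
missing_layers() {
  local root="${1:-.}" dir
  for dir in cmd internal/handler internal/service internal/repository \
             internal/models internal/middleware internal/config; do
    [ -d "$root/$dir" ] || printf '%s\n' "$dir"
  done
}
```

An empty result corresponds to the 25-point band and one or two missing entries to the 20-point band. The lower bands still need a manual look at how concerns are mixed.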

1b. Naming (25 points)

Run these checks:

bash
# Count naming findings reported by revive (for example, its var-naming rule); flags follow golangci-lint v1 syntax
golangci-lint run --enable=revive --disable-all ./... 2>&1 | grep -c "naming"
# Check for abbreviations in exported function signatures (heuristic: the trailing group stops Err from matching Error, Str from matching String, and so on)
grep -rn "func [A-Z]" --include="*.go" | grep -E "(Mgr|Svc|Repo|Cfg|Ctx|Req|Resp|Msg|Err|Val|Num|Str|Int|Buf|Arr|Lst)([^a-z]|$)" | wc -l
# Check for descriptive test names
grep -rn "func Test" --include="*_test.go" | head -20

Scoring:

  • 25: All names descriptive, consistent Go conventions, no abbreviations, test names follow TestX_Y_Z pattern
  • 20: 1-2 unclear or abbreviated names
  • 15: Several naming inconsistencies
  • 10: Many abbreviations or unclear names
  • 5: Naming is poor throughout

1c. Readability (25 points)

Run these checks:

bash
# Cyclomatic complexity
gocyclo -over 10 . 2>/dev/null | wc -l
gocyclo -over 15 . 2>/dev/null | wc -l
# Total function count (baseline for the 40-line limit; this command does not measure function length)
grep -rn "^func " --include="*.go" | wc -l

Scoring:

  • 25: All functions < 40 lines, no cyclomatic complexity > 10, clear control flow
  • 20: 1-2 functions exceed limits, but well-commented
  • 15: Several long/complex functions, moderate comments
  • 10: Many long functions, insufficient comments
  • 5: Code is difficult to follow
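
The 40-line limit in the bands above can be checked with a short awk pass. This is a heuristic sketch rather than a parser: the helper name `long_funcs` is hypothetical, and it assumes gofmt-formatted code where a function starts on a line beginning with `func ` and ends at the next line beginning with `}`.

```bash
# long_funcs MAX FILE...  (hypothetical helper)
# Lists functions longer than MAX lines, counting from the "func" line
# through the closing brace. Assumes gofmt-formatted input.
long_funcs() {
  local max="$1"
  shift
  awk -v max="$max" '
    FNR == 1 { start = 0 }                  # reset state for each file
    /^func / {
      start = FNR; sig = $0
      if ($0 ~ /}[ \t]*$/) start = 0        # one-line function body
      next
    }
    /^}/ && start {
      n = FNR - start + 1
      if (n > max) printf "%s:%d: %d lines: %s\n", FILENAME, start, n, sig
      start = 0
    }
  ' "$@"
}
```

A possible invocation is `long_funcs 40 $(find . -name "*.go" -not -path "./vendor/*")`, which assumes file paths contain no spaces. Always pass at least one file, since awk reads standard input when given none.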

1d. Best Practices (25 points)

Run these checks:

bash
# Full lint check
golangci-lint run ./... 2>&1 | tail -1
# Count issues
golangci-lint run ./... 2>&1 | grep -c "^.*\.go:"
# Check for float64 with money
grep -rn "float64" --include="*.go" | grep -i "amount\|balance\|price\|money\|total" | wc -l
# Check for proper error wrapping
grep -rn "fmt.Errorf" --include="*.go" | grep -c "%w"

Scoring:

  • 25: 0 lint issues, idiomatic Go, no anti-patterns, decimal for money, proper error wrapping
  • 20: 1-5 lint issues, minor style inconsistencies
  • 15: 6-15 lint issues, some anti-patterns
  • 10: 16-30 lint issues, multiple anti-patterns
  • 5: 31+ lint issues or critical anti-patterns (float64 for money)
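
The lint-count part of these bands maps directly to a lookup. The sketch below uses a hypothetical helper name, `lint_score`, and assumes that exactly 30 issues falls in the 16-30 band. It covers only the issue count; a critical anti-pattern such as float64 for money would still cap the score at 5 regardless of the count.

```bash
# lint_score ISSUE_COUNT  (hypothetical helper)
# Maps a golangci-lint issue count to the Best Practices band.
lint_score() {
  local issues="$1"
  if   [ "$issues" -eq 0 ];  then echo 25
  elif [ "$issues" -le 5 ];  then echo 20
  elif [ "$issues" -le 15 ]; then echo 15
  elif [ "$issues" -le 30 ]; then echo 10
  else echo 5
  fi
}
```

For example, `lint_score 7` prints 15.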

Step 2: Functionality Assessment (100 points)

2a. Requirements Completion (40 points)

  1. Parse INSTRUCTIONS.md for explicit requirements
  2. For each requirement, search the codebase for implementation evidence
  3. Score = (implemented_count / total_requirements) * 40
bash
# Count API endpoints registered on a variable named router
grep -rn "router\.\(GET\|POST\|PUT\|DELETE\|PATCH\)" --include="*.go" | wc -l
# Or match any receiver name (e.g., a Gin route group), excluding tests
grep -rn "\.GET\|\.POST\|\.PUT\|\.DELETE\|\.PATCH" --include="*.go" | grep -v "_test.go" | wc -l
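
The Requirements Completion formula can be worked through in shell arithmetic. This is a sketch with a hypothetical helper name, `requirements_score`; rounding to the nearest whole point is an assumption, since the rubric does not say how fractions are handled.

```bash
# requirements_score IMPLEMENTED TOTAL  (hypothetical helper)
# Computes (implemented / total) * 40, rounded to the nearest integer.
requirements_score() {
  local implemented="$1" total="$2"
  if [ "$total" -le 0 ]; then
    echo "total requirements must be positive" >&2
    return 1
  fi
  # Adding total/2 before dividing rounds to nearest instead of truncating.
  echo $(( (implemented * 40 + total / 2) / total ))
}
```

With 7 of 10 requirements implemented, `requirements_score 7 10` prints 28.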

2b. Edge Cases (30 points)

Check for these patterns:

bash
# Input validation patterns
grep -rn "validate\|Validate\|binding:" --include="*.go" | grep -v "_test.go" | wc -l
# Error return patterns
grep -rn "return.*err\|return.*Error\|return.*error" --include="*.go" | grep -v "_test.go" | wc -l
# Nil checks
grep -rn "== nil\|!= nil" --include="*.go" | grep -v "_test.go" | wc -l
# Idempotency key handling
grep -rn -i "idempoten" --include="*.go" | wc -l
# Negative amount checks
grep -rn "LessThan\|GreaterThan\|IsNegative\|IsZero\|Sign()" --include="*.go" | wc -l

Scoring:

  • 30: Validation on all endpoints, custom error types, boundary checks, idempotency, negative amount guards
  • 20: Most inputs validated, main error paths handled, some boundary checks
  • 10: Minimal validation, basic error handling
  • 0: No validation or error handling

2c. Performance (30 points)

Check for:

bash
# Database indexing
grep -rn "Index\|index\|INDEX\|uniqueIndex" --include="*.go" | grep -v "_test.go" | wc -l
# Pagination
grep -rn -i "limit\|offset\|page\|per_page\|cursor" --include="*.go" | grep -v "_test.go" | wc -l
# Connection pooling
grep -rn "SetMaxOpenConns\|SetMaxIdleConns\|pool" --include="*.go" | wc -l
# N+1 potential (loops with DB calls)
grep -rn "for.*range" --include="*.go" -A5 | grep -c "Find\|First\|Where\|Query"
# SELECT * usage (should be avoided)
grep -rn 'SELECT \*' --include="*.go" | wc -l

Scoring:

  • 30: Indexed columns, pagination, connection pooling, no N+1 patterns, no SELECT *
  • 20: Most performance concerns addressed, 1-2 minor issues
  • 10: Some attention to performance
  • 0: Obvious performance problems

Step 3: Testing Assessment (100 points)

3a. Coverage (40 points)

Run:

bash
# Generate coverage
go test -coverprofile=coverage.out ./... 2>&1
# Extract total percentage
go tool cover -func=coverage.out | tail -1
# Per-package coverage
go test -cover ./... 2>&1 | grep "coverage:"

Scoring:

  • 40: 80%+ total coverage
  • 30: 60-79% coverage
  • 20: 40-59% coverage
  • 10: 20-39% coverage
  • 5: < 20% coverage
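
The coverage bands translate into a simple threshold lookup. The helper below is a sketch with a hypothetical name, `coverage_score`; it assumes fractional values such as 79.9 fall into the lower band, because the rubric lists whole-number ranges only.

```bash
# coverage_score PERCENT  (hypothetical helper)
# Maps a total coverage percentage (with or without a trailing %) to its band.
coverage_score() {
  # Strip a trailing % so "85.3%" from go tool cover can be passed directly.
  local pct="${1%"%"}"
  awk -v p="$pct" 'BEGIN {
    if      (p >= 80) print 40
    else if (p >= 60) print 30
    else if (p >= 40) print 20
    else if (p >= 20) print 10
    else              print 5
  }'
}
```

One way to feed it is `coverage_score "$(go tool cover -func=coverage.out | awk '/^total:/ {print $NF}')"`, which takes the last field of the total line.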

3b. Quality (30 points)

Check:

bash
# Table-driven tests
grep -rn "tests := \[\]struct\|testCases := \[\]struct\|tt := \[\]struct" --include="*_test.go" | wc -l
# Subtests
grep -rn "t\.Run(" --include="*_test.go" | wc -l
# Negative test cases
grep -rn "Error\|Fail\|Invalid\|NotFound\|BadRequest\|Unauthorized" --include="*_test.go" | wc -l
# Assertion count
grep -rn "assert\.\|require\.\|if.*!=" --include="*_test.go" | wc -l
# Test helper functions
grep -rn "func.*testing\.T\|func.*testing\.B\|t\.Helper()" --include="*_test.go" | wc -l

Scoring:

  • 30: Table-driven tests throughout, meaningful assertions, negative + edge cases, subtests, helpers
  • 20: Some table-driven tests, good assertions, main paths tested
  • 10: Basic assertions, mostly happy path
  • 0: Trivial or no meaningful tests

3c. Organization (30 points)

Check:

bash
# Test file count
find . -name "*_test.go" -not -path "./vendor/*" | wc -l
# Test directories
find . -type d \( -name "tests" -o -name "testdata" -o -name "fixtures" \) | wc -l
# Test helper files
find . -name "*helper*" -o -name "*fixture*" -o -name "*factory*" | grep -v vendor | wc -l
# Integration vs unit test separation
find . -name "*_integration_test.go" -o -name "*_e2e_test.go" | wc -l

Scoring:

  • 30: Separate test dirs, helper/fixture packages, unit + integration + e2e, clear naming convention
  • 20: Tests alongside code, some helpers, good naming
  • 10: Tests exist but disorganized, no helpers
  • 0: No test organization at all

Step 4: Documentation Assessment (100 points)

4a. README Quality (40 points)

Check:

bash
# README exists
test -f README.md && echo "exists" || echo "missing"
# Required sections
for section in "Description" "Prerequisites" "Setup" "Install" "Build" "Run" "Test" "API" "Usage" "Example"; do
  grep -qi "$section" README.md 2>/dev/null && echo "FOUND: $section" || echo "MISSING: $section"
done
# Section count (prints 0 when README.md is missing or has no ## headings)
grep "^##" README.md 2>/dev/null | wc -l
# Word count
wc -w README.md 2>/dev/null

Scoring:

  • 40: 8+ sections, clear setup/build/run/test instructions, API documentation, examples, 500+ words
  • 30: 5-7 sections, most instructions present, some examples
  • 20: 3-4 sections, basic instructions
  • 10: Minimal README (just title and description)
  • 0: No README

4b. Code Documentation (30 points)

Check:

bash
# Count exported functions
grep -rn "^func [A-Z]" --include="*.go" | grep -v "_test.go" | wc -l
# Count documented exported functions (the -B1 context line before func starts with //; methods with receivers are not matched)
grep -rn -B1 "^func [A-Z]" --include="*.go" | grep -v "_test.go" | grep -c "\.go-[0-9]*-//"
# Exported types
grep -rn "^type [A-Z]" --include="*.go" | grep -v "_test.go" | wc -l
# Documented exported types (same context-line check)
grep -rn -B1 "^type [A-Z]" --include="*.go" | grep -v "_test.go" | grep -c "\.go-[0-9]*-//"

Scoring:

  • 30: 90%+ exported items documented with godoc, complex logic commented, API docs generated
  • 20: 60-89% documented
  • 10: 30-59% documented
  • 0: < 30% documented
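
The documented and total counts from the checks above can be combined into a percentage and mapped to these bands. This is a sketch with a hypothetical helper name, `doc_coverage_score`. It uses integer division, so 89.9% counts as 89, and it treats a project with no exported items as an error rather than guessing a score.

```bash
# doc_coverage_score DOCUMENTED TOTAL  (hypothetical helper)
# Maps the share of documented exported items to the Code Documentation band.
doc_coverage_score() {
  local documented="$1" total="$2" pct
  if [ "$total" -le 0 ]; then
    echo "no exported items to evaluate" >&2
    return 1
  fi
  pct=$(( documented * 100 / total ))
  if   [ "$pct" -ge 90 ]; then echo 30
  elif [ "$pct" -ge 60 ]; then echo 20
  elif [ "$pct" -ge 30 ]; then echo 10
  else echo 0
  fi
}
```

Pass the sum of documented functions and types as the first argument and the sum of all exported functions and types as the second. The count covers godoc comments only; commented complex logic and generated API docs in the 30-point band still need manual review.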

4c. Design Decisions (30 points)

Check:

bash
# ADR files
find . -name "ADR*" -o -name "adr*" -o -name "DECISIONS*" | wc -l
# Design section in README
grep -qi "design\|architecture\|decisions\|trade.off\|limitations" README.md 2>/dev/null && echo "found" || echo "missing"
# Inline design comments
grep -rn "// Design:\|// Architecture:\|// Trade-off:\|// Why:" --include="*.go" | wc -l

Scoring:

  • 30: Dedicated design decisions section/doc, architecture explanation, tech choice rationale, trade-offs, limitations, "what I'd improve" section
  • 20: Some design explanation in README or comments
  • 10: Minimal design rationale
  • 0: No design documentation

Step 5: Produce Scorecard

Compile all scores into the output format:

markdown
## Rubric Scorecard -- [Date]

### Code Quality: XX/100
| Sub-Criteria | Score | Evidence |
|---|---|---|
| Organization | XX/25 | [directories found] |
| Naming | XX/25 | [naming issues found] |
| Readability | XX/25 | [complexity metrics] |
| Best Practices | XX/25 | [lint issue count] |

### Functionality: XX/100
| Sub-Criteria | Score | Evidence |
|---|---|---|
| Requirements | XX/40 | [X/Y implemented] |
| Edge Cases | XX/30 | [validation patterns found] |
| Performance | XX/30 | [performance patterns found] |

### Testing: XX/100
| Sub-Criteria | Score | Evidence |
|---|---|---|
| Coverage | XX/40 | [XX% coverage] |
| Quality | XX/30 | [table-driven count, assertion count] |
| Organization | XX/30 | [test file count, helper count] |

### Documentation: XX/100
| Sub-Criteria | Score | Evidence |
|---|---|---|
| README Quality | XX/40 | [section count, word count] |
| Code Docs | XX/30 | [XX% documented] |
| Design Decisions | XX/30 | [design docs found] |

---

### TOTAL: XXX/400 (XX%)

### Grade
- 360-400: Exceptional (90-100%)
- 320-359: Strong (80-89%)
- 280-319: Good (70-79%)
- 240-279: Adequate (60-69%)
- 200-239: Needs Improvement (50-59%)
- < 200: Below Expectations (< 50%)

### Top Priority Improvements
| # | Action | Estimated Points | Effort | Category |
|---|---|---|---|---|
| 1 | [action] | +XX | Low/Med/High | [category] |
| 2 | [action] | +XX | Low/Med/High | [category] |
| 3 | [action] | +XX | Low/Med/High | [category] |
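
The TOTAL and Grade lines of the scorecard can be derived from the four category scores. The sketch below uses a hypothetical helper name, `grade`, and truncates the percentage to a whole number, which is an assumption about how the scorecard should display it.

```bash
# grade TOTAL  (hypothetical helper)
# Prints the total out of 400, its percentage, and the grade label.
grade() {
  local total="$1" label
  if   [ "$total" -ge 360 ]; then label="Exceptional"
  elif [ "$total" -ge 320 ]; then label="Strong"
  elif [ "$total" -ge 280 ]; then label="Good"
  elif [ "$total" -ge 240 ]; then label="Adequate"
  elif [ "$total" -ge 200 ]; then label="Needs Improvement"
  else label="Below Expectations"
  fi
  printf '%d/400 (%d%%): %s\n' "$total" $(( total * 100 / 400 )) "$label"
}
```

For category scores of 85, 80, 90, and 70, `grade $(( 85 + 80 + 90 + 70 ))` prints `325/400 (81%): Strong`.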

Success Criteria

  • All 13 sub-criteria scored with numeric value and evidence
  • Total score calculated correctly
  • At least 3 prioritized improvements identified
  • Improvements ordered by points-per-effort ratio
  • If INSTRUCTIONS.md exists, requirements tracked individually
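
Ordering improvements by points-per-effort needs a numeric effort value. The sketch below assumes weights of Low = 1, Med = 2, and High = 3; those weights, the helper name `prioritize`, and the `action|points|effort` input format are all illustrative choices, not part of the rubric.

```bash
# prioritize  (hypothetical helper)
# Reads "action|points|effort" lines on stdin and prints them ordered by
# points divided by effort weight, highest first. Lines whose effort is not
# Low, Med, or High are dropped.
prioritize() {
  awk -F'|' '
    BEGIN { w["Low"] = 1; w["Med"] = 2; w["High"] = 3 }
    # Prefix each line with an integer ratio (scaled by 100) as the sort key.
    ($3 in w) { printf "%d|%s\n", $2 * 100 / w[$3], $0 }
  ' | sort -s -t'|' -k1,1nr | cut -d'|' -f2-
}
```

For instance, piping `Add ADR|10|Med`, `Write README|40|Low`, and `Raise coverage|20|High` through `prioritize` puts the README first (40 points per unit of effort), then coverage (about 6.7), then the ADR (5).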

Escalation Rules

  • Score < 200/400: Escalate urgently -- recommend focusing on highest-value items
  • Any 40-point sub-criterion scoring 0: Critical flag
  • Test suite fails to compile: Block further evaluation, fix tests first
  • No README: Immediate action item (40 easy points at stake)