create-semgrep-rule

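Create custom Semgrep rules for vulnerability detection. Use when writing new rules for specific vulnerability patterns, creating organization-specific detections, or building rules for novel attack vectors found by bug bounty hunters.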

SKILL.md
--- frontmatter
name: create-semgrep-rule
description: Create custom Semgrep rules for vulnerability detection. Use when writing new rules for specific vulnerability patterns, creating org-specific detections, or building rules for novel attack vectors discovered during bug bounty hunting.

Create Custom Semgrep Rules

Expert workflow for creating high-quality, low-false-positive Semgrep rules for security vulnerability detection.

When to Create Custom Rules

Create custom rules for:

  • Novel vulnerability patterns not covered by p/default or existing custom rules
  • Org-specific code patterns (custom frameworks, internal APIs, coding conventions)
  • Chained vulnerabilities requiring multi-step detection
  • Language/framework-specific bugs (e.g., PHP parse_url bypass, Go unsafe patterns)
  • High-value targets warranting deeper, targeted analysis
  • CVE variant hunting - Finding the same vulnerable pattern in other codebases

CVE-to-Rule Workflow

When creating rules from CVEs, the goal is to find the underlying vulnerable code pattern in OTHER codebases - NOT to detect the vulnerable library (SCA tools like Dependabot/Snyk do that better).

Anti-Pattern: SCA-Style Detection (DON'T DO THIS)

yaml
# WRONG - This is SCA work, not pattern detection
# Dependabot/Snyk already do this, and do it better
pattern-either:
  - pattern: require("loader-utils").parseQuery(...)
  - pattern: import { parseQuery } from "loader-utils"
  - pattern: require("vulnerable-package")

This approach:

  • Duplicates what SCA tools already do
  • Only finds the specific library, not the pattern
  • Misses the same vulnerability in custom code
  • Provides no value for bug bounty hunting

Correct Approach: Pattern Detection

Step 1: Fetch and analyze the fix commit

bash
# Get the patch diff
curl -s https://github.com/org/repo/commit/abc123.patch

Ask yourself:

  • What was the root cause of the vulnerability?
  • What code pattern made it exploitable?
  • How did the fix address the root cause?
  • What would this pattern look like in custom code?

Step 2: Abstract the pattern

The key question: "If a developer wrote similar functionality from scratch, what would the vulnerable version look like?"

Don't think about the library. Think about the category of code that has this problem.

Step 3: Create a library-agnostic rule

The rule should find the SAME MISTAKE anywhere, not just in the specific library.

Example: CVE-2022-37601 (loader-utils Prototype Pollution)

Fix commit analysis:

javascript
// BEFORE (vulnerable)
const result = {};           // Has prototype chain
result[key] = value;         // key could be "__proto__"

// AFTER (fixed)
const result = Object.create(null);  // No prototype chain
result[key] = value;                 // "__proto__" is just a regular key

Root cause: Query string parsing into {} with unsanitized dynamic keys.

Abstracted pattern: Any code that:

  1. Creates an object with {} (not Object.create(null))
  2. Assigns properties using dynamic/user-controlled keys
  3. Doesn't validate against __proto__, constructor, prototype

Rule focus: Find custom query parsers, config loaders, merge utilities, or any key-value processing with this antipattern.

What to detect:

javascript
// DETECT: Custom query parser with same vulnerability
function parseConfig(input) {
  const config = {};                    // Vulnerable: has prototype
  for (const [key, val] of Object.entries(input)) {
    config[key] = val;                  // Unsanitized key assignment
  }
  return config;
}

// DETECT: Custom merge/extend function
function merge(target, source) {
  for (const key in source) {
    target[key] = source[key];          // Prototype pollution sink
  }
}

What NOT to detect:

javascript
// SKIP: Using the library (SCA handles this)
const { parseQuery } = require("loader-utils");

// SKIP: Already using safe pattern
const result = Object.create(null);
result[key] = value;

// SKIP: Has prototype pollution guard
if (key === "__proto__" || key === "constructor") continue;

CVE-to-Rule Checklist

Before writing the rule, verify:

Check | Question
Root cause identified | What code pattern caused the vulnerability?
Pattern abstracted | Would I find this in custom code, not just the library?
Not SCA | Am I detecting a pattern, not a library import?
Realistic matches | Will this find bugs in real-world code?
Low FP rate | Are there clear safe patterns to exclude?

Common CVE Pattern Categories

CVE Type | Root Cause Pattern | Rule Focus
Prototype Pollution | obj[userKey] = val on {} | Custom parsers, merge functions
Template Injection | User input in template options | Custom template rendering
Command Injection | String concat to shell exec | Custom exec wrappers
Path Traversal | User input in file paths | Custom file handlers
SSRF | User input in URL construction | Custom HTTP clients
Deserialization | Untrusted data to deserializer | Custom data loaders

Rule Broadness: When Patterns Are Too Generic

Some vulnerability patterns are too common to detect without drowning in false positives. Before writing a rule, assess whether it will produce signal or noise.

Pattern Frequency Spectrum

Signal Level | Pattern Type | Example | Approach
HIGH | Rare sink + user input | res.render(tpl, req.query) | Direct detection, HIGH confidence
MEDIUM | Common pattern + specific context | obj[key] = val in loops | Audit rule, MEDIUM confidence
LOW | Ubiquitous pattern | obj[key] = val anywhere | Skip or sink-focused only

Example: Prototype Pollution

Too broad (produces noise):

yaml
# This matches almost every JS file
pattern: $OBJ[$KEY] = $VALUE

Specific enough (produces signal):

yaml
# Recursive descent pattern - characteristic of vulnerable merge functions
patterns:
  - pattern: $SMTH = $SMTH[$A]
  - pattern-inside: |
      for (...) { ... }

Sink-focused (best signal):

yaml
# Detect where pollution becomes exploitable
pattern-sinks:
  - pattern: res.render($T, $OPTS)  # Template options = RCE
  - pattern: spawn($CMD, $ARGS, $OPTS)  # child_process options

When to Use Audit vs Vuln Rules

Rule Type | Confidence | Use Case
subcategory: vuln | HIGH | Rare pattern, clear exploit, few FPs
subcategory: audit | LOW-MEDIUM | Common pattern, needs manual review

If you can't achieve HIGH confidence, mark the rule as audit with LOW confidence. The official Semgrep registry does this for prototype pollution:

yaml
metadata:
  subcategory: audit
  confidence: LOW
  likelihood: LOW

Sink-Focused vs Pattern-Focused Rules

When a vulnerability pattern is too common to detect directly, focus on the sinks where it becomes exploitable:

Vulnerability | Pattern-Focused (noisy) | Sink-Focused (high signal)
Prototype Pollution | obj[key] = val | Template options, child_process options
XSS | String concatenation | innerHTML, document.write
SQLi | String + variable | cursor.execute, ORM raw queries

Rule of thumb: If the source pattern is ubiquitous, detect at the sink instead.

Project Structure

code
custom-rules/
├── 0xdea-semgrep-rules/     # Third-party: Memory safety, C/C++ vulns
├── open-semgrep-rules/      # Third-party: Multi-language security rules
├── web-vulns/               # Web-specific injection rules
└── custom/                  # YOUR custom rules
    ├── org-specific/        # Rules targeting specific organizations
    │   └── <org-name>/      # Per-org rule directories
    └── novel-vulns/         # Novel vulnerability patterns

CRITICAL: Rule Quality Standards

Custom rules must meet these standards before use:

  • LOW false positive rate - Every FP wastes time; add exclusions aggressively
  • Clear security impact - Rule must detect exploitable vulnerabilities, not code smells
  • Tested against real code - Validate on target repos before adding to pipeline
  • Complete metadata - CWE, severity, confidence, references
  • Path exclusions for performance - Exclude bundled/minified files to prevent timeouts

CRITICAL: Path Exclusions for Performance

Taint mode rules are computationally expensive and can time out on large bundled or minified files. Always add path exclusions to your rules.

Required Path Exclusions

Add this paths block to EVERY rule (especially taint mode):

yaml
rules:
  - id: my-taint-rule
    mode: taint
    paths:
      exclude:
        # Package managers
        - "**/node_modules/**"
        - "**/vendor/**"
        # Build output
        - "**/dist/**"
        - "**/build/**"
        # Minified/bundled files (specific patterns only)
        - "**/*.min.js"
        - "**/*.min.mjs"
        - "**/*.bundle.js"
        - "**/*.chunk.js"
        - "**/*.chunk.mjs"
        - "**/*-init.mjs"
        # NOTE: Do NOT use broad patterns like "**/js/*.js" or "**/assets/**"
        # as they exclude legitimate source files in some repos
    # ... rest of rule

Why This Matters

File Type | Typical Size | Taint Mode Behavior
Source file | 1-50 KB | Fast analysis
Bundled JS | 100KB-2MB | TIMEOUT (per-rule, per-file limit set by --timeout)
Minified JS | 50KB-500KB | TIMEOUT or very slow

Real example: A 588KB Vite bundle (viewer-init.mjs) caused 3 timeout errors and blocked rule execution until path exclusions were added.
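
To triage candidates for exclusion before running a taint rule, a rough heuristic like the following can help. This is a minimal sketch: the function name and both thresholds are illustrative assumptions (the 100 KB limit loosely follows the table above, the average-line-length limit is arbitrary), not anything Semgrep itself applies.

```python
def looks_bundled(text: str, size_limit: int = 100_000, avg_line_limit: int = 500) -> bool:
    """Heuristic only: flag very large files or files whose lines are very long on average.

    size_limit and avg_line_limit are hypothetical thresholds; tune them per repo.
    """
    # Large files are the usual cause of taint-mode timeouts.
    if len(text.encode("utf-8")) >= size_limit:
        return True
    # Minified code tends to pack everything onto a handful of very long lines.
    lines = text.splitlines() or [""]
    avg_line_length = sum(len(line) for line in lines) / len(lines)
    return avg_line_length >= avg_line_limit
```

Files flagged this way are candidates for a paths.exclude entry; confirm by inspecting them before excluding anything that might be real source.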

Signs You Need More Exclusions

When running your rule, watch for:

code
Warning: 3 timeout error(s) in path/to/file.mjs when running rules...
Semgrep stopped running rules on path/to/file.mjs after 3 timeout error(s).

Add the problematic file pattern to your paths.exclude list.
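
When a scan produces many such warnings, collecting the affected paths can be scripted. The sketch below assumes the warning text has exactly the shape quoted above; the wording may differ between Semgrep versions, and the function name is hypothetical.

```python
import re

# Assumes the warning format quoted above; adjust if your Semgrep version words it differently.
TIMEOUT_WARNING = re.compile(r"Warning: (\d+) timeout error\(s\) in (\S+) when running")


def timed_out_files(log_text: str) -> dict[str, int]:
    """Map each file path named in a timeout warning to its reported timeout count."""
    return {path: int(count) for count, path in TIMEOUT_WARNING.findall(log_text)}
```

Each returned path is a candidate for a new glob under paths.exclude.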

Workflow

Step 1: Define the Vulnerability

Before writing any YAML, answer these questions:

code
Vulnerability Type: [e.g., Command Injection, SSRF, SQLi]
CWE ID: [e.g., CWE-78]
Security Impact: [e.g., Remote code execution as web server user]
Vulnerable Pattern: [e.g., os.system() with user-controlled input]
Exploit Scenario: [e.g., Attacker controls filename parameter, injects shell commands]

Find 2-3 real examples from the target codebase to guide pattern creation.

Step 2: Choose Rule Mode

Mode | Use When | Example
Pattern-based | Single function calls, hardcoded values, dangerous API usage | eval(), hardcoded secrets, weak crypto
Taint mode | Data flows from user input to dangerous sink | SQLi, XSS, command injection, SSRF

Decision guide:

  • "Is user input involved?" → Taint mode
  • "Is it a dangerous function regardless of input?" → Pattern mode
  • "Do I need to track data across variables/functions?" → Taint mode

Step 3: Write the Rule

Pattern-Based Rule Template

yaml
rules:
  - id: <org>-<vuln-type>-<specific-pattern>
    languages:
      - python
    message: |
      <Clear description of what was detected and why it's dangerous>

      Remediation: <Specific fix recommendation>
    severity: ERROR  # ERROR, WARNING, or INFO
    metadata:
      cwe: "CWE-XX"
      owasp:
        - "A03:2021-Injection"
      category: security
      confidence: HIGH  # HIGH, MEDIUM, LOW
      author: "Your Name"
      references:
        - https://cwe.mitre.org/data/definitions/XX.html
    patterns:
      - pattern-either:
          - pattern: dangerous_function($ARG)
          - pattern: other_dangerous_function($ARG)
      - pattern-not: safe_wrapper(...)
      - pattern-not-inside: |
          if $X is None:
              ...

Taint Mode Rule Template

yaml
rules:
  - id: <org>-<vuln-type>-taint
    mode: taint
    languages:
      - python  # or javascript, typescript, etc.
    # CRITICAL: Always include path exclusions for taint mode
    paths:
      exclude:
        - "**/node_modules/**"
        - "**/vendor/**"
        - "**/dist/**"
        - "**/build/**"
        - "**/*.min.js"
        - "**/*.min.mjs"
        - "**/*.bundle.js"
        - "**/*.chunk.js"
        - "**/*.chunk.mjs"
        - "**/*-init.mjs"
    message: |
      User input flows to <dangerous sink> without proper sanitization.
      This could allow <attack type>.

      Remediation: <Specific fix>
    severity: ERROR
    metadata:
      cwe: "CWE-XX"
      owasp:
        - "A03:2021-Injection"
      category: security
      confidence: HIGH
      author: "Your Name"
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json[...]
    pattern-sinks:
      # focus-metavariable must sit inside a patterns list, next to the pattern it narrows
      - patterns:
          - pattern: cursor.execute($QUERY, ...)
          - focus-metavariable: $QUERY
    pattern-sanitizers:
      - pattern: escape(...)
      - pattern: int(...)
      - pattern: parameterized_query(...)
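
To make the template concrete, here is a minimal sketch of the code shape its sink is meant to catch, next to the parameterized form it should leave alone. The in-memory sqlite3 table, the function names, and the payload are illustrative assumptions; in a real target the name argument would come from one of the sources listed above.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "admin"), ("bob", "user")],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list[str]:
    """Vulnerable shape: input is concatenated into the query string."""
    cursor = conn.cursor()
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    cursor.execute(query)  # tainted value reaches $QUERY
    return [row[0] for row in cursor.fetchall()]


def find_user_safe(conn: sqlite3.Connection, name: str) -> list[str]:
    """Safe shape: the query is a literal and the input is bound as a parameter."""
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM users WHERE name = ?", (name,))
    return [row[0] for row in cursor.fetchall()]
```

Calling find_user_unsafe with a payload such as "x' OR '1'='1" returns every row, while find_user_safe treats the same string as a plain value and returns nothing.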

Step 4: Reduce False Positives

This is the most critical step. For every rule, consider:

Exclusion patterns to add:

yaml
# Exclude hardcoded/literal strings (not user input)
- pattern-not: $FUNC("...", ...)

# Exclude safe wrappers
- pattern-not: safe_execute(...)

# Exclude already-validated contexts
- pattern-not-inside: |
    if validate($INPUT):
        ...

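# Exclude test files (if not already in .semgrepignore).
# A name prefix cannot be written as "def test_...:" in a pattern,
# so exclude tests by path at the rule level instead:
paths:
  exclude:
    - "**/tests/**"
    - "**/test_*.py"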

Common FP sources:

  • Hardcoded strings (not user-controlled)
  • Test/example code
  • Already-sanitized inputs
  • Framework auto-escaping
  • Admin-only code paths

Step 5: Test the Rule

Create a test file alongside the rule:

code
custom-rules/custom/novel-vulns/
├── command-injection-eval.yml
└── command-injection-eval.py    # Test cases

Test file format:

python
# ruleid: command-injection-eval
eval(user_input)

# ruleid: command-injection-eval
exec(request.args.get('code'))

# ok: command-injection-eval
eval("2 + 2")  # Hardcoded, safe

# ok: command-injection-eval
safe_eval(user_input)  # Uses sanitizer
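
A quick sanity check before running the tests is to confirm the file contains both positive (ruleid) and negative (ok) cases. The helper below is a hypothetical sketch that only counts annotation comments in the format shown above; it does not run Semgrep.

```python
import re
from collections import Counter

# Matches annotation comments such as "# ruleid: my-rule" or "# ok: my-rule".
ANNOTATION = re.compile(r"#\s*(ruleid|ok):\s*([\w.-]+)")


def count_annotations(source: str) -> Counter:
    """Count ruleid/ok annotations so missing negative or positive cases stand out."""
    counts = Counter()
    for line in source.splitlines():
        match = ANNOTATION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A test file with zero ok annotations says nothing about false positives, so aim for at least one of each kind per rule.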

Run validation:

bash
# Test rule syntax and test cases
semgrep --config custom-rules/custom/novel-vulns/command-injection-eval.yml \
        --test custom-rules/custom/novel-vulns/

# Test against real target repo
semgrep --config custom-rules/custom/novel-vulns/command-injection-eval.yml \
        repos/<org>/<repo>/

# Count findings
semgrep --config custom-rules/custom/novel-vulns/command-injection-eval.yml \
        repos/<org>/ --json | jq '.results | length'
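
When several rules run at once, a per-rule breakdown is often more useful than a single total. The sketch below assumes the --json report has a top-level results list whose entries carry a check_id field; the function name is hypothetical.

```python
import json
from collections import Counter


def findings_per_rule(report_json: str) -> Counter:
    """Count findings per rule ID in a Semgrep JSON report passed as a string."""
    report = json.loads(report_json)
    # Assumes each entry in "results" has a "check_id" naming the rule that fired.
    return Counter(result["check_id"] for result in report.get("results", []))
```

A rule that dominates the counts on a first run is usually the one that needs more exclusions.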

Step 5b: Test Performance (CRITICAL for Taint Mode)

Taint mode rules can time out on large files. Always test on repos with bundled JS:

bash
# Test against a repo known to have bundled files
time semgrep --config my-rule.yaml repos/<org>/<repo-with-bundles>/ 2>&1 | grep -E "(timeout|Error|Ran)"

Watch for these warning signs:

code
Warning: 3 timeout error(s) in path/to/file.mjs when running rules...

If you see timeouts:

  1. Check which files are causing issues:

    bash
    ls -la path/to/problematic/file.mjs  # Check file size
    head -c 200 path/to/problematic/file.mjs  # Check if minified
    
  2. Add path exclusions to your rule:

    yaml
    paths:
      exclude:
        - "**/path/pattern/*.mjs"
    
  3. Re-test until no timeouts:

    bash
    # Should complete in seconds, not timeout
    time semgrep --config my-rule.yaml repos/<org>/<repo>/
    

Performance targets:

Repo Size | Expected Time | Action if Slower
Small (<100 files) | < 5 seconds | Check for bundled files
Medium (100-1000 files) | < 30 seconds | Add path exclusions
Large (1000+ files) | < 2 minutes | Verify exclusions working

Verify findings still work after exclusions:

bash
# Run on source directory only (where real vulns are)
semgrep --config my-rule.yaml repos/<org>/<repo>/src/

Step 6: Integrate with Pipeline

Rules in custom-rules/ are automatically included when running:

bash
./scripts/scan-semgrep.sh <org-name>

To use only your custom rule:

bash
semgrep --config custom-rules/custom/novel-vulns/my-rule.yml repos/<org>/

Pattern Operators Reference

Basic Matching

Operator | Purpose | Example
pattern | Match exact code | os.system($CMD)
pattern-either | Match any (OR) | Multiple dangerous functions
patterns | Match all (AND) | Function + constraint

Metavariables

Syntax | Meaning
$VAR | Capture any expression
$_ | Match anything (no capture)
$...ARGS | Match multiple arguments
<... $X ...> | Match $X nested at any depth
... | Match any statements between

Exclusions (Critical for FP reduction)

yaml
pattern-not: safe_function(...)           # Exclude specific pattern
pattern-not-inside: |                     # Exclude if inside context
  if validated($X):
      ...

Metavariable Constraints

yaml
# Regex match on captured variable
metavariable-regex:
  metavariable: $FUNC
  regex: "(system|exec|popen)"

# Pattern match on captured variable
metavariable-pattern:
  metavariable: $ARG
  pattern-either:
    - pattern: request.args[...]
    - pattern: request.form[...]

# Entropy analysis (detect secrets)
metavariable-analysis:
  analyzer: entropy
  metavariable: $VALUE

# Highlight specific variable in output
focus-metavariable: $DANGEROUS_ARG

Taint Mode Operators

yaml
mode: taint                    # Enable taint tracking

pattern-sources:               # Where tainted data enters
  - pattern: request.args[...]

pattern-sinks:                 # Where tainted data causes harm
  - patterns:                  # focus-metavariable needs a patterns list
      - pattern: cursor.execute($Q)
      - focus-metavariable: $Q

pattern-sanitizers:            # Functions that clean data
  - pattern: escape(...)
  - pattern: int(...)

pattern-propagators:           # Custom taint spread (Pro only)
  - pattern: $TO = transform($FROM)
    from: $FROM
    to: $TO

Common Rule Patterns

Command Injection

yaml
patterns:
  - pattern-either:
      - pattern: os.system($CMD)
      - pattern: os.popen($CMD)
      - pattern: subprocess.call($CMD, shell=True, ...)
      - pattern: subprocess.Popen($CMD, shell=True, ...)
  - pattern-not: $FUNC("...", ...)  # Exclude hardcoded strings
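
For reference, this is roughly what the matched and unmatched shapes look like in a custom exec wrapper. The sketch only builds commands and never runs them; the wc example and the function names are illustrative assumptions.

```python
import shlex


def shell_command_unsafe(filename: str) -> str:
    """Vulnerable shape: input concatenated into a string destined for a shell."""
    return "wc -l " + filename


def argv_safe(filename: str) -> list[str]:
    """Safer shape: an argument list for subprocess without shell=True."""
    # "--" stops option parsing so a leading "-" in filename is not read as a flag.
    return ["wc", "-l", "--", filename]


def shell_command_quoted(filename: str) -> str:
    """If a shell string is unavoidable, quote the untrusted part."""
    return "wc -l -- " + shlex.quote(filename)
```

With a payload like "notes.txt; id", the unsafe string contains a second shell command, while the list form keeps the payload as a single argument.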

SQL Injection (Taint)

yaml
mode: taint
pattern-sources:
  - pattern: request.$METHOD[...]
  - pattern: request.$METHOD.get(...)
pattern-sinks:
  - pattern: $CURSOR.execute($QUERY, ...)
  - pattern: $CURSOR.executemany($QUERY, ...)
pattern-sanitizers:
  - pattern: $CURSOR.execute("...", ($PARAM,))  # Parameterized

Hardcoded Secrets

yaml
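patterns:
  - pattern: $VAR = "$VALUE"              # Capture the string contents too
  - metavariable-regex:
      metavariable: $VAR
      # Leading .* allows prefixed names such as db_password
      regex: "(?i).*(password|secret|api_key|token|private_key)"
  - metavariable-analysis:
      analyzer: entropy
      metavariable: $VALUE                # Entropy of the value, not the variable name
  # Comments such as "# Example: ..." cannot be matched by a pattern;
  # exclude docs and example directories with paths.exclude instead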
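
As background on why entropy helps here, the sketch below computes Shannon entropy per character. It illustrates the general idea only and is not a description of how Semgrep's entropy analyzer is implemented; the function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Return the Shannon entropy of a string in bits per character."""
    if not value:
        return 0.0
    counts = Counter(value)
    total = len(value)
    # Sum -p * log2(p) over the frequency p of each distinct character.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())
```

Random-looking keys score higher than dictionary words of the same length, which is why an entropy check filters out placeholders like "password" or "changeme".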

Insecure Cryptography

yaml
pattern-either:
  - pattern: hashlib.md5(...)
  - pattern: hashlib.sha1(...)
  - pattern: DES.new(...)
  - pattern: Blowfish.new(...)
  - pattern: ARC4.new(...)

Path Traversal

yaml
mode: taint
pattern-sources:
  - pattern: request.args.get("...")
  - pattern: request.form["..."]
pattern-sinks:
  - pattern: open($PATH, ...)
  - pattern: os.path.join(..., $PATH, ...)
pattern-sanitizers:
  - pattern: os.path.basename(...)
  - pattern: secure_filename(...)
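
The effect of the os.path.basename sanitizer can be seen in a small sketch. It uses posixpath (the POSIX implementation behind os.path) so the results are the same on every platform; the upload root and function names are hypothetical.

```python
import posixpath

UPLOAD_ROOT = "/srv/uploads"  # hypothetical base directory


def resolve_unsafe(filename: str) -> str:
    """Vulnerable shape: user input joined straight into a path."""
    return posixpath.normpath(posixpath.join(UPLOAD_ROOT, filename))


def resolve_sanitized(filename: str) -> str:
    """Sanitized shape: directory components are stripped before joining."""
    return posixpath.join(UPLOAD_ROOT, posixpath.basename(filename))
```

For the input "../../etc/passwd", the unsafe version resolves to /etc/passwd, while the sanitized version stays inside the upload root.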

Metadata Standards

Every rule MUST include:

yaml
metadata:
  # Required
  cwe: "CWE-78"                      # Primary CWE ID
  category: security                  # Always "security" for vulns
  confidence: HIGH                    # HIGH, MEDIUM, LOW

  # Recommended
  owasp:
    - "A03:2021-Injection"           # OWASP Top 10 2021
  likelihood: HIGH                    # Exploitation probability
  impact: HIGH                        # Damage if exploited
  subcategory:
    - vuln                           # vuln, audit, guardrail

  # For custom rules
  author: "Your Name"
  created: "2025-01-15"
  tested_against: "org-name"         # Where you validated it
  references:
    - https://cwe.mitre.org/...
    - https://blog.example.com/...   # Writeups explaining the vuln

Severity Guidelines

Severity | Use For | Examples
ERROR | Exploitable vulns with high impact | RCE, SQLi, auth bypass
WARNING | Likely vulns needing verification | Potential XSS, weak crypto
INFO | Code smells, audit points | Missing headers, debug code

Pro Engine Features

When running with --pro (our default), you get:

  • Cross-file taint tracking - Follow data across imports
  • Interprocedural analysis - Track through function calls
  • Field sensitivity - Track object properties

These are automatic; no rule changes needed.

Debugging Rules

Rule not matching expected code?

bash
# Verbose output shows matching attempts
semgrep --config rule.yml target/ --debug

# Test specific pattern interactively
semgrep --lang python --pattern 'os.system($X)' target/

Too many false positives?

  • Add pattern-not for safe patterns
  • Add pattern-not-inside for safe contexts
  • Use metavariable-regex to constrain variable names
  • Lower confidence in metadata if FPs are expected

Output

Save completed rules to:

code
custom-rules/custom/
├── org-specific/<org-name>/    # Org-targeted rules
└── novel-vulns/                # General novel patterns

Rules are automatically picked up by ./scripts/scan-semgrep.sh.
