Semgrep Static Analysis
When to Use Semgrep
Ideal scenarios:
- •Quick security scans (minutes, not hours)
- •Pattern-based vulnerability detection
- •Enforcing coding standards and best practices
- •Finding known vulnerability patterns (OWASP Top 10, CWE Top 25)
- •Intra-file taint analysis and data flow tracking
- •Custom rule development for specific code patterns
- •First-pass security analysis before deeper tools
- •CI/CD security gates for fast feedback
- •Multi-language security scanning
Complements other tools:
- •Use before manual code review to catch common patterns
- •Combine with SARIF Issue Reporter for detailed findings
- •Use alongside CodeQL for comprehensive coverage
- •Pair with dependency scanners (OSV-Scanner, Depscan)
Consider CodeQL instead when:
- •Need interprocedural taint tracking across files
- •Complex data flow analysis across modules required
- •Analyzing custom proprietary frameworks with deep integration
When NOT to Use
Do NOT use this skill for:
- •Complex interprocedural data flow analysis (use CodeQL instead)
- •Binary analysis or compiled code without source
- •Custom deep semantic analysis requiring AST/CFG traversal
- •Tracking taint across many function boundaries and files
- •Secrets detection (use Gitleaks)
- •Dependency vulnerability scanning (use OSV-Scanner or Depscan)
- •IaC security analysis (use KICS)
- •API endpoint discovery (use Noir)
Installation
bash
# pip
python3 -m pip install semgrep
# pipx (recommended)
pipx install semgrep
# Homebrew
brew install semgrep
# Docker
docker pull returntocorp/semgrep:latest
docker run --rm -v "${PWD}:/src" returntocorp/semgrep semgrep --config auto /src
# Update
pip install --upgrade semgrep
# Verify
semgrep --version
Core Workflow
1. Quick Scan
bash
semgrep --config auto .                      # Auto-detect rules for the project
semgrep --config p/default --metrics=off .   # Disable telemetry for proprietary code (--config auto needs metrics enabled)
2. Use Rulesets
bash
semgrep --config p/<RULESET> .                                # Single ruleset
semgrep --config p/security-audit --config p/trailofbits .   # Multiple rulesets
| Ruleset | Description |
|---|---|
| p/default | General security and code quality |
| p/security-audit | Comprehensive security rules |
| p/owasp-top-ten | OWASP Top 10 vulnerabilities |
| p/cwe-top-25 | CWE Top 25 vulnerabilities |
| p/r2c-security-audit | r2c security audit rules |
| p/trailofbits | Trail of Bits security rules |
| p/python | Python-specific |
| p/javascript | JavaScript-specific |
| p/golang | Go-specific |
3. Output Formats
bash
# SARIF output (for CI/CD)
semgrep --config p/security-audit --sarif -o results.sarif .
# JSON output
semgrep --config p/security-audit --json -o results.json .
# Text output with dataflow traces
semgrep --config p/security-audit --dataflow-traces .
# JUnit XML
semgrep --config p/security-audit --junit-xml -o results.xml .
# GitLab SAST format
semgrep --config p/security-audit --gitlab-sast -o gl-sast-report.json .
# Vim quickfix
semgrep --config p/security-audit --vim .
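A report written with --json can be post-processed in a script. The sketch below assumes the report has a top-level `results` list whose entries carry `check_id`, `path`, and `extra.severity`; the `summarize` helper and the sample data are illustrative and hand-written, not real scan output.

```python
import json
from collections import Counter


def summarize(report: dict) -> Counter:
    """Count findings per severity in a Semgrep JSON report."""
    counts = Counter()
    for finding in report.get("results", []):
        # Severity lives under "extra"; fall back to UNKNOWN if it is absent.
        counts[finding.get("extra", {}).get("severity", "UNKNOWN")] += 1
    return counts


# Hand-written sample in the shape of results.json (not real scan output).
sample = json.loads("""
{"results": [
  {"check_id": "rules.sql-injection", "path": "app.py",
   "start": {"line": 10}, "extra": {"severity": "ERROR", "message": "m"}},
  {"check_id": "rules.weak-hash", "path": "util.py",
   "start": {"line": 3}, "extra": {"severity": "WARNING", "message": "m"}},
  {"check_id": "rules.sql-injection", "path": "db.py",
   "start": {"line": 7}, "extra": {"severity": "ERROR", "message": "m"}}
], "errors": []}
""")

severity_counts = summarize(sample)
print(dict(severity_counts))
```

In practice the dictionary would come from `json.load()` on the file passed to -o, and the counts could feed a dashboard or a build gate.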
4. Scan Specific Paths
bash
# Single file
semgrep --config p/python app.py
# Specific directory
semgrep --config p/javascript src/
# Limit the scan to paths matching a glob (--include narrows the target set)
semgrep --config auto --include='**/test/**' .
# Exclude paths
semgrep --config auto --exclude='vendor' --exclude='node_modules' .
# Multiple languages
semgrep --config p/python --config p/javascript .
5. Advanced Features
bash
# Enable Pro Engine cross-file (interfile) analysis (requires license)
semgrep --config p/security-audit --pro .
# Pro Engine cross-function analysis limited to single files
semgrep --config p/security-audit --pro-intrafile .
# Disable telemetry (use a named ruleset; --config auto needs metrics enabled)
semgrep --config p/default --metrics=off .
# Verbose output
semgrep --config p/security-audit --verbose .
# Quiet mode (only show findings)
semgrep --config p/security-audit --quiet .
Writing Custom Rules
Basic Structure
yaml
rules:
  - id: hardcoded-password
    languages: [python]
    message: "Hardcoded password detected: $PASSWORD"
    severity: ERROR
    pattern: password = "$PASSWORD"
Pattern Syntax
| Syntax | Description | Example |
|---|---|---|
| ... | Match anything | func(...) |
| $VAR | Capture metavariable | $FUNC($INPUT) |
| <... ...> | Deep expression match | <... user_input ...> |
Pattern Operators
| Operator | Description |
|---|---|
| pattern | Match exact pattern |
| patterns | All must match (AND) |
| pattern-either | Any matches (OR) |
| pattern-not | Exclude matches |
| pattern-inside | Match only inside context |
| pattern-not-inside | Match only outside context |
| pattern-regex | Regex matching |
| metavariable-regex | Regex on captured value |
| metavariable-comparison | Compare values |
Combining Patterns
yaml
rules:
  - id: sql-injection
    languages: [python]
    message: "Potential SQL injection"
    severity: ERROR
    patterns:
      - pattern-either:
          - pattern: cursor.execute($QUERY)
          - pattern: db.execute($QUERY)
      # pattern-not takes a pattern string directly, not a nested list
      - pattern-not: cursor.execute("...", (...))
      - metavariable-regex:
          metavariable: $QUERY
          regex: .*\+.*|.*\.format\(.*|.*%.*
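To make the intent of the rule concrete, here is a small runnable sketch of the two call shapes it is meant to separate. It uses an in-memory `sqlite3` database purely for illustration; the table, the sample rows, and the `user_id` payload are assumptions, and whether a given Semgrep version reports exactly these lines should be confirmed with a rule test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE users (id INTEGER, name TEXT)")
cursor.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

user_id = "1 OR 1=1"  # attacker-controlled value (hypothetical)

# Intended match: a single argument whose text contains `+`,
# which is what the metavariable-regex looks for.
cursor.execute("SELECT name FROM users WHERE id = " + user_id + " ORDER BY id")
leaked = cursor.fetchall()

# Intended non-match: a literal query with the value passed separately.
cursor.execute("SELECT name FROM users WHERE id = ?", (user_id,))
safe = cursor.fetchall()

print(leaked, safe)
```

The concatenated query returns every row because the payload becomes part of the SQL text, while the parameterized query treats the payload as a single value and returns nothing.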
Taint Mode (Data Flow)
Simple pattern matching finds obvious cases:
python
# Pattern `os.system(user_input)` catches this:
os.system(user_input)  # Found
But misses indirect flows:
python
# The same pattern misses this:
cmd = user_input
processed = cmd.strip()
os.system(processed)  # Missed - no direct match
Taint mode tracks data through assignments and transformations:
- •Source: Where untrusted data enters (user_input)
- •Propagators: How it flows (cmd = ..., processed = ...)
- •Sanitizers: What makes it safe (shlex.quote())
- •Sink: Where it becomes dangerous (os.system())
yaml
rules:
  - id: command-injection
    languages: [python]
    message: "User input flows to command execution"
    severity: ERROR
    mode: taint
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json
    pattern-sinks:
      - pattern: os.system($SINK)
      - pattern: subprocess.call($SINK, shell=True)
      - pattern: subprocess.run($SINK, shell=True, ...)
    pattern-sanitizers:
      - pattern: shlex.quote(...)
      - pattern: int(...)
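The sanitizer entry is easier to judge when the effect of `shlex.quote()` is visible. The sketch below only builds command strings and never executes them; `build_unsafe`, `build_sanitized`, and the payload are hypothetical names chosen for illustration.

```python
import shlex


def build_unsafe(user_input: str) -> str:
    # Tainted value reaches the command string unchanged.
    return "echo " + user_input


def build_sanitized(user_input: str) -> str:
    # shlex.quote() wraps the value so the shell sees one literal argument.
    return "echo " + shlex.quote(user_input)


payload = "hello; rm -rf /tmp/x"
unsafe_cmd = build_unsafe(payload)
safe_cmd = build_sanitized(payload)
print(unsafe_cmd)
print(safe_cmd)
```

In the unsafe string the `;` would start a second shell command, whereas the quoted form keeps the whole payload inside one single-quoted argument. A sanitizer listed in a rule should give this kind of guarantee for the sink in question; `int(...)` qualifies for a similar reason, since it either yields a number or raises.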
Full Rule with Metadata
yaml
rules:
  - id: flask-sql-injection
    languages: [python]
    message: "SQL injection: user input flows to query without parameterization"
    severity: ERROR
    metadata:
      cwe: "CWE-89: SQL Injection"
      owasp: "A03:2021 - Injection"
      confidence: HIGH
    mode: taint
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json
    pattern-sinks:
      - pattern: cursor.execute($QUERY)
      - pattern: db.execute($QUERY)
    pattern-sanitizers:
      - pattern: int(...)
    fix: cursor.execute($QUERY, (params,))
Testing Rules
Test File Format
python
# test_rule.py
def test_vulnerable():
    user_input = request.args.get("id")
    # ruleid: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = " + user_input)

def test_safe():
    user_input = request.args.get("id")
    # ok: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = ?", (user_input,))
bash
semgrep --test rules/
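Each annotation comment applies to the line directly below it: `ruleid:` marks a line the rule must report, and `ok:` marks a line it must not. The sketch below is a hypothetical helper (not part of Semgrep) that lists the expectations a test file encodes, which can help when reviewing large test suites.

```python
import re

ANNOTATION = re.compile(r"#\s*(ruleid|ok):\s*([\w.-]+)")


def expected_annotations(source: str) -> list[tuple[int, str, str]]:
    """Return (target_line, kind, rule_id) for each annotation comment.

    The annotation sits on the line above the code it describes, so the
    target line is the annotation's line number plus one.
    """
    found = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = ANNOTATION.search(line)
        if match:
            found.append((lineno + 1, match.group(1), match.group(2)))
    return found


sample = '''def test_vulnerable():
    user_input = request.args.get("id")
    # ruleid: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = " + user_input)

def test_safe():
    user_input = request.args.get("id")
    # ok: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = ?", (user_input,))
'''

annotations = expected_annotations(sample)
for entry in annotations:
    print(entry)
```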
CI/CD Integration (GitHub Actions)
yaml
name: Semgrep
on:
  push:
    branches: [main]
  pull_request:
  schedule:
    - cron: '0 0 1 * *' # Monthly
jobs:
  semgrep:
    runs-on: ubuntu-latest
    container:
      image: returntocorp/semgrep
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Required for diff-aware scanning
      - name: Run Semgrep
        run: |
          if [ "${{ github.event_name }}" = "pull_request" ]; then
            semgrep ci --baseline-commit ${{ github.event.pull_request.base.sha }}
          else
            semgrep ci
          fi
        env:
          SEMGREP_RULES: >-
            p/security-audit
            p/owasp-top-ten
            p/trailofbits
Configuration
.semgrepignore
code
tests/fixtures/
**/testdata/
generated/
vendor/
node_modules/
Suppress False Positives
python
password = get_from_vault()  # nosemgrep: hardcoded-password
dangerous_but_safe()  # nosemgrep
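A bare `nosemgrep` comment silences every rule on that line, so it is worth auditing. The following sketch is a hypothetical helper (not a Semgrep feature) that reports lines whose suppression comment names no rule ID; the regular expression is an assumption about how such comments are written in Python sources.

```python
import re

NOSEMGREP = re.compile(r"#\s*nosemgrep(?::\s*(?P<ids>[\w.,\s-]+))?")


def blanket_suppressions(source: str) -> list[int]:
    """Return line numbers whose nosemgrep comment names no rule ID."""
    lines = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = NOSEMGREP.search(line)
        if match and not (match.group("ids") or "").strip():
            lines.append(lineno)
    return lines


sample = (
    "password = get_from_vault()  # nosemgrep: hardcoded-password\n"
    "dangerous_but_safe()  # nosemgrep\n"
)
flagged = blanket_suppressions(sample)
print(flagged)
```

Only the second line is reported, because the first names the rule it suppresses.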
Performance
bash
semgrep --config rules/ --time .   # Check rule performance
ulimit -n 4096                     # Increase file descriptors for large codebases
Path Filtering in Rules
yaml
rules:
  - id: my-rule
    paths:
      include: [src/]
      exclude: [src/generated/]
Common Use Cases
1. Comprehensive Security Audit
bash
# Multi-ruleset scan with SARIF output
semgrep scan \
  --config p/security-audit \
  --config p/owasp-top-ten \
  --config p/cwe-top-25 \
  --sarif -o security-audit.sarif \
  .
2. Language-Specific Deep Scan
bash
# Python with taint mode
semgrep scan \
  --config p/python \
  --config p/flask \
  --config p/django \
  --dataflow-traces \
  --sarif -o python-security.sarif \
  ./backend
# JavaScript/TypeScript
semgrep scan \
  --config p/javascript \
  --config p/typescript \
  --config p/react \
  --sarif -o js-security.sarif \
  ./frontend
3. Custom Rules with Existing Rulesets
bash
# Combine custom and community rules
semgrep scan \
  --config ./custom-rules \
  --config p/security-audit \
  --sarif -o combined-scan.sarif \
  .
4. CI/CD Diff Scanning
bash
# Scan only changed files (PR context)
git diff --name-only origin/main...HEAD | \
  xargs semgrep scan --config p/security-audit --sarif -o diff-scan.sarif
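When the changed-file list is empty or contains only files that are not worth scanning, piping it straight into the scanner can behave unexpectedly, depending on the xargs implementation. The sketch below shows one way to build the command in a script instead; `build_scan_command` and `SCANNED_SUFFIXES` are hypothetical names, and the suffix set is an assumption to adapt per project.

```python
from pathlib import PurePosixPath

SCANNED_SUFFIXES = {".py", ".js", ".ts"}  # assumption: languages of interest


def build_scan_command(changed: list[str], config: str, output: str) -> list[str]:
    """Build a `semgrep scan` argv for the changed files, or [] if none apply."""
    targets = [p for p in changed if PurePosixPath(p).suffix in SCANNED_SUFFIXES]
    if not targets:
        return []  # nothing to scan; the caller should skip the invocation
    return ["semgrep", "scan", "--config", config, "--sarif", "-o", output, *targets]


changed_files = ["src/app.py", "README.md", "web/index.ts"]
command = build_scan_command(changed_files, "p/security-audit", "diff-scan.sarif")
print(" ".join(command))
```

The returned list can be handed to `subprocess.run()` when it is non-empty; the list of changed files would come from the `git diff --name-only` call shown above.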
Understanding Output
SARIF Structure
Semgrep's SARIF v2.1.0 output includes:
- •Rules: Each Semgrep rule with metadata
- •Results: Specific code locations matching patterns
- •Properties:
- •Severity: ERROR, WARNING, INFO
- •CWE and OWASP mappings
- •Confidence levels
- •Fix suggestions (if available)
- •Dataflow traces (if enabled)
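The nesting described above can be walked with a few lines of code. This sketch relies only on fields defined by the SARIF v2.1.0 format (`runs`, `results`, `ruleId`, `level`, `locations`, `physicalLocation`); the `iter_sarif_findings` helper and the sample log are hand-written for illustration and are not real Semgrep output.

```python
def iter_sarif_findings(sarif: dict):
    """Yield (rule_id, level, uri, start_line) for every result in a SARIF log."""
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # A result may list several locations; take the first if present.
            location = (result.get("locations") or [{}])[0]
            physical = location.get("physicalLocation", {})
            yield (
                result.get("ruleId"),
                result.get("level"),
                physical.get("artifactLocation", {}).get("uri"),
                physical.get("region", {}).get("startLine"),
            )


sample = {
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {"name": "Semgrep", "rules": [{"id": "flask-sql-injection"}]}},
        "results": [{
            "ruleId": "flask-sql-injection",
            "level": "error",
            "message": {"text": "SQL injection"},
            "locations": [{"physicalLocation": {
                "artifactLocation": {"uri": "app.py"},
                "region": {"startLine": 12},
            }}],
        }],
    }],
}

findings = list(iter_sarif_findings(sample))
print(findings)
```

A result may omit `level`, in which case the helper yields `None` for it; consumers typically fall back to the rule's default level.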
Result Categories
| Severity | Meaning |
|---|---|
| ERROR | High-confidence security vulnerability |
| WARNING | Potential security issue requiring review |
| INFO | Code smell or best practice violation |
Autofix
bash
# Show available fixes
semgrep scan --config p/security-audit --autofix --dryrun .
# Apply fixes automatically
semgrep scan --config p/security-audit --autofix .
# Review fixes before applying
semgrep scan --config p/security-audit --autofix --dryrun . | less
Third-Party Rules
bash
# Trail of Bits rules
git clone https://github.com/trailofbits/semgrep-rules.git
semgrep scan -f semgrep-rules/rules --sarif -o results.sarif .
# Semgrep Registry
semgrep scan --config "r/trailofbits" .
# Custom remote rules
semgrep scan --config https://example.com/custom-rules.yaml .
Advanced Rule Development
Using Metavariable Propagation
yaml
rules:
  - id: context-aware-xss
    languages: [javascript]
    message: "XSS: User input flows to innerHTML"
    severity: ERROR
    mode: taint
    pattern-sources:
      - pattern: req.query.$PARAM
    pattern-propagators:
      # `from` and `to` each name a metavariable bound by `pattern`
      - pattern: $TO = $FROM.toString()
        from: $FROM
        to: $TO
      - pattern: $TO.push($FROM)
        from: $FROM
        to: $TO
    pattern-sinks:
      - pattern: $ELEMENT.innerHTML = $DATA
    pattern-sanitizers:
      - pattern: DOMPurify.sanitize($X)
Focus Metavariables
yaml
rules:
  - id: sql-injection-advanced
    languages: [python]
    message: "SQL injection via string formatting"
    severity: ERROR
    # focus-metavariable and metavariable-regex must sit under `patterns`
    patterns:
      - pattern: $CURSOR.execute($QUERY)
      - focus-metavariable: $QUERY
      - metavariable-regex:
          metavariable: $QUERY
          regex: .*(\+|format|%).*
Performance Optimization
bash
# Limit to specific file types
semgrep scan --include='*.py' --include='*.js' .
# Increase timeout for large files
semgrep scan --timeout 60 .
# Use baseline for faster incremental scans
semgrep scan --baseline-commit HEAD~1 .
# Parallel processing (default uses all CPUs)
semgrep scan --jobs 4 .
# Disable expensive rules
semgrep scan --config p/security-audit --exclude-rule 'expensive-rule-id' .
Supported Languages
Semgrep supports 30+ languages:
- •Web: JavaScript, TypeScript, JSX, TSX, HTML
- •Backend: Python, Go, Java, Kotlin, Scala, C#
- •Systems: C, C++, Rust
- •Mobile: Swift, Kotlin, Java, Objective-C
- •Scripting: Ruby, PHP, Bash, Lua, Perl
- •Infrastructure: Terraform, Dockerfile, YAML, JSON
- •Data: SQL (generic)
- •Other: Elixir, Clojure, Solidity, Apex, R
Semgrep Pro vs Community Edition
| Feature | Community | Pro |
|---|---|---|
| Pattern matching | ✓ | ✓ |
| Intra-file taint | ✓ | ✓ |
| Custom rules | ✓ | ✓ |
| SARIF output | ✓ | ✓ |
| Cross-file analysis | ✗ | ✓ |
| Interfile taint | ✗ | ✓ |
| Supply chain | ✗ | ✓ |
| Secrets detection | ✗ | ✓ |
| Assistant (AI) | ✗ | ✓ |
Troubleshooting
Common Issues
bash
# Rule parsing errors
semgrep scan --validate --config custom-rules.yaml
# Timeout on large files
semgrep scan --timeout 120 .
# Memory issues (value in MB)
semgrep scan --max-memory 4000 .
# Debug mode
semgrep scan --debug --config p/security-audit .
Rule Testing
bash
# Test rules against test files
semgrep scan --test rules/
# Validate rule syntax
semgrep scan --validate --config rules/my-rule.yaml
# Benchmark rules
semgrep scan --time --config rules/ test-codebase/
Limitations
- •Cross-file limited: Intra-file taint only in Community Edition
- •Pattern-based: Can't understand complex business logic
- •Performance: Large codebases with many rules can be slow
- •False positives: Regex patterns may over-match
- •Language gaps: Some languages have limited rule coverage
Rationalizations to Reject
| Shortcut | Why It's Wrong |
|---|---|
| "Semgrep found nothing, code is clean" | Semgrep is pattern-based; it can't track complex data flow across functions |
| "I wrote a rule, so we're covered" | Rules need testing with semgrep --test; false negatives are silent |
| "Taint mode catches injection" | Only if you defined all sources, sinks, AND sanitizers correctly |
| "Pro rules are comprehensive" | Pro rules are good but not exhaustive; supplement with custom rules for your codebase |
| "Too many findings = noisy tool" | High finding count often means real problems; tune rules, don't disable them |
References
- •Registry: https://semgrep.dev/explore
- •Playground: https://semgrep.dev/playground
- •Documentation: https://semgrep.dev/docs/
- •Rule Examples: https://semgrep.dev/docs/writing-rules/rule-ideas
- •Pattern Syntax: https://semgrep.dev/docs/writing-rules/pattern-syntax
- •Trail of Bits Rules: https://github.com/trailofbits/semgrep-rules
- •OWASP Rules: https://semgrep.dev/p/owasp-top-ten
- •Blog: https://semgrep.dev/blog/
- •GitHub Action: https://github.com/returntocorp/semgrep-action
- •SARIF Spec: https://docs.oasis-open.org/sarif/sarif/v2.1.0/sarif-v2.1.0.html
- •Initial Source: Trail of Bits skills