Finding Duplicate Functions
Overview
LLM-generated codebases accumulate semantic duplicates - functions that accomplish identical goals through different implementations. Traditional duplication detection tools miss these because the code looks different even when the intent is the same.
Core principle: Combine classical extraction (shell scripts) with LLM-powered clustering to recognize intent-based similarities that pattern matching misses.
When to Use
Use this skill when:
- •Auditing codebase for consolidation opportunities
- •Reviewing LLM-generated code before merge
- •Preparing for major refactoring
- •Codebase has grown organically without strict architecture
- •Multiple developers (or LLM sessions) added similar functionality
High-risk areas for duplication:
- •Utility functions (string manipulation, date formatting)
- •Validation logic (email, phone, input checks)
- •Error formatting and handling
- •Path manipulation
- •API response transformations
- •Configuration parsing
The Process
Phase 1: Extract Function Signatures
Extract all exported/public functions with their signatures.
# JavaScript/TypeScript grep -rn "export function\|export const.*=.*=>" src/ \ | grep -v node_modules \ | grep -v ".test." \ | grep -v ".spec." \ > functions.txt # Python grep -rn "^def \|^async def " src/ \ | grep -v __pycache__ \ | grep -v test_ \ > functions.txt
Filter out:
- •Internal/private helpers (underscore prefix)
- •Test files
- •Generated code
- •Node modules / vendor directories
Phase 2: Categorize by Domain (Use Haiku)
Group functions by their likely domain/purpose. This is a classification task - fast model is sufficient.
Prompt for Haiku:
"Categorize these function signatures by domain (validation, formatting,
API, data-transformation, etc.). Group similar-intent functions together.
Output as JSON: { "category": ["func1", "func2", ...] }"
Categories must have 3+ functions to warrant duplicate analysis.
Phase 3: Split Into Category Files
# Create file per category for focused analysis
for category in validation formatting api; do
grep -f "${category}_funcs.txt" functions.txt > "${category}_analysis.txt"
done
Phase 4: Detect Duplicates (Use Opus)
For each category with 3+ functions, use Opus for precise semantic analysis.
Prompt for Opus: "Analyze these functions for semantic duplication. Two functions are duplicates if they: 1. Accomplish the same goal (even with different approaches) 2. Could be consolidated without loss of functionality 3. Differ only in naming, error handling style, or minor implementation For each duplicate pair, explain: - What both functions do - How they differ - Which to keep and why - Migration path for callers Output format: DUPLICATE_SET_1: - functionA (file:line) - functionB (file:line) INTENT: [what they both do] KEEP: [which one and why] MIGRATE: [how to update callers]"
Phase 5: Generate Consolidation Report
# Duplicate Function Report ## Summary - Functions analyzed: X - Duplicate sets found: Y - Estimated lines removable: Z ## High-Confidence Duplicates [Sets where consolidation is straightforward] ## Needs Review [Sets where the right choice isn't obvious] ## Recommended Actions 1. [Specific consolidation steps] 2. [Test coverage requirements]
Model Selection Guide
| Phase | Model | Reason |
|---|---|---|
| Categorize | Haiku | Cost-effective, classification is simple |
| Detect | Opus | Precision needed for subtle semantic matches |
| Report | Sonnet | Structured output, moderate complexity |
Critical: Haiku is cost-effective for categorization but misses subtle semantic duplicates. Always use Opus for the actual detection phase.
Common Duplicate Patterns
Pattern 1: Renamed Duplicates
// These do the same thing
function validateEmail(email) { ... }
function isValidEmailAddress(addr) { ... }
function checkEmailFormat(input) { ... }
Pattern 2: Implementation Variants
// Same intent, different approach
function formatDate(d) { return d.toISOString().split('T')[0]; }
function dateToString(date) { return `${date.getFullYear()}-${...}`; }
Pattern 3: Scope Duplicates
// Utilities duplicated in multiple modules
// utils/string.js
function capitalize(s) { ... }
// components/helpers.js
function capitalizeFirst(str) { ... }
Consolidation Checklist
Before removing duplicates:
□ Both functions have test coverage □ All callers identified (grep for function names) □ Edge case behavior matches (or is documented) □ Error handling is consistent □ Performance is acceptable □ Types/signatures are compatible □ Migration path documented
Common Mistakes
| Mistake | Consequence | Prevention |
|---|---|---|
| Skip categorization | Overwhelming analysis | Always group first |
| Use Haiku for detection | Miss subtle duplicates | Use Opus for detection |
| Consolidate without tests | Introduce regressions | Require test coverage |
| Extract internal helpers | False positives | Filter private functions |
| Ignore error handling differences | Runtime failures | Document edge cases |
Automation Script Template
#!/bin/bash
# find-duplicates.sh
PROJECT_DIR="${1:-.}"
OUTPUT_DIR="./duplicate-analysis"
mkdir -p "$OUTPUT_DIR"
# Phase 1: Extract
echo "Extracting function signatures..."
grep -rn "export function\|export const.*=>.*{" "$PROJECT_DIR/src" \
| grep -v node_modules \
| grep -v "\.test\." \
> "$OUTPUT_DIR/functions.txt"
echo "Found $(wc -l < "$OUTPUT_DIR/functions.txt") functions"
# Phase 2-5: Run with Claude
echo "Run categorization with Haiku, then detection with Opus"
echo "Functions extracted to: $OUTPUT_DIR/functions.txt"
Integration with CI
# .github/workflows/duplicate-check.yml
name: Duplicate Function Check
on:
pull_request:
paths: ['src/**/*.ts', 'src/**/*.js']
jobs:
check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Extract new functions
run: |
git diff --name-only origin/main | \
xargs grep -l "export function" | \
xargs grep "export function" > new_functions.txt
- name: Flag for review if high-risk area
run: |
if grep -q "utils\|helpers\|validation" new_functions.txt; then
echo "::warning::New utility functions added - check for duplicates"
fi
Real-World Impact
From a real codebase audit:
- •47 functions analyzed
- •8 duplicate sets identified
- •~400 lines of code removable
- •3 subtle bugs found (inconsistent error handling)
- •Result: Cleaner codebase, single source of truth