File Categorization
Categories
| Category | Description |
|---|---|
| Config | Configuration files for tools, environments, and build systems |
| Tests | Test files, fixtures, and testing utilities |
| Docs | Documentation, READMEs, and guides |
| Scripts | Standalone executable scripts and automation |
| Source Code | Core application/library source files |
| Data | Data files, datasets, and static assets |
| AI Tooling | AI/ML configs, prompts, and agent definitions |
| Other | Files that don't fit other categories (fallback) |
Category Mapping Rules
Always use exactly these 8 categories; do not invent new ones such as "Schema", "Database", or "Reference Data". Map edge cases as follows:
| File Type | Category | Reasoning |
|---|---|---|
| SQL DDL (CREATE TABLE) | Docs | Documents database structure |
| SQL DML (INSERT/SELECT) | Data | Contains or queries data |
| .duckdb, .sqlite files | Data | Database storage |
| schema.json, openapi.yaml | Docs | Specification/contract files |
| Shell scripts (.sh) | Scripts | Executable automation |
| requirements.txt | Config | Dependency configuration |
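The edge-case table above can be encoded as a small lookup. The following is a minimal sketch, not code from the bundled script: `Category`, `BY_NAME`, `BY_EXTENSION`, and `map_edge_case` are hypothetical names, and the SQL branch uses a crude first-keyword check (CREATE means DDL, anything else is treated as DML) purely for illustration.

```python
from enum import Enum
from pathlib import PurePosixPath


class Category(Enum):
    """The eight allowed categories; no others should be introduced."""
    CONFIG = "Config"
    TESTS = "Tests"
    DOCS = "Docs"
    SCRIPTS = "Scripts"
    SOURCE_CODE = "Source Code"
    DATA = "Data"
    AI_TOOLING = "AI Tooling"
    OTHER = "Other"


# Hypothetical lookup tables mirroring the edge-case table.
BY_NAME = {
    "schema.json": Category.DOCS,
    "openapi.yaml": Category.DOCS,
    "requirements.txt": Category.CONFIG,
}
BY_EXTENSION = {
    ".duckdb": Category.DATA,
    ".sqlite": Category.DATA,
    ".sh": Category.SCRIPTS,
}


def map_edge_case(path, content=""):
    """Return the Category for a listed edge case, or None if not listed."""
    p = PurePosixPath(path)
    if p.name in BY_NAME:
        return BY_NAME[p.name]
    suffix = p.suffix.lower()
    if suffix in BY_EXTENSION:
        return BY_EXTENSION[suffix]
    if suffix == ".sql":
        # Assumption: the first keyword decides DDL (Docs) vs. DML (Data).
        words = content.lstrip().split(None, 1)
        first = words[0].upper() if words else ""
        return Category.DOCS if first == "CREATE" else Category.DATA
    return None
```

Returning `None` for unlisted files leaves the decision to the general categorization rules rather than guessing.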
Directory Exclusion
When scanning directories, the script applies layered exclusion so that tokens are not wasted on dependency, cache, and build output:
Layer 1: Always Excluded (non-negotiable)
- node_modules/, bower_components/, jspm_packages/
- .git/, .svn/, .hg/
- __pycache__/, .pytest_cache/, .tox/, .mypy_cache/
- venv/, .venv/
Layer 2: .gitignore (if present and parseable)
- Respects project-specific exclusions
- Requires the pathspec library; warns if unavailable
Layer 3: Extended Defaults (fallback when no .gitignore)
- dist/, build/, out/, _build/, target/
- vendor/, coverage/, .nyc_output/, htmlcov/
- env/, *.egg-info/
Layer 4: Escape Hatches
- --include-ignored: Bypass Layers 2-3 (but NOT Layer 1)
- --include-all: Bypass ALL layers (use with extreme caution)
Important: .env as a FILE is categorized as Config. .env as a DIRECTORY is excluded.
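The layer ordering above can be sketched as a single decision function. This is a minimal illustration, not the bundled script's code: `exclusion_layer`, the two set names, and the `gitignore_match` callable (a stand-in for a parsed .gitignore) are hypothetical, and matching is done on bare directory names only.

```python
# Layer 1: always excluded unless --include-all is given.
ALWAYS_EXCLUDED = {
    "node_modules", "bower_components", "jspm_packages",
    ".git", ".svn", ".hg",
    "__pycache__", ".pytest_cache", ".tox", ".mypy_cache",
    "venv", ".venv",
}
# Layer 3: fallback defaults, used only when no .gitignore is usable.
EXTENDED_DEFAULTS = {
    "dist", "build", "out", "_build", "target",
    "vendor", "coverage", ".nyc_output", "htmlcov",
    "env",
}


def exclusion_layer(dirname, gitignore_match=None,
                    include_ignored=False, include_all=False):
    """Return the label of the layer excluding `dirname`, or None if kept.

    `gitignore_match` is a hypothetical callable (name -> bool); pass None
    when there is no .gitignore or it could not be parsed.
    """
    if include_all:
        return None  # Layer 4: bypass every layer
    if dirname in ALWAYS_EXCLUDED:
        return "Layer 1 (always-exclude)"
    if include_ignored:
        return None  # Layer 4: bypass Layers 2-3 only
    if gitignore_match is not None:
        # Layer 2 replaces Layer 3 whenever a .gitignore is available.
        return ".gitignore" if gitignore_match(dirname) else None
    if dirname in EXTENDED_DEFAULTS or dirname.endswith(".egg-info"):
        return "extended defaults"
    return None
```

Returning the layer label rather than a bare boolean makes it straightforward to count exclusions per layer when reporting.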
When reporting results, explain which exclusion path was used:
"Excluded 3 directories via Layer 1 (always-exclude), 2 via .gitignore"
Categorization Priority
Apply rules in this order (first match wins):
- Directory path: tests/ → Tests, src/ → Source Code, docs/ → Docs, references/ → Docs
- Schema/spec files: schema.json, openapi.yaml → Docs (see patterns.md for full list)
- Filename pattern: test_*.py → Tests, *.config.js → Config
- Extension: .sh → Scripts, .csv → Data, .py → Source Code
- Content analysis (if still ambiguous): Check for test assertions, CLI parsing, etc.
Categorization Algorithm
    function categorize(filePath, content):
        # PHASE 1: Filename + directory rules
        category = byLocationOrExtension(filePath)
        if category != UNKNOWN:
            return (category, "High")

        # PHASE 2: Frontmatter refinement
        fm = extractYAMLFrontmatter(content)
        if fm indicates config:
            return (CONFIG, "Medium")
        if fm indicates ai_tooling:
            return (AI_TOOLING, "Medium")

        # PHASE 3: Content structure analysis
        if looksLikeTest(content):
            return (TESTS, "Medium")
        if looksLikeScript(content):
            return (SCRIPTS, "Medium")
        if looksLikeSource(content):
            return (SOURCE_CODE, "Medium")
        if looksLikeDocs(content):
            return (DOCS, "Medium")
        if looksLikeData(content):
            return (DATA, "Medium")

        # PHASE 4: Keyword detection (fallback)
        kw_category = detectByKeywords(content)
        if kw_category:
            return (kw_category, "Low")

        return (OTHER, "Low")
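The four phases above translate fairly directly to Python. The sketch below is illustrative only: every detector is a deliberately crude, hypothetical stand-in (the real signals are documented in references/patterns.md), and the frontmatter conventions it assumes (a `config` key for Config, `name` plus `description` keys for AI Tooling) are guesses, not documented behavior.

```python
import re


def by_location_or_extension(path):
    """Phase 1 stand-in covering a tiny subset of the path rules."""
    if path.startswith("tests/"):
        return "Tests"
    if path.endswith(".py"):
        return "Source Code"
    return None


def frontmatter_keys(content):
    """Collect top-level keys from a leading '---' block (no YAML parser)."""
    match = re.match(r"---\n(.*?)\n---", content, re.DOTALL)
    if not match:
        return set()
    return {line.split(":", 1)[0].strip()
            for line in match.group(1).splitlines()
            if ":" in line and not line[0].isspace()}


def looks_like_data(content):
    """True when every non-empty row has the same non-zero comma count."""
    rows = [row for row in content.splitlines() if row.strip()]
    commas = {row.count(",") for row in rows}
    return len(rows) > 1 and len(commas) == 1 and commas != {0}


# Phase 3 checks, in the same order as the algorithm above.
STRUCTURE_CHECKS = [
    ("Tests", lambda c: re.search(r"^\s*(def test_|assert\b)", c, re.M)),
    ("Scripts", lambda c: c.startswith("#!")),
    ("Source Code", lambda c: re.search(r"^(def|class|import)\s", c, re.M)),
    ("Docs", lambda c: re.search(r"^#{1,6} \S", c, re.M)),
    ("Data", looks_like_data),
]
# Phase 4 keywords (illustrative only).
KEYWORDS = {"AI Tooling": ("prompt", "agent"), "Config": ("setting",)}


def categorize(path, content=""):
    """Return (category, confidence), trying each phase in order."""
    category = by_location_or_extension(path)           # Phase 1
    if category:
        return category, "High"

    keys = frontmatter_keys(content)                    # Phase 2
    if "config" in keys:
        return "Config", "Medium"
    if {"name", "description"} <= keys:
        return "AI Tooling", "Medium"

    for category, check in STRUCTURE_CHECKS:            # Phase 3
        if check(content):
            return category, "Medium"

    lowered = content.lower()                           # Phase 4
    for category, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return category, "Low"
    return "Other", "Low"
```

Keeping the Phase 3 checks in an ordered list makes the precedence explicit and lets individual heuristics be swapped without touching the control flow.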
Pattern Matching Details
For complete pattern reference, see references/patterns.md:
- Directory patterns (test/, src/, config/, etc.)
- Filename patterns (*.config.js, test_*.py, etc.)
- Frontmatter indicators
- Content structure signals
- Keyword detection rules
Output Format
For each file, provide:
    ### [Filename]
    - **Category**: [Config | Tests | Docs | Scripts | Source Code | Data | AI Tooling | Other]
    - **Confidence**: [High | Medium | Low]
    - **Reasoning**: [Why this category was chosen]
    - **Recommended Location**: [Suggested directory if misfiled, or "Correct" if well-placed]
Confidence Guidelines:
- High (>95%): Strong directory + extension match (Phase 1)
- Medium (>80%): Frontmatter or content structure match (Phases 2-3)
- Low: Keyword match or no clear signals (Phase 4)
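For programmatic use, the per-file report layout above can be produced by a small formatter. This is a sketch with a hypothetical function name (`format_entry`); it also rejects any category or confidence level outside the allowed sets, so an invented category fails loudly instead of reaching the report.

```python
CATEGORIES = ("Config", "Tests", "Docs", "Scripts",
              "Source Code", "Data", "AI Tooling", "Other")
CONFIDENCE_LEVELS = ("High", "Medium", "Low")


def format_entry(filename, category, confidence, reasoning,
                 recommended="Correct"):
    """Render one file's result in the per-file report layout."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    if confidence not in CONFIDENCE_LEVELS:
        raise ValueError(f"unknown confidence: {confidence!r}")
    return "\n".join([
        f"### {filename}",
        f"- **Category**: {category}",
        f"- **Confidence**: {confidence}",
        f"- **Reasoning**: {reasoning}",
        f"- **Recommended Location**: {recommended}",
    ])
```

For example, `format_entry("customers.csv", "Data", "High", "CSV extension", "data/customers.csv")` returns a five-line entry whose last line recommends data/customers.csv.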
Content Analysis Caveats
When using --analyze-content (Phases 2-4):
- False positives from examples: Files containing test keywords as examples (like documentation showing assert or describe()) may be miscategorized as Tests
- Sampling limits: Large files (>5KB) are sampled from the beginning, not fully analyzed
- Binary files: Non-text files are skipped and categorized as Other
- Keyword ambiguity: Common words like "class" or "function" appear in both Source Code and Docs
Recommendation: Use Phase 1 (path-based) categorization for speed and reliability. Enable --analyze-content only for ambiguous files or when you need higher accuracy on edge cases.
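The sampling and binary-file caveats above can be made concrete with a small reader. This is a hedged sketch, not the script's implementation: `read_sample` is a hypothetical helper, the 5 KB limit is taken from the caveat above, and treating any NUL byte as a sign of binary content is an assumed heuristic.

```python
SAMPLE_BYTES = 5 * 1024  # the 5KB sampling limit mentioned in the caveats


def read_sample(path, limit=SAMPLE_BYTES):
    """Return the leading text sample of a file, or None if it looks binary."""
    with open(path, "rb") as handle:
        chunk = handle.read(limit)  # only the beginning is ever inspected
    if b"\x00" in chunk:
        return None  # assumption: a NUL byte marks the file as binary
    # errors="replace" stops a multi-byte character cut at the limit
    # from raising UnicodeDecodeError.
    return chunk.decode("utf-8", errors="replace")
```

A caller would categorize a `None` result as Other and pass any returned text to the content-analysis phases, keeping in mind that signals beyond the first 5 KB are never seen.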
Examples
Example 1: Clear Categorization
User: What category is src/utils/helpers.py?
Response:
    ### src/utils/helpers.py
    - **Category**: Source Code
    - **Confidence**: High
    - **Reasoning**: Located in `src/` directory; `.py` extension; module structure
    - **Recommended Location**: Correct
Example 2: Misfiled Resource
User: Categorize customers.csv in root directory.
Response:
    ### customers.csv
    - **Category**: Data
    - **Confidence**: High
    - **Reasoning**: CSV extension; tabular structure detected
    - **Recommended Location**: `data/customers.csv`
Example 3: Ambiguous File
User: What category is notes.txt?
Response:
    ### notes.txt
    - **Category**: Other
    - **Confidence**: Low
    - **Reasoning**: Prose content; no structural markers; could be Docs if formalized
    - **Recommended Location**: Manual review needed; consider `docs/` if documentation
Example 4: Project Inventory
User: Categorize all files in this project.
Response:
    Config (5):
    - .gitignore
    - pyproject.toml
    - docker-compose.yml

    Source Code (12):
    - src/main.py
    - src/utils/helpers.py

    Tests (4):
    - tests/test_main.py
    - tests/conftest.py

    Docs (2):
    - README.md
    - docs/API.md

    AI Tooling (1):
    - .claude/skills/categorizing-files/SKILL.md

    Other (1):
    - notes.txt (Low confidence; review needed)

    Excluded 2 directories: 1 via Layer 1 (always-exclude), 1 via .gitignore
Optional: Automated Script
For large directories or programmatic use, the bundled script implements this methodology with automatic directory exclusion (skips node_modules/, venv/, .git/, etc.):
    # Basic usage
    python scripts/categorize.py [path]

    # Enable content analysis (Phases 2-4)
    python scripts/categorize.py --analyze-content [path]

    # Include .gitignore-excluded files (bypass Layers 2-3)
    python scripts/categorize.py --include-ignored [path]

    # Include ALL files including node_modules (use with caution)
    python scripts/categorize.py --include-all [path]
For a single file:
    python scripts/categorize.py myfile.py
    # Output: myfile.py: Source Code (High)
For a directory:
    python scripts/categorize.py .
    # Output: Grouped list by category with exclusion summary
See scripts/categorize.py for implementation details and programmatic API.