File Categorization

Category Mapping Rules

Always use exactly the 8 categories (Config, Tests, Docs, Scripts, Source Code, Data, AI Tooling, Other). Do not invent new ones such as "Schema", "Database", or "Reference Data". Map edge cases as follows:

File Type	Category	Reasoning
SQL DDL (CREATE TABLE)	Docs	Documents database structure
SQL DML (INSERT/SELECT)	Data	Contains or queries data
.duckdb, .sqlite files	Data	Database storage
schema.json, openapi.yaml	Docs	Specification/contract files
Shell scripts (.sh)	Scripts	Executable automation
requirements.txt	Config	Dependency configuration
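
The edge-case table can be read as a small lookup. The sketch below is illustrative only: the names `EDGE_CASE_BY_NAME`, `EDGE_CASE_BY_EXTENSION`, and `map_edge_case` are hypothetical and not taken from the bundled script, and checking for DDL before DML in a mixed `.sql` file is an assumption the table does not settle.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical lookup tables mirroring the edge-case table above.
EDGE_CASE_BY_NAME = {
    "schema.json": "Docs",
    "openapi.yaml": "Docs",
    "requirements.txt": "Config",
}
EDGE_CASE_BY_EXTENSION = {
    ".duckdb": "Data",
    ".sqlite": "Data",
    ".sh": "Scripts",
}


def map_edge_case(path: str, content: str = "") -> Optional[str]:
    """Return the category for a listed edge case, or None if no row applies."""
    p = PurePosixPath(path)
    if p.name in EDGE_CASE_BY_NAME:
        return EDGE_CASE_BY_NAME[p.name]
    suffix = p.suffix.lower()
    if suffix in EDGE_CASE_BY_EXTENSION:
        return EDGE_CASE_BY_EXTENSION[suffix]
    if suffix == ".sql":
        upper = content.upper()
        # Assumption: DDL is checked first, so a mixed file counts as Docs.
        if "CREATE TABLE" in upper:
            return "Docs"  # documents database structure
        if "INSERT" in upper or "SELECT" in upper:
            return "Data"  # contains or queries data
    return None
```

Files that match no row return `None`, leaving the decision to the general priority rules.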

Directory Exclusion

When scanning directories, the script applies layered exclusion so that tokens are not spent on dependency, version-control, cache, and build directories:

Layer 1: Always Excluded (non-negotiable)

•node_modules/, bower_components/, jspm_packages/
•.git/, .svn/, .hg/
•__pycache__/, .pytest_cache/, .tox/, .mypy_cache/
•venv/, .venv/

Layer 2: .gitignore (if present and parseable)

•Respects project-specific exclusions
•Requires pathspec library; warns if unavailable

Layer 3: Extended Defaults (fallback when no .gitignore)

•dist/, build/, out/, _build/, target/
•vendor/, coverage/, .nyc_output/, htmlcov/
•env/, *.egg-info/

Layer 4: Escape Hatches

•--include-ignored: Bypass Layers 2-3 (but NOT Layer 1)
•--include-all: Bypass ALL layers (use with extreme caution)
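
How the four layers interact can be sketched as a single decision function. This is a minimal sketch and not the bundled script's code: `exclusion_layer` and its parameters are hypothetical, `.gitignore` matching is passed in as a callable so the example does not depend on pathspec, and the returned labels are illustrative.

```python
from fnmatch import fnmatchcase
from typing import Callable, Optional

LAYER1 = {
    "node_modules", "bower_components", "jspm_packages",
    ".git", ".svn", ".hg",
    "__pycache__", ".pytest_cache", ".tox", ".mypy_cache",
    "venv", ".venv",
}
LAYER3 = [
    "dist", "build", "out", "_build", "target",
    "vendor", "coverage", ".nyc_output", "htmlcov",
    "env", "*.egg-info",
]


def exclusion_layer(
    dir_name: str,
    gitignore_match: Optional[Callable[[str], bool]] = None,
    include_ignored: bool = False,
    include_all: bool = False,
) -> Optional[str]:
    """Return the layer that excludes dir_name, or None if it is kept.

    gitignore_match is None when no usable .gitignore is available.
    """
    if include_all:
        return None  # --include-all bypasses every layer
    if dir_name in LAYER1:
        return "Layer 1 (always-exclude)"
    if include_ignored:
        return None  # --include-ignored bypasses Layers 2-3 only
    if gitignore_match is not None:
        return ".gitignore" if gitignore_match(dir_name) else None
    # Layer 3 is consulted only as a fallback when there is no .gitignore.
    if any(fnmatchcase(dir_name, pattern) for pattern in LAYER3):
        return "Layer 3 (extended defaults)"
    return None
```

For example, `exclusion_layer("dist")` reports Layer 3 when there is no `.gitignore`, while the same directory is kept when a `.gitignore` exists and does not list it.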

Important: .env as a FILE is categorized as Config. .env as a DIRECTORY is excluded.

When reporting results, explain which exclusion path was used:

"Excluded 3 directories via Layer 1 (always-exclude), 2 via .gitignore"

Categorization Priority

Apply rules in this order (first match wins):

•Directory path: tests/ → Tests, src/ → Source Code, docs/ → Docs, references/ → Docs
•Schema/spec files: schema.json, openapi.yaml → Docs (see references/patterns.md for the full list)
•Filename pattern: test_*.py → Tests, *.config.js → Config
•Extension: .sh → Scripts, .csv → Data, .py → Source Code
•Content analysis (if still ambiguous): Check for test assertions, CLI parsing, etc.
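
The first four rules depend only on the path, so "first match wins" can be sketched as an ordered chain of lookups. The function name `categorize_by_path` and the table names are hypothetical, the tables hold only the examples listed above rather than the full pattern set, and scanning directory components from the outermost inward is an assumption.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Optional

DIRECTORY_RULES = {"tests": "Tests", "src": "Source Code", "docs": "Docs", "references": "Docs"}
SPEC_FILES = {"schema.json": "Docs", "openapi.yaml": "Docs"}
FILENAME_PATTERNS = [("test_*.py", "Tests"), ("*.config.js", "Config")]
EXTENSION_RULES = {".sh": "Scripts", ".csv": "Data", ".py": "Source Code"}


def categorize_by_path(path: str) -> Optional[str]:
    """Apply the path-based rules in priority order; None means still ambiguous."""
    p = PurePosixPath(path)
    # 1. Directory path (assumption: outermost matching directory wins).
    for part in p.parts[:-1]:
        if part in DIRECTORY_RULES:
            return DIRECTORY_RULES[part]
    # 2. Schema/spec files.
    if p.name in SPEC_FILES:
        return SPEC_FILES[p.name]
    # 3. Filename pattern.
    for pattern, category in FILENAME_PATTERNS:
        if fnmatchcase(p.name, pattern):
            return category
    # 4. Extension.
    return EXTENSION_RULES.get(p.suffix.lower())
```

One consequence of the ordering is that `src/test_x.py` resolves to Source Code, because the directory rule matches before the `test_*.py` filename pattern is consulted.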

Categorization Algorithm

text

function categorize(filePath, content):

  # PHASE 1: Filename + directory rules
  category = byLocationOrExtension(filePath)
  if category != UNKNOWN: return (category, "High")

  # PHASE 2: Frontmatter refinement
  fm = extractYAMLFrontmatter(content)
  if fm indicates config: return (CONFIG, "Medium")
  if fm indicates ai_tooling: return (AI_TOOLING, "Medium")

  # PHASE 3: Content structure analysis
  if looksLikeTest(content): return (TESTS, "Medium")
  if looksLikeScript(content): return (SCRIPTS, "Medium")
  if looksLikeSource(content): return (SOURCE_CODE, "Medium")
  if looksLikeDocs(content): return (DOCS, "Medium")
  if looksLikeData(content): return (DATA, "Medium")

  # PHASE 4: Keyword detection (fallback)
  kw_category = detectByKeywords(content)
  if kw_category: return (kw_category, "Low")

  return (OTHER, "Low")

Pattern Matching Details

For complete pattern reference, see references/patterns.md:

•Directory patterns (test/, src/, config/, etc.)
•Filename patterns (*.config.js, test_*.py, etc.)
•Frontmatter indicators
•Content structure signals
•Keyword detection rules

Output Format

For each file, provide:

markdown

### [Filename]

- **Category**: [Config | Tests | Docs | Scripts | Source Code | Data | AI Tooling | Other]
- **Confidence**: [High | Medium | Low]
- **Reasoning**: [Why this category was chosen]
- **Recommended Location**: [Suggested directory if misfiled, or "Correct" if well-placed]

Confidence Guidelines:

•High (>95%): Strong directory + extension match (Phase 1)
•Medium (>80%): Frontmatter or content structure match (Phases 2-3)
•Low: Keyword match or no clear signals (Phase 4)
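
The link between the deciding phase and the reported confidence can be made explicit when rendering an entry. The following is a small sketch; `CONFIDENCE_BY_PHASE` and `format_entry` are hypothetical names, and the layout follows the template above.

```python
# Confidence label implied by the phase that produced the category.
CONFIDENCE_BY_PHASE = {1: "High", 2: "Medium", 3: "Medium", 4: "Low"}


def format_entry(filename, category, phase, reasoning, recommended_location="Correct"):
    """Render one per-file result in the markdown layout of the template."""
    confidence = CONFIDENCE_BY_PHASE[phase]
    return "\n".join([
        f"### {filename}",
        "",
        f"- **Category**: {category}",
        f"- **Confidence**: {confidence}",
        f"- **Reasoning**: {reasoning}",
        f"- **Recommended Location**: {recommended_location}",
    ])


print(format_entry(
    "src/utils/helpers.py", "Source Code", 1,
    "Located in `src/` directory; `.py` extension",
))
```

A file that falls through every phase is reported as Other with Low confidence, so passing phase 4 covers that case as well.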

Content Analysis Caveats

When using --analyze-content (Phases 2-4):

•False positives from examples: Files containing test keywords as examples (like documentation showing assert or describe()) may be miscategorized as Tests
•Sampling limits: Large files (>5KB) are sampled from the beginning, not fully analyzed
•Binary files: Non-text files are skipped and categorized as Other
•Keyword ambiguity: Common words like "class" or "function" appear in both Source Code and Docs
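
The sampling and binary-file caveats can be illustrated with a short helper. This is a sketch under stated assumptions: `sample_text` is a hypothetical name, the 5KB limit is taken as 5 × 1024 bytes, and treating a NUL byte in the sample as the sign of a binary file is a common heuristic rather than the script's confirmed behavior.

```python
from typing import Optional

SAMPLE_BYTES = 5 * 1024  # assumed meaning of the ">5KB" sampling limit


def sample_text(raw: bytes) -> Optional[str]:
    """Return the leading sample as text, or None if the data looks binary."""
    chunk = raw[:SAMPLE_BYTES]  # sampled from the beginning only
    if b"\x00" in chunk:
        return None  # caller skips analysis and categorizes the file as Other
    # errors="replace" tolerates a multi-byte character cut off by the sample boundary.
    return chunk.decode("utf-8", errors="replace")
```

Anything past the first `SAMPLE_BYTES` bytes is never seen by the content checks, which is why a large file whose distinguishing content appears late can be miscategorized.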

Recommendation: Use Phase 1 (path-based) categorization for speed and reliability. Enable --analyze-content only for ambiguous files or when you need higher accuracy on edge cases.

Examples

Example 1: Clear Categorization

User: What category is src/utils/helpers.py?

Response:

markdown

### src/utils/helpers.py
- **Category**: Source Code
- **Confidence**: High
- **Reasoning**: Located in `src/` directory; `.py` extension; module structure
- **Recommended Location**: Correct

Example 2: Misfiled Resource

User: Categorize customers.csv in root directory.

Response:

markdown

### customers.csv
- **Category**: Data
- **Confidence**: High
- **Reasoning**: CSV extension; tabular structure detected
- **Recommended Location**: `data/customers.csv`

Example 3: Ambiguous File

User: What category is notes.txt?

Response:

markdown

### notes.txt
- **Category**: Other
- **Confidence**: Low
- **Reasoning**: Prose content; no structural markers; could be Docs if formalized
- **Recommended Location**: Manual review needed — consider `docs/` if documentation

Example 4: Project Inventory

User: Categorize all files in this project.

Response:

code

Config (3):
  - .gitignore
  - pyproject.toml
  - docker-compose.yml

Source Code (2):
  - src/main.py
  - src/utils/helpers.py

Tests (2):
  - tests/test_main.py
  - tests/conftest.py

Docs (2):
  - README.md
  - docs/API.md

AI Tooling (1):
  - .claude/skills/categorizing-files/SKILL.md

Other (1):
  - notes.txt (Low confidence — review needed)

Excluded 2 directories: 1 via Layer 1 (always-exclude), 1 via .gitignore
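
A grouped inventory like the one above can be produced from per-file results plus the list of exclusion layers that fired. This is a minimal sketch: `format_inventory` and `CATEGORY_ORDER` are hypothetical, the group order follows the example (with Scripts and Data placed by assumption), and the low-confidence note uses a colon where the example uses a dash.

```python
from collections import Counter, defaultdict

# Display order (assumed); results in any other category are not printed.
CATEGORY_ORDER = [
    "Config", "Source Code", "Tests", "Docs",
    "Scripts", "Data", "AI Tooling", "Other",
]


def format_inventory(results, excluded_layers=()):
    """results: iterable of (path, category, confidence) tuples.
    excluded_layers: one layer label per excluded directory."""
    grouped = defaultdict(list)
    for path, category, confidence in results:
        note = " (Low confidence: review needed)" if confidence == "Low" else ""
        grouped[category].append(f"  - {path}{note}")
    blocks = [
        "\n".join([f"{category} ({len(grouped[category])}):", *grouped[category]])
        for category in CATEGORY_ORDER
        if category in grouped
    ]
    if excluded_layers:
        counts = Counter(excluded_layers)
        total = sum(counts.values())
        noun = "directory" if total == 1 else "directories"
        breakdown = ", ".join(f"{n} via {layer}" for layer, n in counts.items())
        blocks.append(f"Excluded {total} {noun}: {breakdown}")
    return "\n\n".join(blocks)


print(format_inventory(
    [("src/main.py", "Source Code", "High"), ("notes.txt", "Other", "Low")],
    ["Layer 1 (always-exclude)", ".gitignore"],
))
```

Deriving each count from the grouped list keeps the number in the group header equal to the number of files printed beneath it.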

Optional: Automated Script

For large directories or programmatic use, the bundled script implements this methodology with automatic directory exclusion (skips node_modules/, venv/, .git/, etc.):

bash

# Basic usage
python scripts/categorize.py [path]

# Enable content analysis (Phases 2-4)
python scripts/categorize.py --analyze-content [path]

# Include .gitignore-excluded files (bypass Layers 2-3)
python scripts/categorize.py --include-ignored [path]

# Include ALL files including node_modules (use with caution)
python scripts/categorize.py --include-all [path]

For a single file:

bash

python scripts/categorize.py myfile.py
# Output: myfile.py: Source Code (High)

For a directory:

bash

python scripts/categorize.py .
# Output: Grouped list by category with exclusion summary

See scripts/categorize.py for implementation details and programmatic API.

Categories

Category	Description
Config	Configuration files for tools, environments, and build systems
Tests	Test files, fixtures, and testing utilities
Docs	Documentation, READMEs, and guides
Scripts	Standalone executable scripts and automation
Source Code	Core application/library source files
Data	Data files, datasets, and static assets
AI Tooling	AI/ML configs, prompts, and agent definitions
Other	Files that don't fit other categories (fallback)
