Data Cleanup Skill

Run data enrichment, cleanup, and migration pipelines with built-in validation.

When to Use

•"Clean up the data"
•"Run the enrichment pipeline"
•"Fix media tags" / "Update story embeddings"
•Any bulk data operation on Supabase
•Headless automation of data pipelines

Workflow

1. Pre-Flight Assessment

Before running any data operation:

bash

# Check what scripts are available
ls web-platform/scripts/*.ts

# Check database state via Supabase MCP
# Use mcp__supabase__execute_sql to query current counts

Understand what you're working with:

•How many records will be affected?
•Is this reversible? (If not, confirm with the user before proceeding.)
•Are there dependent tables that will be affected?
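
The questions above can be turned into an explicit gate. The following is a minimal TypeScript sketch, not part of the existing scripts: the `PreFlight` shape and the `requiredChecks` function are hypothetical names, and the 100-record threshold is taken from the Safety Rules section.

```typescript
// Hypothetical record of the answers to the three pre-flight questions.
interface PreFlight {
  affectedRecords: number;
  reversible: boolean;
  dependentTables: string[];
}

// Returns the checks that must be cleared before the operation may run.
// An empty list means the operation can proceed without further review.
function requiredChecks(p: PreFlight): string[] {
  const checks: string[] = [];
  if (!p.reversible) {
    checks.push("confirm irreversible operation with the user");
  }
  if (p.affectedRecords > 100) {
    checks.push("show a sample of 5 records");
  }
  for (const table of p.dependentTables) {
    checks.push("review impact on dependent table: " + table);
  }
  return checks;
}
```

An operation is only started once every returned check has been resolved.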

2. Snapshot Before Changes

Always capture the "before" state and keep the output so it can be compared against the results in step 4:

sql

-- Example: count records before cleanup
-- count(column) counts only rows where that column is non-NULL
SELECT count(*) AS total,
       count(tags) AS tagged,
       count(embeddings) AS embedded
FROM stories;

3. Execute Pipeline

Run the appropriate script. Available pipelines (the pre-flight listing suggests these paths are relative to `web-platform/`):

Script	Purpose
`scripts/auto-assign-stories.ts`	Auto-assign stories to services
`scripts/check-embeddings.ts`	Verify embedding coverage
`scripts/check-knowledge-entries.ts`	Validate knowledge base
`scripts/enrich-media-tags.ts`	AI-powered media tagging
`scripts/fix-image-urls.ts`	Fix broken image URLs
`scripts/fix-event-tags.ts`	Clean up event tags
`scripts/add-content-type-tags.ts`	Add content type classifications
`scripts/generate-stories-from-interviews.ts`	Generate stories from interview transcripts
`scripts/link-stories-and-quotes.ts`	Link stories with their quotes
`scripts/import-picc-website-gallery.ts`	Import gallery from PICC website
`scripts/import-board-photos.ts`	Import board member photos

4. Validate Results

After running a pipeline, verify that the changes took effect by re-running the snapshot query:

sql

-- Example: compare after cleanup
-- count(column) counts only rows where that column is non-NULL
SELECT count(*) AS total,
       count(tags) AS tagged,
       count(embeddings) AS embedded
FROM stories;

Compare with the "before" snapshot. Report:

•Records processed
•Records changed
•Records skipped (and why)
•Any errors encountered
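
One way to turn the two snapshots into part of that report is sketched below. It assumes the three columns of the snapshot query have already been read into plain objects; `Snapshot`, `diffSnapshots`, and `summarize` are illustrative names, not existing helpers.

```typescript
// Mirrors the columns returned by the snapshot query.
interface Snapshot {
  total: number;
  tagged: number;
  embedded: number;
}

// Per-column change between two snapshots; positive means the count grew.
function diffSnapshots(before: Snapshot, after: Snapshot): Snapshot {
  return {
    total: after.total - before.total,
    tagged: after.tagged - before.tagged,
    embedded: after.embedded - before.embedded,
  };
}

// Formats a change with an explicit sign, e.g. "+15" or "-3".
function formatDelta(n: number): string {
  return (n >= 0 ? "+" : "") + n;
}

// One-line summary: the "after" value for each column with its change.
function summarize(before: Snapshot, after: Snapshot): string {
  const d = diffSnapshots(before, after);
  return (
    "total " + after.total + " (" + formatDelta(d.total) + "), " +
    "tagged " + after.tagged + " (" + formatDelta(d.tagged) + "), " +
    "embedded " + after.embedded + " (" + formatDelta(d.embedded) + ")"
  );
}
```

The diff only shows net change per column. Counts of records processed or skipped, and the reasons for skipping, still have to come from the script's own output.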

5. Self-Correcting Loop

If the pipeline produces errors:

•Read the error output carefully
•Identify the root cause (missing column? wrong data type? API rate limit?)
•Fix the underlying issue
•Re-run ONLY the failed portion
•Validate again

Do NOT re-run the entire pipeline if only a subset failed.
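
The loop above can be sketched as a batch runner that records failures per record and retries only those. This is a simplified illustration: `runBatch`, `retryFailed`, and `BatchResult` are hypothetical names, and the handler is synchronous for brevity, whereas a real pipeline script would most likely be asynchronous.

```typescript
interface BatchResult {
  succeeded: string[];
  failed: { id: string; error: string }[];
}

// Runs `handler` on each id and records failures instead of aborting the batch.
function runBatch(ids: string[], handler: (id: string) => void): BatchResult {
  const result: BatchResult = { succeeded: [], failed: [] };
  for (const id of ids) {
    try {
      handler(id);
      result.succeeded.push(id);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      result.failed.push({ id: id, error: message });
    }
  }
  return result;
}

// Re-runs only the ids that failed previously and merges the outcome
// with the earlier successes. Records that already succeeded are untouched.
function retryFailed(previous: BatchResult, handler: (id: string) => void): BatchResult {
  const failedIds = previous.failed.map(function (f) { return f.id; });
  const retry = runBatch(failedIds, handler);
  return {
    succeeded: previous.succeeded.concat(retry.succeeded),
    failed: retry.failed,
  };
}
```

The `error` strings collected in `failed` are what the root-cause step works from. After the underlying issue is fixed, `retryFailed` touches nothing that has already succeeded.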

Headless Automation

For unattended data operations, use Claude Code headless mode:

bash

# Run all pending data enrichment
claude -p "Run data-cleanup skill: check embeddings, fix image URLs, and enrich media tags. Report results." \
  --allowedTools "Bash,Read,Write,Grep,Glob,mcp__supabase__execute_sql"

# Pre-report data validation
claude -p "Run data-validate skill for year 2025, then run data-cleanup for any gaps found." \
  --allowedTools "Bash,Read,Grep,Glob,mcp__supabase__execute_sql"

Safety Rules

•NEVER delete data without explicit user confirmation
•ALWAYS snapshot before bulk operations
•For operations affecting more than 100 records, show the user a sample of 5 affected records before proceeding
•Rate-limit external API calls (AI tagging, image processing) to stay within provider quotas
•If a script fails partway through, record which records were processed so the run can resume from that point
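
The last two rules can be illustrated with a small sketch. It is an assumption-laden example rather than code from the existing scripts: `createThrottle` and `pendingIds` are hypothetical helpers, and the throttle takes the current time as an argument so the spacing logic can be checked without real timers.

```typescript
// Returns a function that, given the current time in milliseconds, reports how
// long to wait before the next external call and reserves that time slot.
// Calls end up spaced at least `minIntervalMs` apart.
function createThrottle(minIntervalMs: number): (nowMs: number) => number {
  let nextAllowedAt = 0;
  return function reserve(nowMs: number): number {
    const startAt = Math.max(nowMs, nextAllowedAt);
    nextAllowedAt = startAt + minIntervalMs;
    return startAt - nowMs;
  };
}

// Ids still to process after a partial run, preserving the original order.
function pendingIds(allIds: string[], processedIds: string[]): string[] {
  const done = new Set(processedIds);
  return allIds.filter(function (id) { return !done.has(id); });
}
```

In a script, the value returned by the throttle for `Date.now()` would be passed to `setTimeout` before each API call. After a failure, the list of processed ids would be fed to `pendingIds` to resume where the run stopped.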