Python Filename Sanitization with Fallback
Purpose
This skill provides expertise in implementing robust filename sanitization that goes beyond removing invalid characters to ensure the resulting filename meets downstream validation requirements. It addresses the critical gap where user-provided titles containing only special characters (emoji, ellipsis, whitespace) produce invalid filenames after standard sanitization.
Core Problem
Conventional Approach Limitation: Standard filename sanitization only removes invalid filesystem characters:
# Incomplete sanitization - can produce invalid names
filename = title.replace('/', '-').replace(':', ' -')
filename = re.sub(r'[<>"\\\|?*]', '', filename)
Real-World Failure Cases (from Readwise MCP Server, 2026-01-24):
- •Title:
"…"→ Filename:"….md"→ qmd indexer error - •Title:
"🍿🍿"→ Filename:"🍿🍿.md"→ qmd indexer error - •Title:
""→ Filename:".md"→ qmd indexer error
Root Issue: Many indexers and tools require filenames to contain at least one alphanumeric character (/[\p{L}\p{N}]/u). Titles with only special characters fail this validation.
The Validation-Aware Pattern
Core Principle
Validate output, not just transform input. After sanitization, verify the filename meets downstream requirements; fallback to metadata-based naming when validation fails.
Implementation Pattern
def sanitize_filename(title: str, doc: Optional[Dict] = None) -> str:
"""
Sanitize title for filename with fallback for invalid names.
Args:
title: The document title to sanitize
doc: Optional document dict for fallback metadata (author, saved_at, category)
Returns:
Sanitized filename ending in .md, guaranteed to have alphanumeric content
"""
# Step 1: Standard character sanitization
filename = title.replace('/', '-').replace(':', ' -')
filename = re.sub(r'[<>"\\\|?*]', '', filename)
filename = filename[:100].strip()
# Step 2: Validate output meets requirements
if not any(c.isalnum() for c in filename):
# Step 3: Intelligent fallback using available metadata
if doc:
author = doc.get('author', 'Unknown')
# Sanitize author name (may also have special chars)
author = re.sub(r'[<>"\\\|?*/:]', '', author)[:30].strip()
saved_at = doc.get('saved_at', '')
date_str = saved_at[:10] if saved_at else datetime.now().strftime('%Y-%m-%d')
# Use category for context
category = doc.get('category', 'Document')
category_label = 'Tweet' if category == 'tweet' else category.capitalize()
filename = f"{category_label} by {author} - {date_str}"
else:
# Generic timestamp-based fallback
filename = f"Untitled - {datetime.now().strftime('%Y-%m-%d-%H%M%S')}"
return filename + ".md"
Key Components
1. Validation Check:
if not any(c.isalnum() for c in filename):
Catches: empty strings, emoji-only, special chars only, whitespace only
2. Metadata-Based Fallback:
filename = f"{category_label} by {author} - {date_str}"
Benefits:
- •Descriptive: Shows document type and origin
- •Unique: Date ensures no collisions
- •Searchable: Author name aids discovery
- •Valid: Guaranteed alphanumeric content
3. Defensive Author Sanitization:
author = re.sub(r'[<>"\\\|?*/:]', '', author)[:30].strip()
Authors may also contain special characters; sanitize and truncate them too.
Real-World Application
Context: Readwise MCP Server (December 2025 Backfill)
Scenario: Importing 985 tweets from Readwise API Challenge: Some tweets have titles with only emoji or special characters Impact: 3 files created with invalid names broke qmd indexer
Results:
| Input Title | Standard Sanitization | Enhanced Sanitization |
|---|---|---|
"…" | "….md" ❌ | "Tweet by Take Action! - 2025-12-08.md" ✓ |
"🍿🍿" | "🍿🍿.md" ❌ | "Tweet by Elon Musk - 2025-12-06.md" ✓ |
"" | ".md" ❌ | "Tweet by x.com - 2025-12-05.md" ✓ |
"Normal Title" | "Normal Title.md" ✓ | "Normal Title.md" ✓ |
Outcome:
- •All 985 documents imported successfully with enhanced pattern
- •Zero indexing errors
- •Descriptive fallback names improved organization
- •100% backward compatible (existing valid filenames unchanged)
Application Points
Update all file creation callsites:
# Before (incomplete)
filename = sanitize_filename(doc["title"])
# After (defensive)
filename = sanitize_filename(doc.get("title", ""), doc)
Critical: Pass full document dict to enable metadata fallback.
Testing Strategy
Essential Test Cases
def test_ellipsis_only_title():
"""Test emoji/special char-only titles use fallback"""
doc = {"title": "…", "author": "User", "saved_at": "2025-12-08", "category": "tweet"}
result = sanitize_filename("…", doc)
assert result == "Tweet by User - 2025-12-08.md"
assert any(c.isalnum() for c in result[:-3]) # Verify validation
def test_empty_title():
"""Test empty titles use fallback"""
doc = {"title": "", "author": "User", "saved_at": "2025-12-05", "category": "tweet"}
result = sanitize_filename("", doc)
assert any(c.isalnum() for c in result[:-3])
def test_normal_title():
"""Test normal titles are unchanged"""
result = sanitize_filename("Normal Title", None)
assert result == "Normal Title.md"
def test_mixed_valid_invalid():
"""Test titles with some alphanumeric use original"""
result = sanitize_filename("Hello 🍿 World", None)
assert "Hello" in result and "World" in result
assert any(c.isalnum() for c in result[:-3])
def test_no_doc_fallback():
"""Test fallback works without metadata"""
result = sanitize_filename("…", None)
assert result.startswith("Untitled - ")
assert any(c.isalnum() for c in result[:-3])
Regression Test
def test_validation_regression():
"""Prevent regression of indexer validation errors"""
problematic_cases = [
("…", {"author": "User1", "saved_at": "2025-12-08", "category": "tweet"}),
("🍿🍿", {"author": "User2", "saved_at": "2025-12-06", "category": "tweet"}),
("", {"author": "User3", "saved_at": "2025-12-05", "category": "tweet"}),
(" ", {"author": "User4", "saved_at": "2025-12-04", "category": "tweet"}),
("!@#$%", {"author": "User5", "saved_at": "2025-12-03", "category": "tweet"}),
]
for title, doc in problematic_cases:
result = sanitize_filename(title, doc)
filename_without_ext = result[:-3]
has_alnum = any(c.isalnum() for c in filename_without_ext)
assert has_alnum, f"Title '{title}' produced invalid filename '{result}'"
Edge Cases
1. Author with Special Characters
Author names may also contain invalid characters:
author = re.sub(r'[<>"\\\|?*/:]', '', author)[:30].strip()
Example:
- •Author:
"User/Name:Test"→"UserNameTest"
2. Very Long Author Names
Truncate to prevent excessively long filenames:
author[:30].strip()
3. Category-Specific Labels
Use meaningful labels based on document type:
category_label = 'Tweet' if category == 'tweet' else category.capitalize()
Examples:
- •
category="tweet"→"Tweet by Author - 2025-12-08.md" - •
category="article"→"Article by Author - 2025-12-03.md" - •
category="pdf"→"Pdf by Author - 2025-12-10.md"
4. Mixed Valid/Invalid Characters
If title has SOME alphanumeric content, use the original:
# "Hello 🍿 World" has valid chars, don't use fallback
if any(c.isalnum() for c in filename):
return filename + ".md" # Keep original
Defensive Programming Principles
Principle 1: Validate Output, Not Just Input
Don't assume sanitization produces valid names; verify requirements are met.
Principle 2: Graceful Degradation
When primary approach fails, fallback to alternative strategy using available context.
Principle 3: Metadata Enrichment
Use document metadata to create meaningful fallback names rather than generic placeholders.
Principle 4: Idempotency
Same input with same metadata always produces same output; deterministic behavior aids debugging.
Principle 5: Backward Compatibility
Enhanced validation shouldn't break existing valid filenames; only activate for edge cases.
When to Apply This Pattern
✓ Apply when:
- •Generating filenames from user input (titles, names, descriptions)
- •Working with API data (tweets, articles, documents)
- •Files will be indexed by tools with validation requirements
- •Filesystem compatibility across platforms is critical
- •Content may contain emoji, special characters, or non-Latin scripts
✗ Not needed when:
- •Filenames are programmatically generated (UUIDs, hashes)
- •User input is already validated/constrained
- •Files are temporary and won't be indexed
- •System tolerates any filename format
Integration with Other Skills
Defensive Programming Patterns
- •Input validation alone is insufficient; validate output
- •Multiple validation layers (character removal → alphanumeric check → fallback)
- •Fail gracefully with context-aware alternatives
Test-Driven Development
- •Write tests for edge cases BEFORE encountering them in production
- •Regression tests prevent reintroduction of fixed bugs
- •Property-based testing: "All filenames must have alphanumeric chars"
API Integration Best Practices
- •Don't trust external data format assumptions
- •Metadata-driven fallbacks for robustness
- •Document edge cases discovered in production
Cross-Domain Pattern Recognition
- •Similar validation patterns apply across domains (URLs, identifiers, paths)
- •Fallback strategy: primary approach → metadata-based → generic timestamp
- •Universal principle: Validate requirements, not just constraints
Common Mistakes
Mistake 1: Assuming Sanitization Produces Valid Names
Wrong:
filename = remove_invalid_chars(title) # Might be empty! save_file(filename) # Could fail
Right:
filename = sanitize_filename(title, doc) # Validates + fallback save_file(filename) # Guaranteed valid
Mistake 2: Generic Fallbacks Without Context
Wrong:
if not valid(filename):
filename = "untitled.md" # All failures get same name!
Right:
if not valid(filename):
filename = f"Tweet by {author} - {date}.md" # Unique + descriptive
Mistake 3: Not Sanitizing Metadata
Wrong:
filename = f"Tweet by {author} - {date}.md" # Author might have invalid chars!
Right:
author = sanitize_author(author) # Clean metadata too
filename = f"Tweet by {author} - {date}.md"
Mistake 4: Overly Complex Validation
Wrong:
if not re.match(r'^[a-zA-Z0-9][a-zA-Z0-9\s\-_\.]*[a-zA-Z0-9]$', filename): # Too strict!
Right:
if not any(c.isalnum() for c in filename): # Simple, effective validation
Strategic Value
Production Robustness
- •Prevents runtime failures from unexpected input
- •Reduces manual intervention (no file renaming needed)
- •Improves user experience (meaningful names vs. errors)
Operational Efficiency
- •Zero manual fixes required during 985-document import
- •Automated recovery from edge cases
- •Self-documenting fallback names aid troubleshooting
Knowledge Transfer
- •Pattern applies beyond Python (JavaScript, Ruby, Go, etc.)
- •Defensive strategy works across domains (API design, database schemas, URLs)
- •Validation + fallback principle is universal
Source Documentation
Origin: Readwise MCP Server Enhancement (2026-01-24)
- •Problem: qmd indexer validation errors during December 2025 backfill
- •Impact: 3 invalid filenames out of 985 documents imported
- •Solution: Enhanced sanitization with validation + metadata fallback
- •Result: 100% success rate, zero manual intervention
- •Repository:
/Users/ngpestelos/src/readwise-mcp-server - •Commit:
9c361e8 - •Tests: 13 new tests in
TestFilenameQmdValidationclass - •Documentation:
0 Projects/Readwise MCP Enhancement - Filename Sanitization.md
When to Apply This Skill
Use Python Filename Sanitization with Fallback expertise when:
- •Building systems that import user-generated content
- •Integrating with APIs that provide unpredictable text data
- •Working with indexing tools that validate filename format
- •Implementing document management systems
- •Creating robust file-based workflows
- •Ensuring cross-platform filesystem compatibility
- •Preventing production failures from edge-case input
- •Designing defensive, self-healing systems
Quality Note: This pattern demonstrates defensive programming in practice—going beyond "remove bad characters" to "ensure valid output with graceful degradation." The metadata-enriched fallback strategy transforms potential failures into opportunities for improved organization and user experience.