PDF Translation Skill
Translate PDF documents between any language pair with optimized support for academic papers and documents with complex layouts.
When to Use This Skill
Use this skill when:
- •User wants to translate a PDF document to another language
- •User has academic papers, research documents, or technical manuals to translate
- •User needs to preserve document structure (tables, headings, lists) during translation
- •User wants both Markdown and PDF output formats
- •User mentions translating documents with tables, figures, or complex layouts
Usage
/pdf-translator <pdf_path> [options]
Arguments
- •
<pdf_path>: PDF file or directory containing PDFs
Options
| Option | Description | Default |
|---|---|---|
--source-lang | Source language code (auto for detection) | auto |
--target-lang | Target language code | ko |
--output-format | Output format (markdown/pdf/both) | both |
--output-dir | Output directory | ./translated |
--parallel | Concurrent agents | 5 |
--dict | Custom dictionary (JSON) | none |
--high-quality | Use Opus model for translation | false |
--academic | Academic document mode | false |
--term-style | Term annotation style (parenthesis/footnote/inline) | parenthesis |
--first-occurrence | Annotate terms only on first occurrence | true |
--describe-images | Add AI-generated image descriptions | false |
Language Codes
ja (Japanese), en (English), ko (Korean), zh (Chinese), es (Spanish), fr (French), de (German), ru (Russian), ar (Arabic), he (Hebrew), or any ISO 639-1 code.
Examples
# Basic translation (English PDF to Korean) /pdf-translator "/docs/manual.pdf" # Japanese academic paper to Korean with terminology annotations /pdf-translator "/papers/research.pdf" --source-lang ja --academic # High-quality translation using Opus model /pdf-translator "/papers/important.pdf" --high-quality # Academic mode with footnote-style term annotations /pdf-translator "/papers/thesis.pdf" --academic --term-style footnote # Batch translation of a directory /pdf-translator "/docs/" --target-lang ko --parallel 10 # Markdown output only /pdf-translator "/books/novel.pdf" --output-format markdown # Arabic RTL document to English /pdf-translator "/docs/arabic.pdf" --source-lang ar --target-lang en
Architecture
flowchart LR
PDF[PDF] --> Extract[extract_to_markdown.py]
Extract --> Source[source.md + images/]
Source --> Split[split_markdown.py]
Split --> Sections[sections/]
Sections --> Translate[Translator Agents]
Translate --> Merged[translated.md]
Merged --> GenPDF[generate_pdf.py]
GenPDF --> Output[translated.pdf]
Key Features:
- •Full document context preserved during translation
- •Clean Markdown intermediate format (human-readable, editable)
- •Section-based parallel translation for large documents
- •pandoc + weasyprint for high-quality PDF output
Execution Workflow
Phase 0: Environment Setup
bash scripts/setup_env.sh
This installs pandoc and creates .venv/ with Python dependencies (pymupdf, pdfplumber, weasyprint).
PYTHON=".venv/bin/python"
Phase 1: Extract PDF to Markdown
WORK_DIR="/tmp/pdf_translate_$(date +%s)"
$PYTHON scripts/extract_to_markdown.py \
--pdf "{PDF_PATH}" \
--output-dir "$WORK_DIR" \
--source-lang en \
--target-lang ko
Output:
- •
$WORK_DIR/source.md- Original Markdown (preserves structure) - •
$WORK_DIR/images/- Extracted images - •
$WORK_DIR/metadata.json- Document metadata
Phase 2: Split Markdown (if needed)
For large documents (>6000 tokens), split into sections:
$PYTHON scripts/split_markdown.py \ --input "$WORK_DIR/source.md" \ --output-dir "$WORK_DIR/sections" \ --max-tokens 6000
Output:
- •
$WORK_DIR/sections/section_001.md - •
$WORK_DIR/sections/section_002.md - •
$WORK_DIR/sections/sections_manifest.json
Phase 3: Translation
Translate each section using the Markdown translator guide (references/translator_markdown.md).
Option A: Direct Translation (Small documents)
The orchestrator translates the Markdown directly:
- •Read
source.md(or each section) - •Translate following
translator_markdown.mdguidelines - •Write to
translated.md
Option B: Parallel Translation (Large documents)
Spawn Task agents for each section:
Task(
subagent_type: "general-purpose",
model: "sonnet", // or "opus" for --high-quality
run_in_background: false,
prompt: "Read references/translator_markdown.md for guidelines.
Translate $WORK_DIR/sections/section_001.md from English to Korean.
Write output to $WORK_DIR/translated/section_001.md"
)
Phase 4: Merge Translated Sections
If split, merge translated sections:
cat $WORK_DIR/translated/section_*.md > $WORK_DIR/translated.md
Phase 5: Generate PDF
$PYTHON scripts/generate_pdf.py \
--markdown "$WORK_DIR/translated.md" \
--output "$OUTPUT_DIR/{filename}_translated.pdf"
Phase 6: Validation (Optional)
Review output for:
- •Markdown formatting preserved
- •Tables rendered correctly
- •Images referenced properly
- •No untranslated text
Model Selection
Default (no flags)
| Task | Model |
|---|---|
| Markdown translation | Sonnet |
| Validation | Haiku |
With --high-quality
| Task | Model |
|---|---|
| Markdown translation | Opus |
| Validation | Sonnet |
Academic Mode (--academic)
When enabled:
- •Technical terms include original language in parentheses
- •Abbreviations expanded on first occurrence
- •Citations and references preserved
- •Formal academic writing style maintained
Term Annotation Styles (--term-style)
| Style | Example |
|---|---|
parenthesis | 기계 학습(Machine Learning) |
footnote | 기계 학습¹ |
inline | 기계 학습/Machine Learning |
First Occurrence (--first-occurrence)
When true (default):
- •First mention: 기계 학습(Machine Learning)
- •Subsequent: 기계 학습
When false:
- •All mentions include original term
Language-Specific Processing
| Source | Special Handling |
|---|---|
| Japanese | Vertical→horizontal writing, ruby tag removal |
| Chinese | Traditional/simplified handling, vertical→horizontal |
| Arabic/Hebrew | RTL→LTR conversion, text direction adjustment |
| English | Standard processing |
| Target | Special Handling |
|---|---|
| Korean | Translationese removal, natural expression check |
Custom Dictionary (Optional)
The translator works without external dictionary files. It naturally translates based on context.
Use custom dictionaries ONLY for:
- •Proper nouns: names, places, organizations, brands
- •Document-specific terms: proprietary terms unique to this document
Do NOT add common words - let the translator handle them naturally.
Creating a Custom Dictionary
Use the --dict option with a JSON file:
{
"metadata": {
"source_language": "en",
"target_language": "ko",
"document_title": "Annual Report 2024"
},
"proper_nouns": {
"names": { "John Smith": "존 스미스" },
"places": { "Silicon Valley": "실리콘밸리" },
"organizations": { "OpenAI": "OpenAI" }
},
"domain_terms": {
"ProprietaryTech": "고유 기술명"
},
"preserve_original": {
"terms": ["API", "GPU", "URL"]
},
"abbreviations": {
"LLM": "Large Language Model"
},
"style_notes": {
"notes": ""
}
}
Templates
| Template | Use Case |
|---|---|
| assets/template.json | General documents |
| assets/template_academic.json | Academic papers, technical documents |
Work Directory Structure
$WORK_DIR/ ├── source.md # Original Markdown (extracted from PDF) ├── metadata.json # Document metadata (title, pages, languages) ├── images/ # Extracted images │ ├── page001_img000.png │ ├── page002_img000.png │ └── ... ├── sections/ # Split sections (for large documents) │ ├── section_001.md │ ├── section_002.md │ └── sections_manifest.json ├── translated/ # Translated sections │ ├── section_001.md │ ├── section_002.md │ └── ... └── translated.md # Final merged translation
Error Handling
| Error | Action |
|---|---|
| PDF extraction failure | Skip corrupted file, report |
| Translation timeout | Retry with smaller chunks |
| Table extraction failure | Treat as text block |
| Layout preservation failure | Fallback to Markdown only |
| Low quality score | Re-translate with Opus |
Text Processing
The following automatic text cleanup is applied during extraction and output generation:
| Issue | Fix Applied |
|---|---|
| Corrupted characters (●) | Restored to parentheses |
| Broken URLs (spaces) | Spaces removed, domains fixed |
| Missing @ in emails | Restored based on pattern |
| Artifact text (a1111111111) | Filtered out |
| Small images (logos, icons) | Filtered (min 200x100) |
| Page headers/footers | Auto-detected and removed |
| Superscript numbers | Converted to ^[n] format |
| Reference section | Auto-formatted with merged entries |
| Concatenated words | Split using wordninja (thepatient → the patient) |
| Reversed text | Detected and corrected (rewol → lower) |
| Missing punctuation spaces | Added (text.Next → text. Next) |
| Table text spacing | Improved with x_tolerance parameter |
PDF Extraction Error Correction
Some complex PDF artifacts cannot be fully corrected during extraction. The translator prompts include instructions to recognize and correct remaining errors:
- •Split medical/scientific terms:
broncho alveolar→bronchoalveolar - •Single-letter fragments:
Diaphragm a tic→Diaphragmatic
See references/translator_markdown.md and references/translator_academic.md for details.
Header/Footer Detection
Automatically detects and removes common header/footer patterns:
- •DOI links (
https://doi.org/...) - •Journal volume/issue patterns (
Journal| (2024) 16:642) - •Page numbers (standalone numbers at page boundaries)
- •Date stamps (
Received: 23 July 2024) - •Copyright notices
Superscript Handling
Reference numbers and author affiliations are detected by font size and converted to standard format:
- •
word¹→word^[1] - •
Author¹,²→Author^[1,2]
Reference Section Processing
Multi-language support for reference section headers:
- •English: References, Bibliography, Works Cited
- •Korean: 참고문헌
- •German: Literatur, Literaturverzeichnis
- •French: Références, Bibliographie
- •Chinese/Japanese: 参考文献
List Detection
Automatically detects and formats various list styles:
- •Bullet:
•,·,-,*,▪,▸,► - •Numbered:
1.,1),(1),① - •Roman:
i.,ii.,iii. - •Letter:
a.,a),(a)
Output Formats
Markdown Output
- •Preserves document structure (headings, paragraphs, lists)
- •Tables converted to Markdown tables
- •URLs converted to clickable links
- •Metadata in YAML frontmatter
PDF Output
- •Generated via pandoc + weasyprint from Markdown
- •Clean text rendering with system fonts (Pretendard preferred)
- •Proper table rendering with borders and headers
- •Clickable links with styling
- •Page numbers at bottom
File Reference
| Path | Description |
|---|---|
SKILL.md | This file |
references/orchestrator.md | Orchestrator workflow guide |
references/translator_markdown.md | Markdown translation guidelines |
references/translator_academic.md | Academic document translation |
references/validator_generic.md | Generic validation instruction |
references/validator_ko.md | Korean-specific validation |
scripts/setup_env.sh | Environment setup (installs pandoc, Python dependencies) |
scripts/extract_to_markdown.py | PDF extraction to Markdown with images |
scripts/split_markdown.py | Split large Markdown into sections by token count |
scripts/generate_pdf.py | PDF output generation (Markdown → PDF via pandoc + weasyprint) |
assets/template.json | Dictionary template for general documents |
assets/template_academic.json | Dictionary template for academic documents |
Known Limitations
PDF extraction has inherent limitations due to the format's nature:
| Limitation | Description | Workaround |
|---|---|---|
| Figure text extraction | Text inside charts/graphs/diagrams may be extracted as body text | Manual review of figure areas |
| Complex table structures | Tables with merged cells or nested structures may not parse correctly | Tables extracted as best-effort Markdown |
| Multi-column layouts | Two-column academic papers may have text order issues | Usually handled correctly, but verify flow |
| Scanned PDFs | Image-based PDFs require OCR (not included) | Use OCR tools first, then translate |
| Mathematical formulas | LaTeX/MathML may not render perfectly | Formulas preserved as-is when possible |
Quality Expectations
- •Academic papers: 85-95% accuracy on text extraction
- •Technical manuals: 80-90% accuracy
- •Complex layouts: 70-85% accuracy (flowcharts, multi-column)
- •Tables: Variable (depends on structure complexity)
For best results with complex documents, review the extracted source.md before translation and manually correct any extraction errors.