PDF Translation Workflow

Name: pdf-translation
Rating: 62
Author: wayne930242

Phase 1: Extract PDF

bash

cd scripts
uv run python extract_pdf.py ../data/pdfs/<filename>.pdf

Outputs (in data/markdown/):

File	Purpose
`<name>.md`	Clean text (markitdown)
`<name>_pages.md`	With `<!-- PAGE N -->` markers
`images/<name>/`	Extracted images

Phase 2: Configure Chapters

•Generate template: uv run python split_chapters.py --init
•Edit chapters.json:

json

{
  "source": "data/markdown/<name>_pages.md",
  "output_dir": "docs/src/content/docs",
  "clean_patterns": ["\\(Order #\\d+\\)"],
  "chapters": {
    "section-slug": {
      "title": "章節標題",
      "order": 1,
      "files": {
        "filename": {
          "title": "頁面標題",
          "description": "SEO 描述",
          "pages": [1, 10],
          "order": 0
        }
      }
    }
  }
}

•Split: uv run python split_chapters.py

Phase 3: Translation

For each markdown file in docs/src/content/docs/:

•Identify terms - Extract English game terms
•Check glossary - Lookup in glossary.json
•Translate - Apply consistent terminology
•Preserve structure - Keep frontmatter, headings, lists intact

Translation Rules

•Maintain original meaning, avoid over-localization
•Game mechanics terms: use established translations
•Proper nouns: keep original or use accepted translations
•Format: preserve markdown syntax exactly

Phase 4: Quality Check

Check	Action
Terminology	Verify against `glossary.json`
Completeness	Compare page count with original
Format	Validate frontmatter, links
Preview	`cd docs && bun dev`

Troubleshooting

Issue	Solution
Garbled text	PDF may have non-standard encoding; try different extraction
Missing pages	Check `_pages.md` for page markers
Broken images	Verify image paths in `images/` folder
Format issues	Use `clean_patterns` to remove artifacts