Bilingual Content Generation

Objectives

•Generate bilingual markdown from structured content
•Preserve original text with translation placeholders
•Handle code blocks, images, and formatting correctly
•Support incremental translation workflow

Key Instructions

1. Input Format

Expect structured JSON data with sections:

json

{
  "title": "Article Title",
  "sections": [
    {"type": "h2", "text": "Section Header"},
    {"type": "p", "text": "Paragraph content"},
    {"type": "code", "text": "code content", "language": "python"},
    {"type": "image", "src": "url", "local_path": "images/img.png", "caption": "..."}
  ]
}

2. Generate Bilingual Markdown

python

def generate_bilingual_markdown(data: dict, output_path: Path):
    md_lines = []
    
    # Header with original title
    md_lines.append(f"# {data['title']} (中英对照)\n")
    md_lines.append(f"> **原文链接:** {data.get('source_url', '')}\n")
    md_lines.append("---\n")
    
    last_image_hash = None
    
    for section in data['sections']:
        stype = section['type']
        
        # Headers
        if stype == 'h2':
            md_lines.append(f"\n## {section['text']}\n")
            md_lines.append("\n[待翻译]\n")
        elif stype == 'h3':
            md_lines.append(f"\n### {section['text']}\n")
            md_lines.append("\n[待翻译]\n")
        
        # Paragraphs
        elif stype == 'p':
            md_lines.append(f"\n{section['text']}\n")
            md_lines.append("\n[待翻译]\n")
        
        # Code blocks (no translation needed)
        elif stype == 'code':
            lang = section.get('language', 'python')
            md_lines.append(f"\n```{lang}\n{section['text']}\n```\n")
        
        # Images (deduplicate by hash)
        elif stype == 'image':
            if 'local_path' in section:
                current_hash = extract_hash(section['local_path'])
                if current_hash != last_image_hash:
                    md_lines.append(f"\n![{section.get('alt', '')}]({section['local_path']})\n")
                    if section.get('caption'):
                        md_lines.append(f"*{section['caption']}*\n")
                    last_image_hash = current_hash
    
    # Write output
    with open(output_path, 'w', encoding='utf-8') as f:
        f.writelines(md_lines)

3. Handle Special Cases

Skip UI elements:

python

# Filter out common UI noise
skip_texts = ['--', 'Listen', 'Share', 'Follow', 'Sign up', 'Sign in']
if section['type'] == 'p' and section['text'] in skip_texts:
    continue

Deduplicate images:

python

import re

def extract_hash(filename: str) -> str:
    """Extract hash from filename like img_16_21c1b3df.png"""
    match = re.search(r'img_\d+_([a-f0-9]+)\.\w+', filename)
    return match.group(1) if match else None

Preserve code formatting:

python

# Don't add translation placeholders for code blocks
if stype == 'code':
    md_lines.append(f"\n```{section.get('language', '')}\n{section['text']}\n```\n")
    # No [待翻译] here

Workflow

Step 1: Prepare Structured Data

From web scraping or manual extraction:

python

article_data = {
    'title': 'Technical Article Title',
    'source_url': 'https://...',
    'sections': [...]
}

with open('article_data.json', 'w', encoding='utf-8') as f:
    json.dump(article_data, f, indent=2, ensure_ascii=False)

Step 2: Generate Bilingual Template

bash

uv run python generate_bilingual_md.py

Output example:

markdown

# Technical Article Title (中英对照)

> **原文链接:** https://...

---

## Introduction

[待翻译]

This article explains the concept...

[待翻译]

```python
def example():
    pass

Diagram

code


### Step 3: Fill in Translations

Manually or with AI assistance:
```markdown
## Introduction

## 简介

[待翻译]

This article explains the concept...

本文解释了这个概念...

Configuration

Translation Placeholder

Customize the placeholder text:

python

TRANSLATION_PLACEHOLDER = "[待翻译]"  # Chinese
# or
TRANSLATION_PLACEHOLDER = "[To be translated]"  # English

Section Handling

Configure which sections need translation:

python

TRANSLATABLE_TYPES = ['h1', 'h2', 'h3', 'p', 'blockquote']
NO_TRANSLATION_TYPES = ['code', 'image', 'table']

Image Deduplication

Enable/disable:

python

DEDUPLICATE_IMAGES = True  # Recommended for Medium articles

Output Formats

Standard Bilingual (Side-by-side)

markdown

Original text here.

翻译文本在这里。

Inline Bilingual

markdown

Original text here. (翻译文本在这里。)

Separate Files

python

# Generate two files
generate_original_only(data, 'article_en.md')
generate_translation_only(data, 'article_zh.md')

Validation

Before using the generated markdown:

• Check that all sections have translation placeholders
• Verify code blocks are preserved without placeholders
• Ensure images are not duplicated
• Confirm formatting (headers, lists, quotes) is correct
• Test that links and image paths are valid

Best Practices

•Preserve structure: Keep original formatting intact
•Clear placeholders: Use consistent, searchable placeholder text
•Code blocks: Never add translation placeholders inside code
•Image captions: Translate captions separately if needed
•Incremental work: Fill translations section by section
•Version control: Commit original template before translating

Integration with Translation Tools

Manual Translation

•Generate template with placeholders
•Use editor's find/replace to locate [待翻译]
•Replace with actual translation
•Keep original text above translation

AI-Assisted Translation

python

def translate_section(text: str, target_lang: str = 'zh') -> str:
    # Use OpenAI, DeepL, or other translation API
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Translate to {target_lang}: {text}"
        }]
    )
    return response.choices[0].message.content

Batch Translation

python

def fill_translations(md_path: Path):
    with open(md_path, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Find all sections needing translation
    sections = re.findall(r'(.*?)\n\n\[待翻译\]', content)
    
    for section in sections:
        translation = translate_section(section)
        content = content.replace(
            f"{section}\n\n[待翻译]",
            f"{section}\n\n{translation}"
        )
    
    with open(md_path, 'w', encoding='utf-8') as f:
        f.write(content)

Common Issues

Issue: Images appear multiple times

Solution: Enable hash-based deduplication:

python

last_image_hash = None
for section in sections:
    if section['type'] == 'image':
        current_hash = extract_hash(section['local_path'])
        if current_hash != last_image_hash:
            # Add image
            last_image_hash = current_hash

Issue: Code blocks have translation placeholders

Solution: Skip translation for code:

python

if section['type'] == 'code':
    md_lines.append(f"\n```{lang}\n{text}\n```\n")
    # Don't add [待翻译]

Issue: Special characters break formatting

Solution: Use proper encoding:

python

with open(output_path, 'w', encoding='utf-8') as f:
    f.write(content)

Reference Scripts

See .skills/learning-bilingual_content/scripts/ for implementation:

•generate_bilingual_md.py - Main generation script
•translate_batch.py - Batch translation helper