Import Vocabulary from Word Documents
This skill extracts Arabic vocabulary from .docx files and imports them into the flashcard system.
Workflow Overview
- •Extract → Parse .docx with pandoc
- •Generate → Create markdown file with extracted vocabulary
- •Confirm → Ask user which words to import
- •Import → Create cards via API
- •Archive → Move files to
processed/folder
Step 1: Extract Text from Document
Use pandoc to convert the Word document to markdown:
bash
pandoc "path/to/document.docx" -t markdown
Step 2: Parse and Generate Markdown
From the extracted content, identify vocabulary entries and create a markdown file in the same directory as the source document.
Filename pattern: {original_name}.vocab.md
Example: صفات.docx → صفات.vocab.md
Markdown format:
markdown
# Vocabulary: صفات (Adjectives) Source: صفات.docx Extracted: 2024-01-15 Category: Adjectives ## Words | # | Arabic | English | Notes | Import | |---|--------|---------|-------|--------| | 1 | جديد / جديدة | new | masc/fem | ✓ | | 2 | قديم / قديمة | old | masc/fem | ✓ | | 3 | ثقيل / ثقيلة | heavy | masc/fem | ✓ | | 4 | خفيف / خفيفة | light | masc/fem | ✓ | ## Summary - Total words found: 12 - Ready to import: 12 - Skipped: 0
Parsing rules:
- •Strip
{dir="rtl"}markers from Arabic text - •Ignore image references like
 - •Skip exercise blanks (____), instructions, non-vocabulary content
- •Handle masculine/feminine pairs: "جديد- جديدة" → "جديد / جديدة"
- •Preserve Arabic diacritics (tashkeel)
Step 3: Ask for Confirmation
Present the generated markdown to the user and ask:
code
I've extracted 12 vocabulary items from "صفات.docx" Category detected: Adjectives (صفات) Target deck: [New deck "Adjectives (صفات)" / Existing deck "..."] Preview: 1. جديد / جديدة - new 2. قديم / قديمة - old 3. ثقيل / ثقيلة - heavy ... Would you like to: - Import all words - Review the full list in صفات.vocab.md first - Skip certain words (specify which)
Step 4: Check/Create Deck
Check existing decks:
bash
curl -s http://localhost:3001/api/decks
Determine deck:
- •Check filename for category hints:
- •"صفات" = Adjectives
- •"أسماء" = Nouns
- •"أفعال" = Verbs
- •"definite article" = Grammar
- •Match against existing deck names
- •If no match: create new deck or use "Import YYYY-MM-DD"
Create new deck (if needed):
bash
curl -X POST "http://localhost:3000/api/decks" \
-H "Content-Type: application/json" \
-d '{"name": "Adjectives (صفات)", "description": "Imported from صفات.docx"}'
Step 5: Import Cards via API
bash
curl -X POST "http://localhost:3000/api/decks/{deck_id}/cards" \
-H "Content-Type: application/json" \
-d '[
{"front": "جديد / جديدة", "back": "new", "notes": "masc/fem"},
{"front": "قديم / قديمة", "back": "old", "notes": "masc/fem"}
]'
Step 6: Archive to Processed Folder
After successful import, move both files to a processed/ subfolder:
bash
# Create processed folder if it doesn't exist mkdir -p "path/to/processed" # Move the original docx mv "path/to/document.docx" "path/to/processed/" # Move the generated vocab markdown mv "path/to/document.vocab.md" "path/to/processed/"
Folder structure after processing:
code
example-arabic-docs/ ├── processed/ │ ├── صفات.docx │ ├── صفات.vocab.md │ ├── الأسماء.docx │ └── الأسماء.vocab.md └── new-document.docx (not yet processed)
Step 7: Report Results
code
✓ Imported 12 cards to deck "Adjectives (صفات)" (ID: 5) Files archived: → processed/صفات.docx → processed/صفات.vocab.md View deck: http://localhost:3000/deck/5
Error Handling
- •If dev server isn't running: prompt user to run
npm run dev - •If API call fails: keep files in original location, report error
- •If no vocabulary found: create empty .vocab.md noting "No vocabulary extracted"
Notes
- •The .vocab.md files serve as a permanent record of what was extracted
- •User can edit .vocab.md before confirming import if needed
- •The processed/ folder makes it easy to see what's been imported
- •WebSearch can be used to look up unclear translations