UNV+SN Backparse Skill (Specification v1.8)
This skill parses Chinese Union Version (UNV) biblical text with Strong's Numbers into structured semantic groups according to SPECIFICATION_v1.8.md.
v1.7 Feature: Automatic compound preposition detection - resolves 94% of qb_qp_mismatch errors by detecting מִן (04480) compounds directly from qp.php wform field.
v1.7.2 Enhancement: Multi-token compound detection - automatically skips 900x prefixes to detect complex compounds like <04480><09001><06440> (מִלִּפְנֵי).
v1.8 Enhancement: Generic compound detection - supports all types of compound prepositions, not just מִן. Detects לִפְנֵי (<09001><06440>) and other 900x-starting compounds by checking both wform and remark fields in qp.php.
When to Use This Skill
Activate this skill when the user:
- •Requests parsing of specific biblical verses (e.g., "parse Genesis 1:1", "parse all verses in Exodus 2")
- •Wants to batch process a range of verses
- •Needs to analyze Strong's number groupings in UNV text
- •Asks about the parsing results or output format
- •Requests verification of parsed output
Core Architecture
The system uses a three-stage pipeline:
- •Data Retrieval (
fetch_text.sh) - Fetches from FHL API endpoints - •Parsing (
parse_verse_v1_8.pyviarun_parser_temp.py) - Transforms raw data into structured groups- •v1.7: Automatically detects compound prepositions from qp.php wform
- •v1.8: Generic detection for all compound types
- •Output Generation - Saves to
output/{Book}/{Chapter}/{verse}(text format)
Parsing Workflow
Single Verse Parsing
# Parse a single verse (e.g., Genesis 1:1) python run_parser_temp.py 1 1 # View output without writing to disk python run_parser_temp.py --no-write 1 1
Batch Parsing
Follow the Batch_Parsing_SOP.md workflow:
- •
Create Output Directories
bashmkdir -p output/{Book}/{Chapter}/ - •
Determine Verse Range
- •If only book is provided, automatically determine starting point
- •For existing book: continue from last processed verse
- •For new book: start from chapter 1, verse 1
- •
Iterate and Process Each Verse
bashfor verse in {START..END}; do python run_parser_temp.py {chapter} $verse done - •
Handle Uncertainty
- •Files with ambiguity are named
{verse}_uncertain - •Append
--- UNCERTAINTY NOTES ---section describing issues
- •Files with ambiguity are named
- •
Verification
- •Check all output files were created
- •Verify no unexpected
_uncertainfiles - •Spot-check sample outputs for correctness
Output Format (UNV_SN_Output_Format.md)
Each parsed verse contains three sections:
I. Parsed and Formatted Text Section
Traditional Chinese table format with:
- •Individual Strong's numbers:
<NNNN> — [詞性]「[中文意義]」 - •With morphology:
<NNNN>(8xxx) — [詞性]「[中文意義]」 *N - •Grouped numbers:
<NNNN><MMMM> — [詞性]「[中文意義]」
II. Raw UNV+SN Source Text Section
Original bible_text with WH/WTH/WAH prefixes preserved
III. Morphology Notes Section
Detailed grammatical explanations: *N: [詳細描述]
Key Parsing Rules (SPECIFICATION_v1.8.md)
v1.8 Generic Compound Preposition Detection
All compound prepositions are now detected automatically, not just מִן (04480).
v1.7.2 Multi-Token Compound Detection
Automatic Detection from qp.php (v1.7.2 Enhanced):
- •When
<04480>(מִן) appears in qb.php but not in qp.php, parser checks the next core token's qp.php record - •v1.7.2: Automatically skips intervening 900x prefixes to find the core token
- •If core token's
wformcontains "介系詞 מִן +" pattern, all involved tokens are merged into compound preposition - •Supports multi-token compounds like
<04480><09001><06440>(מִלִּפְנֵי) - •No dictionary file needed - information extracted directly from qp.php wform field
Output Format:
<04480><05921> — 複合介系詞 מֵעַל「在…上面、在旁邊、關於、敵對、攻擊」 [註]: 介系詞 מִן + 介系詞 עַל
Common Compounds:
- •
<04480><05921>→ מֵעַל "from above" (מִן compound) - •
<04480><08478>→ מִתַּחַת "from under" (מִן compound) - •
<09001><06440>→ לִפְנֵי "before" (v1.8 new: 900x compound) - •
<04480><09001><06440>→ מִלִּפְנֵי "from before" (v1.7.2: multi-token)
Impact:
- •v1.7: Resolves ~1,097 qb_qp_mismatch errors (94% of מִן-related errors)
- •v1.8: Additionally resolves לִפְנֵי and other 900x-starting compounds
Token Classification
Three distinct token types with non-overlapping ranges:
- •Core (Strong's):
<dddd>or{<dddd>}- Numbers 1-8999 (excluding 8xxx, 09xxx) - •Morphology (8xxx):
(**8ddd),{8ddd}- 4-digit codes 8000-8999, verbal stems and tenses - •Prefixes (900x):
<09ddd>- 5-digit codes 09000-09999 only, inseparable particles (ל־, ב־, ה־, etc.)- •IMPORTANT: 4-digit numbers like
<0914>are NOT prefixes (must be 5 digits starting with 09)
- •IMPORTANT: 4-digit numbers like
Normalization (§3.1)
MUST perform before parsing:
- •Remove
WH/WTH/WAHinternal prefixes - •Convert
<WTH8xxx>to(**8xxx)(morphology codes) - •Preserve
<09ddd>as 900x prefixes - •Recognize
{<dddd>}as implicit core,{8xxx}as implicit morph
Grouping Rules (§3.3)
Scan Direction: Left-to-right, ignoring punctuation/whitespace
- •
Prefix Attachment: 900x codes enter
prefix_buffer, skip over{<...>}and{8xxx}, attach to next core token - •
Morphology Attachment: Always left-attach to most recent core group
- •
Brace Preposition Decision Tree (for
{<PREP>}where PREP in["05921","04480","0413","00996"]):- •Exception 1 (Highest Priority): If
qp.wformshows pronoun suffix OR infinitive complement → left-attach to verb (post_brace) - •Exception 2:
{<0853>}(object marker אֵת) → always right-attach to noun (pre_brace) - •General Case: If right-side token (skipping 900x) is noun → right-attach (
pre_brace); else independent group with warning
- •Exception 1 (Highest Priority): If
- •
Construct Linker (optional v1.2-B): Link construct state nouns to following nouns using
construct_of
Configuration Profile (§4.1)
Hardcoded in parse_verse_v1_7.py:
PROFILE = {
"brace_preps": ["05921", "04480", "0413", "00996"], # עַל, מִן, אֶל, בֵּין
"object_marker": "0853", # אֵת
"ignored_codes": ["09015"], # Paragraph markers
# v1.7 new configuration
"detect_compounds_from_qp": True, # Detect compounds from qp.php wform
"merge_prep_plus_prep": True, # Merge prep+prep compounds
"merge_prep_plus_noun": False, # Optional: merge prep+noun
}
Data Sources
FHL API (bible.fhl.net):
- •
qb.php: UNV text with Strong's numbers (requires Chinese book abbreviations) - •
qp.php: Parsing/morphology data (requires English book abbreviations)
Book Mappings: 66 books with bidirectional lookup (Gen ↔ 創, Matt ↔ 太, etc.)
Error Handling
Uncertainty Detection
Mark files as {verse}_uncertain when:
- •Strong's number from
qb.phpmissing inqp.php - •Ambiguous brace preposition attachment
- •Data inconsistencies between APIs
- •Unresolvable grouping decisions
Warning Types
Add to warnings[] array:
- •
brace_attach_ambiguous: Cannot determine preposition attachment - •
dangling_900x: Prefix without core token - •
morph_without_core: Orphaned morphology code - •
qb_qp_core_mismatch: Data mismatch between APIs
Issue Logging (New Feature)
The parser automatically logs issues to eight files in output/:
- •
strong_number_from_qb.php_not_found_in_qp.php.txt (NEW in v1.8.1)
- •Dedicated log for Strong's number mismatches between qb.php and qp.php
- •These are numbers that appear in UNV text but have no corresponding qp.php morphology record
- •Includes KJV cross-reference to help determine if issue is UNV-specific or broader
- •Example:
Strong's number <03212> from qb.php not found in qp.php records. | KJV also uses <03212> - •Important: This separates the most common issue type (347 entries in Gen+Exod) from other uncertain cases
- •
dangling_prefixes.txt (NEW in v1.8.1)
- •Dedicated log for 900x prefix translation artifacts (74 cases in Gen+Exod)
- •These are FHL data encoding artifacts where Chinese translation adds prepositions not present as independent Strong's numbers in Hebrew
- •Not parser errors - correctly identified data quality issues
- •See
dangling_prefixes.mdfor comprehensive analysis - •Example:
[2025-11-25] Gen 3:16 | dangling_p900x | 900x prefix <09002> had no following Strong's number to attach to.
- •
dangling_brace_preps.txt (NEW in v1.8.2)
- •Dedicated log for brace preposition translation artifacts (12 cases in Gen+Exod)
- •Implicit prepositions
{<0413>},{<05921>},{<04480>}at syntactic boundaries without suitable attachment points - •Similar to dangling_prefixes but for brace prepositions
- •Not parser errors - FHL data encoding where Chinese translation simplifies original syntax
- •See
dangling_brace_preps.mdfor comprehensive analysis - •Example:
[2025-11-25] Gen 19:5 | dangling_brace_prep | Brace preposition <0413> had no suitable attachment point.
- •
dangling_object_markers.txt (NEW in v1.8.3)
- •Dedicated log for object marker translation artifacts (19 cases in Gen+Exod)
- •Implicit object markers
{<0853>}(אֵת) in sentence-final, appositive, or coordinated object structures - •Hebrew uses explicit את to mark definite direct objects, Chinese omits or fuses into verbs
- •Not parser errors - FHL data encoding of translation-source structural differences
- •See
dangling_object_markers.mdfor comprehensive analysis - •Example:
[2025-11-25] Gen 22:12 | dangling_object_marker | Object marker <0853> had no suitable noun to attach to.
- •
uncertain_or_expandable_issues.txt
- •Issues that cannot be resolved with confidence (excluding qb_qp_mismatch, dangling_p900x, dangling_brace_prep, and dangling_object_marker)
- •Cases requiring spec expansion or manual review
- •Logged issue types:
brace_attach_ambiguous, otherdangling_*types - •v1.8.3 changes: No longer contains
dangling_object_marker(moved to dedicated file) - now empty or near-empty
- •
compatible_but_notable_issues.txt
- •Successfully parsed cases worth special attention
- •Edge cases, unusual constructions, multiple valid interpretations
- •Helps identify patterns for quality assurance and future spec refinements
- •
compound_prep_plus_noun.txt (NEW in v1.7)
- •Prep+noun compounds detected but not merged (per
merge_prep_plus_noun: Falseconfig) - •FHL data encoding artifacts where qb.php splits מִן but qp.php shows compound
- •These are NOT parsing errors - they reflect intentional design choice
- •Example:
<04480><03605>= מִכָּל "from all" (מִן + כֹּל) - •v1.7.1 fix: Correctly filters these from
uncertain_or_expandable_issues.txtto avoid duplicate logging
- •Prep+noun compounds detected but not merged (per
- •
qp_data_type_errors.txt (NEW in v1.8.4)
- •Dedicated log for qp.php data type inconsistencies
- •Edge cases where qp.php returns unexpected data types (e.g.,
snfield as list instead of string) - •Parser now handles these gracefully instead of crashing
- •Example:
[2025-11-26] Gen 2:16 | qp_sn_is_list | qp.php 'sn' field is list instead of string: ['01234', '05678']
Log Format: [timestamp] verse_ref | issue_type | description
Example Entries:
[2025-11-25 01:57:42] Gen 3:14 | qb_qp_mismatch | Strong's number <03212> from qb.php not found in qp.php records. | KJV also uses <03212> [2025-11-24 23:40:07] Gen 2:2 | prep_noun_compound | Prep+noun compound detected: <04480><03605> = מִכָּל (介系詞 מִן + 名詞,單陽附屬形) - not merged per config [2025-11-25 01:50:43] Gen 3:16 | dangling_p900x | 900x prefix <09002> had no following Strong's number to attach to.
Important User Presentation Rules
CRITICAL: When showing parsed verse results to the user:
- •Present all three sections in order (Parsed Text → Raw Source → Morphology Notes)
- •Display sections as-is with no inserted commentary
- •After all three sections, you MAY add English explanations if helpful
- •DO NOT insert English translations or bullet points within the sections
- •DO NOT add commentary between sections
Testing Strategy
Verified Test Cases (SPECIFICATION_v1.8.md §7):
- •Gen 1:2 - Brace preposition right-attach + construct state
- •Gen 1:4 - Object marker handling with multiple
{<0853>} - •Gen 1:5 - FHL profile mapping with inferred vs explicit prefixes
- •Gen 3:5 - Verb left-attach exception for infinitive complement
- •Gen 4:16 (v1.7.2) - Multi-token compound
<04480><09001><06440>detection - •Gen 6:11 (v1.8) - לִפְנֵי compound
<09001><06440>detection
Validate parsed output against expected groupings in spec §7.
Common Commands
# Fetch verse data
./fetch_text.sh --engs Gen --chap 1 --sec 1
# Parse single verse
python run_parser_temp.py 1 1
# Batch parse chapter (e.g., Genesis 2, verses 1-25)
mkdir -p output/Gen/2
for verse in {1..25}; do python run_parser_temp.py 2 $verse; done
# Verify outputs
ls -1 output/Gen/2/ | wc -l
ls -1 output/Gen/2/ | grep "_uncertain"
# View sample output
cat output/Gen/2/1
# Check issue logs
tail -20 output/strong_number_from_qb.php_not_found_in_qp.php.txt
tail -20 output/uncertain_or_expandable_issues.txt
tail -20 output/compatible_but_notable_issues.txt
tail -20 output/compound_prep_plus_noun.txt
# Search for specific verse in logs
grep "Gen 3:14" output/strong_number_from_qb.php_not_found_in_qp.php.txt
grep "Gen 3:16" output/uncertain_or_expandable_issues.txt
grep "Gen 1:2" output/compatible_but_notable_issues.txt
grep "Gen 2:2" output/compound_prep_plus_noun.txt
# Count log entries
wc -l output/strong_number_from_qb.php_not_found_in_qp.php.txt
wc -l output/uncertain_or_expandable_issues.txt
wc -l output/compound_prep_plus_noun.txt
Files and Dependencies
Core Files:
- •
SPECIFICATION_v1.8.md- Authoritative parsing rules (standalone, includes all previous versions) - •
Batch_Parsing_SOP.md- Batch processing workflow - •
UNV_SN_Output_Format_Gen_1_1.md- Output format specification - •
fetch_text.sh- API wrapper script - •
parse_verse_v1_8.py- Current parser (outputs text format) - •
run_parser_temp.py- Batch orchestrator
Dependencies: curl, jq, Python 3
Step-by-Step Execution Guide
When user requests parsing:
- •Acknowledge Request: Confirm book, chapter, verse range
- •Create TodoList: Track directory creation, parsing, verification
- •Create Directories:
mkdir -p output/{Book}/{Chapter}/ - •Execute Parsing: Run appropriate batch command or single verse
- •Verify Results: Check file count, look for
_uncertainfiles - •Show Samples: Display 1-2 sample outputs for user review
- •Report Completion: Confirm range processed and any issues found
Example Execution
User: "Parse Genesis chapter 3"
1. Create todo list with 4 items
2. mkdir -p output/Gen/3
3. Determine verse count (24 verses in Genesis 3)
4. Run: for verse in {1..24}; do python run_parser_temp.py 3 $verse; done
5. Verify: ls -1 output/Gen/3/ | wc -l (should show 24)
6. Check: ls -1 output/Gen/3/ | grep "_uncertain" (ideally 0)
7. Display: Show sample parsed output from verse 1 and 16
8. Report: "Successfully parsed Genesis 3:1-24 (24 verses, 0 uncertain)"
Notes
- •Parser is in partial implementation status; full v1.7.2 features may need enhancement
- •Always consult SPECIFICATION_v1.7.2.md for authoritative rules
- •Output location:
output/{Book}/{Chapter}/{verse}or{verse}_uncertain - •This is a subdirectory of the larger Strong's Number Embedding Project