PDF Parity Checker Skill
Purpose
Compare the 44 XHTML chapter files against their corresponding POD (print-on-demand) PDF files to ensure visual and structural consistency. This is critical for maintaining brand quality across digital and print editions.
When to Invoke
- •User asks "do the PDFs match the EPUB chapters?"
- •Before sending POD files to IngramSpark or print vendor
- •After making changes to XHTML or CSS
- •User mentions "print edition" or "PDF consistency"
- •User asks "verify the PDFs are up to date"
Workflow
Run PDF Parity Verification
python3 scripts/pdf_verify.py \
  --root REBRANDED_OUTPUT \
  --targets docs/REBRANDED_VISUAL_AUDIT.json \
  --update-json
What it does:
- •For each of the 44 XHTML files:
  - •Locates corresponding PDF in REBRANDED_OUTPUT/pdf-pod/
  - •Compares:
    - •Page count (XHTML rendered vs PDF pages)
    - •Media box dimensions (PDF page size)
    - •First-page visual hash (downscaled grayscale comparison)
    - •Text extraction and paragraph continuity
- •If PDF is missing:
  - •Generates temporary reference PDF via headless browser print-to-PDF
  - •Uses this for comparison (but does NOT commit to repo)
  - •Flags as "MISSING" in report
- •Updates docs/REBRANDED_VISUAL_AUDIT.json with:
  - •pdf_check object for each chapter
  - •Fields: page_count_match, bbox_match, image_hash_delta, pdf_status
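The lookup step can be sketched as follows. This is a minimal illustration, not the actual logic of scripts/pdf_verify.py: the function names find_matching_pdf and pdf_status are hypothetical, and matching on the file stem is an assumption based on the naming rule described under "Issue: Missing PDF".

```python
from pathlib import Path


def find_matching_pdf(xhtml_path, pdf_root):
    """Return the first PDF under pdf_root whose stem equals the XHTML stem, else None."""
    stem = Path(xhtml_path).stem
    # Search subdirectories too, since PDFs may live in e.g. pdf-pod/chapters/.
    matches = sorted(p for p in Path(pdf_root).rglob("*.pdf") if p.stem == stem)
    return matches[0] if matches else None


def pdf_status(xhtml_path, pdf_root):
    """Map the lookup result onto the pdf_status values used in the report (assumed: ok/missing)."""
    return "ok" if find_matching_pdf(xhtml_path, pdf_root) else "missing"
```

Comparing stems rather than building a glob pattern from the basename avoids surprises if a filename ever contains glob characters such as brackets.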
Comparison Metrics
1. Page Count Match
Compares rendered XHTML page count vs PDF page count.
Example:
Chapter I: "Unveiling Your Creative Odyssey"
- XHTML rendered: 8 pages (at 6×9" print size)
- PDF actual: 8 pages
- Status: ✅ MATCH
Acceptable variance:
- •Exact match: ✅ PASS
- •±1 page: ⚠️ WARN (minor reflow difference)
- •±2+ pages: ❌ FAIL (significant layout mismatch)
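The variance rules above translate directly into a small classifier. This is a sketch only; the function name classify_page_count is hypothetical and the real script may structure this differently.

```python
def classify_page_count(xhtml_pages, pdf_pages):
    """Classify the page-count difference using the PASS/WARN/FAIL rules above."""
    diff = abs(xhtml_pages - pdf_pages)
    if diff == 0:
        return "PASS"  # exact match
    if diff == 1:
        return "WARN"  # minor reflow difference
    return "FAIL"      # two or more pages apart


print(classify_page_count(8, 8))    # PASS
print(classify_page_count(10, 11))  # WARN
print(classify_page_count(8, 10))   # FAIL
```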
2. Media Box (Page Size)
Verifies PDF pages are correct physical dimensions.
Expected for 6×9" POD:
- •Width: 432 points (6 inches × 72 points per inch)
- •Height: 648 points (9 inches × 72 points per inch)
Example:
Chapter XV: Media box check
- Expected: 432×648 pt
- Actual: 432×648 pt
- Status: ✅ MATCH
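The arithmetic behind the expected size, and a comparison against a measured media box, might look like the sketch below. The helper names and the 0.5 pt tolerance are assumptions for illustration; the source does not say whether the real check allows any tolerance.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def expected_media_box(width_in, height_in):
    """Convert a trim size in inches to (width, height) in PDF points."""
    return (width_in * POINTS_PER_INCH, height_in * POINTS_PER_INCH)


def bbox_matches(actual, expected, tolerance_pt=0.5):
    """True if both dimensions agree within tolerance_pt (the tolerance is an assumed value)."""
    return all(abs(a - e) <= tolerance_pt for a, e in zip(actual, expected))


expected = expected_media_box(6, 9)
print(expected)                            # (432, 648)
print(bbox_matches((432, 648), expected))  # True
print(bbox_matches((432, 660), expected))  # False
```

With pypdf, the actual dimensions could be read from the first page's mediabox (its width and height properties) and passed in as the actual tuple.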
3. Visual Hash Comparison
Computes perceptual hash of first page to detect visual differences.
Process:
- •Render XHTML first page as PNG (grayscale, downscaled to 200×300)
- •Convert PDF first page to PNG (same size)
- •Compute average hash for both
- •Calculate Hamming distance
Scoring:
- •Hash delta 0-5: ✅ IDENTICAL (perfect match)
- •Hash delta 6-15: ✅ SIMILAR (acceptable variance)
- •Hash delta 16-30: ⚠️ DIFFERENT (minor layout shift)
- •Hash delta >30: ❌ MISMATCH (significant visual difference)
Example:
Chapter IV: Visual hash comparison
- XHTML hash: d4a3f2c1...
- PDF hash: d4a3f2c1...
- Hamming distance: 3
- Status: ✅ IDENTICAL
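A minimal sketch of the hash-and-compare steps, using only the standard library. It assumes the page images have already been rendered, converted to grayscale, and flattened into a sequence of 0-255 values (for example via Pillow's convert("L") and resize); the function names and the bit-list representation are illustrative, not taken from the actual script.

```python
def average_hash(pixels):
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean."""
    pixels = list(pixels)
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hamming_distance(hash_a, hash_b):
    """Count the bit positions at which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def hash_verdict(delta):
    """Map a Hamming distance onto the scoring bands listed above."""
    if delta <= 5:
        return "identical"
    if delta <= 15:
        return "similar"
    if delta <= 30:
        return "different"
    return "mismatch"


xhtml_bits = average_hash([10, 200, 10, 200])
pdf_bits = average_hash([10, 200, 200, 200])
delta = hamming_distance(xhtml_bits, pdf_bits)
print(delta, hash_verdict(delta))  # 1 identical
```

Both images must be reduced to the same dimensions before hashing, otherwise the bit positions do not correspond and the distance is meaningless.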
4. Text Extraction
Extracts text from PDF and verifies key content is present.
Checks:
- •Chapter title appears in first 500 characters
- •Heading order matches XHTML heading structure
- •Paragraph count is similar (±10%)
Example:
Chapter XII: Text extraction
- Title found: ✅ "Financial Wisdom"
- Headings: 12 in XHTML, 12 in PDF ✅
- Paragraphs: 84 in XHTML, 83 in PDF ✅ (within 10%)
- Status: ✅ PASS
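Two of these checks reduce to simple string and arithmetic tests once the PDF text has been extracted. The sketch below assumes the text is already available as a string; the function names are hypothetical, and the case-insensitive title match and the use of the XHTML count as the denominator are assumptions.

```python
def title_found(pdf_text, title, window=500):
    """True if the title occurs within the first `window` characters (case-insensitive, assumed)."""
    return title.lower() in pdf_text[:window].lower()


def paragraph_variance_pct(xhtml_count, pdf_count):
    """Absolute paragraph-count difference as a percentage of the XHTML count."""
    if xhtml_count == 0:
        return 0.0 if pdf_count == 0 else 100.0
    return abs(xhtml_count - pdf_count) / xhtml_count * 100


def paragraphs_similar(xhtml_count, pdf_count, limit_pct=10.0):
    """True if the paragraph counts are within the ±10% band."""
    return paragraph_variance_pct(xhtml_count, pdf_count) <= limit_pct


print(title_found("Chapter XII\nFinancial Wisdom\n...", "Financial Wisdom"))  # True
print(round(paragraph_variance_pct(84, 83), 1))                               # 1.2
print(paragraphs_similar(84, 83))                                             # True
```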
Interpreting Results
JSON Output Structure
{
  "file": "REBRANDED_OUTPUT/xhtml/9-chapter-i-unveiling-your-creative-odyssey.xhtml",
  "basename": "9-chapter-i-unveiling-your-creative-odyssey",
  "pdf_check": {
    "pdf_path": "REBRANDED_OUTPUT/pdf-pod/chapters/9-chapter-i-unveiling-your-creative-odyssey.pdf",
    "pdf_status": "ok",
    "page_count_match": true,
    "page_count_xhtml": 8,
    "page_count_pdf": 8,
    "bbox_match": true,
    "bbox_expected": [432, 648],
    "bbox_actual": [432, 648],
    "image_hash_delta": 3,
    "image_hash_verdict": "identical",
    "text_checks": {
      "title_found": true,
      "heading_count_match": true,
      "paragraph_variance_pct": 1.2
    }
  }
}
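One way to roll a pdf_check object up into a single PASS/WARN/FAIL/MISSING status is sketched below. The roll-up rule itself is an assumption: the source defines thresholds per metric but does not state how they combine, so here a media box mismatch or a paragraph variance above 10% is treated as a warning. The function names overall_status and tally are hypothetical.

```python
from collections import Counter


def overall_status(pdf_check):
    """Combine the per-metric fields into one status (assumed roll-up rule)."""
    if pdf_check.get("pdf_status") == "missing":
        return "MISSING"
    page_diff = abs(pdf_check["page_count_xhtml"] - pdf_check["page_count_pdf"])
    delta = pdf_check["image_hash_delta"]
    variance = pdf_check.get("text_checks", {}).get("paragraph_variance_pct", 0.0)
    if page_diff >= 2 or delta > 30:
        return "FAIL"
    if page_diff == 1 or delta > 15 or not pdf_check["bbox_match"] or variance > 10:
        return "WARN"
    return "PASS"


def tally(entries):
    """Count statuses across a list of audit entries, each carrying a pdf_check object."""
    return Counter(overall_status(entry["pdf_check"]) for entry in entries)
```

Given the parsed contents of docs/REBRANDED_VISUAL_AUDIT.json as a list of entries, tally(entries) would yield the counts needed for a summary line such as "PASS: 38".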
Markdown Summary
Generated in docs/REBRANDED_VISUAL_AUDIT.md:
| File | PDF Status | Page Match | Visual Match | Issues |
|---|---|---|---|---|
| 9-chapter-i-... | ✅ OK | ✅ 8 pages | ✅ Identical | None |
| 15-chapter-vi-... | ⚠️ OK | ⚠️ 10 vs 11 | ✅ Similar | +1 page variance |
| 22-chapter-xii-... | ❌ MISSING | N/A | N/A | PDF not found |
Common Issues and Fixes
Issue: Page Count Mismatch
Symptom: XHTML renders as 8 pages, PDF has 9 pages
Possible causes:
- •Extra blank page in PDF (page break issue)
- •Different margin settings between XHTML and PDF export
- •Widow/orphan control differences
How to fix:
- •Open PDF in Acrobat to verify blank page
- •Adjust print-pod.css orphans/widows settings:
    p { orphans: 2; widows: 2; }
- •Re-export PDF from InDesign or print-to-PDF workflow
- •Re-run parity check to verify
Issue: Visual Hash Mismatch
Symptom: Hash delta >30 (significant visual difference)
Possible causes:
- •Font substitution in PDF vs XHTML
- •Image resolution difference
- •Different CSS applied (print vs digital styles)
How to fix:
- •Open both side-by-side:
  - •XHTML screenshot: docs/screenshots/{basename}/title_1080x1440.png
  - •PDF page 1: Open in Acrobat, export as PNG
- •Identify visual difference (typography, spacing, images)
- •Adjust print-pod.css to match digital rendering
- •Regenerate PDF
- •Re-run parity check
Issue: Missing PDF
Symptom: pdf_status: "missing"
Action:
- •Check whether the PDF has been generated for this chapter yet
- •Verify PDF naming matches XHTML basename:
  - •XHTML: 9-chapter-i-unveiling-your-creative-odyssey.xhtml
  - •PDF should be: 9-chapter-i-unveiling-your-creative-odyssey.pdf
- •If PDF exists with different name, rename to match
- •If PDF doesn't exist:
  - •Generate from XHTML using print-to-PDF workflow
  - •Place in appropriate pdf-pod/ subdirectory
Issue: Text Extraction Failure
Symptom: text_checks.title_found: false
Possible causes:
- •PDF is image-only (scanned, not text-based)
- •PDF has font embedding issues (text not extractable)
- •PDF was generated with text as outlines/curves
How to fix:
- •Verify PDF is text-based (select text in Acrobat)
- •If image-only, regenerate PDF from source XHTML
- •Ensure fonts are embedded as subsets (not outlines)
Example Invocation
User: "Are the print PDFs in sync with the latest EPUB chapters?"
Agent response:
- •Run PDF parity checker
- •Read results from updated JSON
- •Summarize:
PDF Parity Check Complete
=========================
Total chapters: 44
PDFs verified: 42
Missing PDFs: 2

Results:
✅ PASS: 38 chapters (perfect parity)
⚠️ WARN: 4 chapters (minor variance)
❌ FAIL: 0 chapters (significant mismatch)
🔍 MISSING: 2 chapters (PDFs not found)

Warnings (4 chapters):
- Chapter VI: +1 page variance (11 vs 10)
  - Extra blank page in PDF
- Chapter XI: Visual hash delta 18
  - Minor font rendering difference
- Chapter XV: Text variance 12%
  - Some paragraphs reflowed
- Chapter XXII: Media box 432×660
  - Height should be 648

Missing PDFs (2 chapters):
- 30-SelfAssessment.xhtml (no matching PDF found)
- 43-DoodlePage.xhtml (no matching PDF found)

Recommended actions:
1. Fix page break in Chapter VI
2. Review font settings for Chapter XI
3. Generate missing PDFs for Self-Assessment and Doodle pages
4. Verify media box for Chapter XXII

Full report: docs/REBRANDED_VISUAL_AUDIT.md (PDF Parity column)
Detailed JSON: docs/REBRANDED_VISUAL_AUDIT.json (pdf_check objects)
Integration with Other Skills
Run after:
- •epub-visual-auditor - Ensure XHTML rendering is correct first
Run before:
- •Sending POD files to print vendor
- •Uploading to IngramSpark or KDP Print
- •Final publication package
Pair with:
- •epub-publication-validator - Comprehensive pre-publication check
Notes
- •PDF comparison requires pypdf and Pillow Python libraries
- •First run may be slower (generates temporary PDFs for missing files)
- •Temporary reference PDFs are stored in /tmp/ and not committed to repo
- •Visual hash comparison is perceptual (small rendering differences are OK)
- •Re-run after any CSS or XHTML changes to verify parity maintained