PDF Parity Checker Skill

Purpose

Compare the 44 XHTML chapter files against their corresponding POD (print-on-demand) PDF files to ensure visual and structural consistency. This is critical for maintaining brand quality across digital and print editions.

When to Invoke

•User asks "do the PDFs match the EPUB chapters?"
•Before sending POD files to IngramSpark or print vendor
•After making changes to XHTML or CSS
•User mentions "print edition" or "PDF consistency"
•User asks "verify the PDFs are up to date"

Workflow

Run PDF Parity Verification

bash

python3 scripts/pdf_verify.py \
  --root REBRANDED_OUTPUT \
  --targets docs/REBRANDED_VISUAL_AUDIT.json \
  --update-json

What it does:

•For each of the 44 XHTML files:
  - Locates corresponding PDF in REBRANDED_OUTPUT/pdf-pod/
  - Compares:
    - Page count (XHTML rendered vs PDF pages)
    - Media box dimensions (PDF page size)
    - First-page visual hash (downscaled grayscale comparison)
    - Text extraction and paragraph continuity
•If PDF is missing:
  - Generates temporary reference PDF via headless browser print-to-PDF
  - Uses this for comparison (but does NOT commit to repo)
  - Flags as "MISSING" in report
•Updates docs/REBRANDED_VISUAL_AUDIT.json with:
  - pdf_check object for each chapter
  - Fields: page_count_match, bbox_match, image_hash_delta, pdf_status

Comparison Metrics

1. Page Count Match

Compares rendered XHTML page count vs PDF page count.

Example:

code

Chapter I: "Unveiling Your Creative Odyssey"
- XHTML rendered: 8 pages (at 6×9" print size)
- PDF actual: 8 pages
- Status: ✅ MATCH

Acceptable variance:

•Exact match: ✅ PASS
•Off by 1 page: ⚠️ WARN (minor reflow difference)
•Off by 2 or more pages: ❌ FAIL (significant layout mismatch)
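
The thresholds above can be sketched as a small classifier. This is a minimal illustration, not the code in scripts/pdf_verify.py; the function name `classify_page_count` is hypothetical.

```python
def classify_page_count(xhtml_pages: int, pdf_pages: int) -> str:
    """Map a page-count difference to PASS / WARN / FAIL.

    Hypothetical helper: the thresholds come from the list above,
    the name and return values are assumptions.
    """
    delta = abs(xhtml_pages - pdf_pages)
    if delta == 0:
        return "PASS"   # exact match
    if delta == 1:
        return "WARN"   # minor reflow difference
    return "FAIL"       # two or more pages apart


if __name__ == "__main__":
    print(classify_page_count(8, 8))    # PASS
    print(classify_page_count(10, 11))  # WARN
    print(classify_page_count(8, 12))   # FAIL
```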

2. Media Box (Page Size)

Verifies that PDF pages have the correct physical dimensions.

Expected for 6×9" POD:

•Width: 432 points (6 inches × 72 points per inch)
•Height: 648 points (9 inches × 72 points per inch)

Example:

code

Chapter XV: Media box check
- Expected: 432×648 pt
- Actual: 432×648 pt
- Status: ✅ MATCH
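
A minimal sketch of the media box check, assuming the page size is read with pypdf (listed under Notes). The helper names and the 0.5 pt tolerance are assumptions, not values taken from scripts/pdf_verify.py.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def expected_media_box(width_in: float, height_in: float) -> tuple:
    """Convert a trim size in inches to (width, height) in points."""
    return (width_in * POINTS_PER_INCH, height_in * POINTS_PER_INCH)


def media_box_matches(actual, expected, tolerance_pt: float = 0.5) -> bool:
    """Compare two (width, height) pairs; the tolerance is an assumed value."""
    return all(abs(a - e) <= tolerance_pt for a, e in zip(actual, expected))


def first_page_media_box(pdf_path: str) -> tuple:
    """Read the first page's media box; pypdf is imported lazily."""
    from pypdf import PdfReader

    box = PdfReader(pdf_path).pages[0].mediabox
    return (float(box.width), float(box.height))


if __name__ == "__main__":
    expected = expected_media_box(6, 9)
    print(expected)                                 # (432, 648)
    print(media_box_matches((432, 648), expected))  # True
    print(media_box_matches((432, 660), expected))  # False
```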

3. Visual Hash Comparison

Computes a perceptual hash of the first page of each rendering (XHTML and PDF) to detect visual differences.

Process:

•Render XHTML first page as PNG (grayscale, downscaled to 200×300)
•Convert PDF first page to PNG (same size)
•Compute average hash for both
•Calculate Hamming distance

Scoring:

•Hash delta 0-5: ✅ IDENTICAL (no meaningful difference)
•Hash delta 6-15: ✅ SIMILAR (acceptable variance)
•Hash delta 16-30: ⚠️ DIFFERENT (minor layout shift)
•Hash delta >30: ❌ MISMATCH (significant visual difference)

Example:

code

Chapter IV: Visual hash comparison
- XHTML hash: d4a3f2c1...
- PDF hash: d4a3f2c1...
- Hamming distance: 3
- Status: ✅ IDENTICAL
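
The hashing and scoring steps can be sketched as follows. This is an illustration under assumptions: the hash grid size is not stated above, so the 8×8 default (64 bits) in `load_grayscale_pixels` is a guess, and all function names are hypothetical. Only the verdict thresholds come from the Scoring list.

```python
def average_hash(pixels):
    """Average hash: one bit per pixel, set when the pixel is above the mean.

    `pixels` is a flat sequence of grayscale values (0-255).
    """
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hamming_distance(hash_a, hash_b) -> int:
    """Count the bit positions where two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def hash_verdict(delta: int) -> str:
    """Map a Hamming distance to the verdicts in the Scoring list."""
    if delta <= 5:
        return "identical"
    if delta <= 15:
        return "similar"
    if delta <= 30:
        return "different"
    return "mismatch"


def load_grayscale_pixels(png_path: str, size=(8, 8)):
    """Load a PNG as flat grayscale values; Pillow is imported lazily.

    The 8x8 default is an assumption, not a value from the source.
    """
    from PIL import Image

    with Image.open(png_path) as img:
        return list(img.convert("L").resize(size).tobytes())


if __name__ == "__main__":
    a = average_hash([0, 255, 0, 255])
    b = average_hash([0, 255, 255, 0])
    print(hamming_distance(a, b))  # 2
    print(hash_verdict(3))         # identical
```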

4. Text Extraction

Extracts text from PDF and verifies key content is present.

Checks:

•Chapter title appears in first 500 characters
•Heading order matches XHTML heading structure
•Paragraph count is similar (±10%)

Example:

code

Chapter XII: Text extraction
- Title found: ✅ "Financial Wisdom"
- Headings: 12 in XHTML, 12 in PDF ✅
- Paragraphs: 84 in XHTML, 83 in PDF ✅ (within 10%)
- Status: ✅ PASS
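
A minimal sketch of the three text checks, assuming the PDF text, heading lists, and paragraph counts have already been extracted. The function names, the case-insensitive title match, and the `heading_order_match` and `paragraphs_within_tolerance` keys are assumptions; the other keys mirror the `text_checks` fields shown under JSON Output Structure.

```python
def paragraph_variance_pct(xhtml_count: int, pdf_count: int) -> float:
    """Paragraph-count difference as a percentage of the XHTML count."""
    if xhtml_count == 0:
        return 0.0 if pdf_count == 0 else 100.0
    return abs(xhtml_count - pdf_count) / xhtml_count * 100.0


def run_text_checks(title, pdf_text, xhtml_headings, pdf_headings,
                    xhtml_paragraphs, pdf_paragraphs):
    """Run the title, heading, and paragraph checks on extracted content."""
    variance = paragraph_variance_pct(xhtml_paragraphs, pdf_paragraphs)
    return {
        # Title must appear within the first 500 extracted characters.
        "title_found": title.casefold() in pdf_text[:500].casefold(),
        "heading_count_match": len(xhtml_headings) == len(pdf_headings),
        # Same headings in the same order (assumed interpretation).
        "heading_order_match": list(xhtml_headings) == list(pdf_headings),
        "paragraph_variance_pct": round(variance, 1),
        "paragraphs_within_tolerance": variance <= 10.0,
    }


if __name__ == "__main__":
    result = run_text_checks(
        title="Financial Wisdom",
        pdf_text="Chapter XII\nFinancial Wisdom\nOpening paragraph...",
        xhtml_headings=["Financial Wisdom", "Budgeting"],
        pdf_headings=["Financial Wisdom", "Budgeting"],
        xhtml_paragraphs=84,
        pdf_paragraphs=83,
    )
    print(result)
```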

Interpreting Results

JSON Output Structure

json

{
  "file": "REBRANDED_OUTPUT/xhtml/9-chapter-i-unveiling-your-creative-odyssey.xhtml",
  "basename": "9-chapter-i-unveiling-your-creative-odyssey",
  "pdf_check": {
    "pdf_path": "REBRANDED_OUTPUT/pdf-pod/chapters/9-chapter-i-unveiling-your-creative-odyssey.pdf",
    "pdf_status": "ok",
    "page_count_match": true,
    "page_count_xhtml": 8,
    "page_count_pdf": 8,
    "bbox_match": true,
    "bbox_expected": [432, 648],
    "bbox_actual": [432, 648],
    "image_hash_delta": 3,
    "image_hash_verdict": "identical",
    "text_checks": {
      "title_found": true,
      "heading_count_match": true,
      "paragraph_variance_pct": 1.2
    }
  }
}
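
One way to roll a `pdf_check` object up into a single PASS / WARN / FAIL / MISSING status is sketched below. The roll-up rules are an assumption inferred from the thresholds and the sample summary in this document (media box and paragraph variance issues are treated as warnings); `overall_status` is a hypothetical helper, not part of scripts/pdf_verify.py.

```python
def overall_status(pdf_check: dict) -> str:
    """Derive one status from a pdf_check object (assumed roll-up rules)."""
    if pdf_check.get("pdf_status") == "missing":
        return "MISSING"

    page_delta = abs(pdf_check["page_count_xhtml"] - pdf_check["page_count_pdf"])
    verdict = pdf_check["image_hash_verdict"]
    if page_delta >= 2 or verdict == "mismatch":
        return "FAIL"

    variance = pdf_check.get("text_checks", {}).get("paragraph_variance_pct", 0.0)
    has_warning = (
        page_delta == 1
        or verdict == "different"
        or not pdf_check["bbox_match"]
        or variance > 10
    )
    return "WARN" if has_warning else "PASS"


# Values copied from the sample entry above.
sample = {
    "pdf_status": "ok",
    "page_count_xhtml": 8,
    "page_count_pdf": 8,
    "bbox_match": True,
    "image_hash_delta": 3,
    "image_hash_verdict": "identical",
    "text_checks": {"title_found": True, "heading_count_match": True,
                    "paragraph_variance_pct": 1.2},
}

if __name__ == "__main__":
    print(overall_status(sample))  # PASS
```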

Markdown Summary

Generated in docs/REBRANDED_VISUAL_AUDIT.md:

| File | PDF Status | Page Match | Visual Match | Issues |
| --- | --- | --- | --- | --- |
| 9-chapter-i-... | ✅ OK | ✅ 8 pages | ✅ Identical | None |
| 15-chapter-vi-... | ⚠️ OK | ⚠️ 10 vs 11 | ✅ Similar | +1 page variance |
| 22-chapter-xii-... | ❌ MISSING | N/A | N/A | PDF not found |

Common Issues and Fixes

Issue: Page Count Mismatch

Symptom: XHTML renders as 8 pages, PDF has 9 pages

Possible causes:

•Extra blank page in PDF (page break issue)
•Different margin settings between XHTML and PDF export
•Widow/orphan control differences

How to fix:

•Open PDF in Acrobat to verify blank page
•Adjust print-pod.css orphans/widows settings:
```css
p { orphans: 2; widows: 2; }
```
•Re-export PDF from InDesign or print-to-PDF workflow
•Re-run parity check to verify

Issue: Visual Hash Mismatch

Symptom: Hash delta >30 (significant visual difference)

Possible causes:

•Font substitution in PDF vs XHTML
•Image resolution difference
•Different CSS applied (print vs digital styles)

How to fix:

•Open both side-by-side:
  - XHTML screenshot: docs/screenshots/{basename}/title_1080x1440.png
  - PDF page 1: Open in Acrobat, export as PNG
•Identify visual difference (typography, spacing, images)
•Adjust print-pod.css to match digital rendering
•Regenerate PDF
•Re-run parity check

Issue: Missing PDF

Symptom: pdf_status: "missing"

Action:

•Check whether the PDF has been generated for this chapter yet
•Verify PDF naming matches XHTML basename:
  - XHTML: 9-chapter-i-unveiling-your-creative-odyssey.xhtml
  - PDF should be: 9-chapter-i-unveiling-your-creative-odyssey.pdf
•If PDF exists with a different name, rename it to match
•If PDF doesn't exist:
  - Generate from XHTML using print-to-PDF workflow
  - Place in appropriate pdf-pod/ subdirectory
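
The basename matching described above can be checked by hand with a short lookup. This is a sketch under assumptions: `find_pdf_for` is a hypothetical helper, and searching every subdirectory of pdf-pod/ is a guess at how the layout works, not a description of scripts/pdf_verify.py.

```python
from pathlib import Path


def find_pdf_for(xhtml_path, pdf_root):
    """Return the PDF whose basename matches the XHTML file, or None.

    Searches pdf_root recursively, since PDFs may sit in a subdirectory
    such as pdf-pod/chapters/. Assumes basenames contain no glob
    metacharacters such as [ ] or *.
    """
    basename = Path(xhtml_path).stem
    matches = sorted(Path(pdf_root).rglob(f"{basename}.pdf"))
    return matches[0] if matches else None


if __name__ == "__main__":
    pdf = find_pdf_for(
        "REBRANDED_OUTPUT/xhtml/9-chapter-i-unveiling-your-creative-odyssey.xhtml",
        "REBRANDED_OUTPUT/pdf-pod",
    )
    print(pdf if pdf else "missing")
```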

Issue: Text Extraction Failure

Symptom: text_checks.title_found: false

Possible causes:

•PDF is image-only (scanned, not text-based)
•PDF has font embedding issues (text not extractable)
•PDF was generated with text as outlines/curves

How to fix:

•Verify PDF is text-based (select text in Acrobat)
•If image-only, regenerate PDF from source XHTML
•Ensure fonts are embedded as subsets (not outlines)

Example Invocation

User: "Are the print PDFs in sync with the latest EPUB chapters?"

Agent response:

•Run PDF parity checker
•Read results from updated JSON
•Summarize:

code

PDF Parity Check Complete
=========================

Total chapters: 44
PDFs verified: 42
Missing PDFs: 2

Results:
✅ PASS: 38 chapters (perfect parity)
⚠️ WARN: 4 chapters (minor variance)
❌ FAIL: 0 chapters (significant mismatch)
🔍 MISSING: 2 chapters (PDFs not found)

Warnings (4 chapters):
- Chapter VI: +1 page variance (11 vs 10) - Extra blank page in PDF
- Chapter XI: Visual hash delta 18 - Minor font rendering difference
- Chapter XV: Text variance 12% - Some paragraphs reflowed
- Chapter XXII: Media box 432×660 - Height should be 648

Missing PDFs (2 chapters):
- 30-SelfAssessment.xhtml (no matching PDF found)
- 43-DoodlePage.xhtml (no matching PDF found)

Recommended actions:
1. Fix page break in Chapter VI
2. Review font settings for Chapter XI
3. Generate missing PDFs for Self-Assessment and Doodle pages
4. Verify media box for Chapter XXII

Full report: docs/REBRANDED_VISUAL_AUDIT.md (PDF Parity column)
Detailed JSON: docs/REBRANDED_VISUAL_AUDIT.json (pdf_check objects)

Integration with Other Skills

Run after:

•epub-visual-auditor - Ensure XHTML rendering is correct first

Run before:

•Sending POD files to print vendor
•Uploading to IngramSpark or KDP Print
•Final publication package

Pair with:

•epub-publication-validator - Comprehensive pre-publication check

Notes

•PDF comparison requires the pypdf and Pillow Python libraries
•First run may be slower (generates temporary PDFs for missing files)
•Temporary reference PDFs are stored in /tmp/ and not committed to repo
•Visual hash comparison is perceptual (small rendering differences are OK)
•Re-run after any CSS or XHTML changes to verify that parity is maintained