Unstructured PDF Generation
Generate realistic synthetic PDF documents with an LLM for RAG (Retrieval-Augmented Generation) and other unstructured data use cases.
Overview
This skill uses the generate_pdf_documents MCP tool to create professional PDF documents with:
- LLM-generated content based on your description
- Accompanying JSON files with questions and evaluation guidelines (for RAG testing)
- Automatic upload to Unity Catalog Volumes
Quick Start
Use the generate_pdf_documents MCP tool:
- `catalog`: "my_catalog"
- `schema`: "my_schema"
- `description`: "Technical documentation for a cloud infrastructure platform including setup guides, troubleshooting procedures, and API references."
- `count`: 10
This generates 10 PDF documents and saves them to /Volumes/my_catalog/my_schema/raw_data/pdf_documents/ (using default volume and folder).
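The output location follows directly from the parameters. As a minimal sketch, assuming the `/Volumes/<catalog>/<schema>/<volume>/<folder>/` layout shown in the example above, the path can be composed like this. `build_output_path` is a hypothetical helper for illustration and is not part of the tool.

```python
def build_output_path(catalog: str, schema: str,
                      volume: str = "raw_data",
                      folder: str = "pdf_documents") -> str:
    """Compose the Unity Catalog Volume folder path (hypothetical helper).

    The defaults mirror the tool's documented default volume and folder.
    """
    return f"/Volumes/{catalog}/{schema}/{volume}/{folder}/"


# Path for the Quick Start call, which relies on the defaults.
print(build_output_path("my_catalog", "my_schema"))
```

Passing `volume` and `folder` explicitly gives the path for a custom location.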
With Custom Location
Use the generate_pdf_documents MCP tool:
- `catalog`: "my_catalog"
- `schema`: "my_schema"
- `description`: "HR policy documents..."
- `count`: 10
- `volume`: "custom_volume"
- `folder`: "hr_policies"
- `overwrite_folder`: true
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| catalog | string | Yes | - | Unity Catalog name |
| schema | string | Yes | - | Schema name |
| description | string | Yes | - | Detailed description of what the PDFs should contain |
| count | int | Yes | - | Number of PDFs to generate |
| volume | string | No | raw_data | Volume name (created if it does not exist) |
| folder | string | No | pdf_documents | Folder within the volume for output files |
| doc_size | string | No | MEDIUM | Document size: SMALL (~1 page), MEDIUM (~5 pages), LARGE (~10+ pages) |
| overwrite_folder | bool | No | false | If true, deletes existing folder contents first |
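If you assemble the arguments in code before calling the tool, it can help to check them against the table first. The sketch below is one way to do that. `PdfGenerationArgs` is a hypothetical container, and the rule that `count` must be at least 1 is an assumption rather than documented tool behavior.

```python
from dataclasses import dataclass, asdict

VALID_DOC_SIZES = {"SMALL", "MEDIUM", "LARGE"}


@dataclass
class PdfGenerationArgs:
    """Hypothetical container mirroring the parameter table above."""
    catalog: str
    schema: str
    description: str
    count: int
    volume: str = "raw_data"
    folder: str = "pdf_documents"
    doc_size: str = "MEDIUM"
    overwrite_folder: bool = False

    def __post_init__(self) -> None:
        # Required string parameters must be non-empty.
        for name in ("catalog", "schema", "description"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} is required")
        # Assumption: a non-positive count is never meaningful.
        if self.count < 1:
            raise ValueError("count must be a positive integer")
        if self.doc_size not in VALID_DOC_SIZES:
            raise ValueError(f"doc_size must be one of {sorted(VALID_DOC_SIZES)}")


args = PdfGenerationArgs(
    catalog="my_catalog",
    schema="my_schema",
    description="HR policy documents...",
    count=10,
)
print(asdict(args))
```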
Document Size Guide
- SMALL: ~1 page, concise content. Best for quick demos or testing.
- MEDIUM: ~4-6 pages, comprehensive coverage. Good balance for most use cases.
- LARGE: ~10+ pages, exhaustive documentation. Use for thorough RAG evaluation.
Output Files
For each document, the tool creates two files:
- PDF file (`<model_id>.pdf`): The generated document
- JSON file (`<model_id>.json`): Metadata for RAG evaluation
JSON Structure
{
  "title": "API Authentication Guide",
  "category": "Technical",
  "pdf_path": "/Volumes/catalog/schema/volume/folder/doc_001.pdf",
  "question": "What authentication methods are supported by the API?",
  "guideline": "Answer should mention OAuth 2.0, API keys, and JWT tokens with their use cases."
}
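Before feeding these files into an evaluation run, it may be worth confirming that each record carries the five fields shown above. A minimal sketch follows. `parse_metadata` is a hypothetical helper, and treating all five keys as mandatory is an assumption based on this example.

```python
import json

# Keys taken from the example record above.
REQUIRED_KEYS = {"title", "category", "pdf_path", "question", "guideline"}


def parse_metadata(raw: str) -> dict:
    """Parse one metadata JSON string and check the expected keys are present."""
    record = json.loads(raw)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"metadata is missing keys: {sorted(missing)}")
    return record


sample = """
{
  "title": "API Authentication Guide",
  "category": "Technical",
  "pdf_path": "/Volumes/catalog/schema/volume/folder/doc_001.pdf",
  "question": "What authentication methods are supported by the API?",
  "guideline": "Answer should mention OAuth 2.0, API keys, and JWT tokens with their use cases."
}
"""
record = parse_metadata(sample)
print(record["question"])
```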
Common Patterns
Pattern 1: HR Policy Documents
Use the generate_pdf_documents MCP tool:
- `catalog`: "ai_dev_kit"
- `schema`: "hr_demo"
- `description`: "HR policy documents for a technology company including employee handbook, leave policies, performance review procedures, benefits guide, and workplace conduct guidelines."
- `count`: 15
- `folder`: "hr_policies"
- `overwrite_folder`: true
Pattern 2: Technical Documentation
Use the generate_pdf_documents MCP tool:
- `catalog`: "ai_dev_kit"
- `schema`: "tech_docs"
- `description`: "Technical documentation for a SaaS analytics platform including installation guides, API references, troubleshooting procedures, security best practices, and integration tutorials."
- `count`: 20
- `folder`: "product_docs"
- `overwrite_folder`: true
Pattern 3: Financial Reports
Use the generate_pdf_documents MCP tool:
- `catalog`: "ai_dev_kit"
- `schema`: "finance_demo"
- `description`: "Financial documents for a retail company including quarterly reports, expense policies, budget guidelines, and audit procedures."
- `count`: 12
- `folder`: "reports"
- `overwrite_folder`: true
Pattern 4: Training Materials
Use the generate_pdf_documents MCP tool:
- `catalog`: "ai_dev_kit"
- `schema`: "training"
- `description`: "Training materials for new software developers including onboarding guides, coding standards, code review procedures, and deployment workflows."
- `count`: 8
- `folder`: "courses"
- `overwrite_folder`: true
Workflow
- Ask for destination: Default to `ai_dev_kit` catalog, ask user for schema name
- Get description: Ask what kind of documents they need
- Generate PDFs: Call `generate_pdf_documents` MCP tool with appropriate parameters
- Verify output: Check the volume path for generated files
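The verification step can be sketched in a few lines. This assumes the volume folder is readable as an ordinary filesystem path from where the code runs, and that each PDF shares its base name with its JSON file, as described under Output Files. `verify_output` is a hypothetical helper, and the temporary directory only stands in for a real volume path.

```python
import tempfile
from pathlib import Path


def verify_output(folder: str) -> dict:
    """Count PDFs and JSON files and list base names that lack a partner file."""
    root = Path(folder)
    pdf_stems = {p.stem for p in root.glob("*.pdf")}
    json_stems = {p.stem for p in root.glob("*.json")}
    return {
        "pdf_count": len(pdf_stems),
        "json_count": len(json_stems),
        # Names present as only one of the two file types.
        "unpaired": sorted(pdf_stems ^ json_stems),
    }


# Demo against a throwaway folder; point this at the real volume folder instead.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("doc_001.pdf", "doc_001.json", "doc_002.pdf"):
        (Path(tmp) / name).touch()
    report = verify_output(tmp)
    print(report)
```

A non-empty `unpaired` list or a `pdf_count` below the requested `count` suggests the run did not complete cleanly.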
Best Practices
- Detailed descriptions: The more specific your description, the better the generated content
  - BAD: "Generate some HR documents"
  - GOOD: "HR policy documents for a technology company including employee handbook covering remote work policies, leave policies with PTO and sick leave details, performance review procedures with quarterly and annual cycles, and workplace conduct guidelines"
- Appropriate count:
  - For demos: 5-10 documents
  - For RAG testing: 15-30 documents
  - For comprehensive evaluation: 50+ documents
- Folder organization: Use descriptive folder names that indicate content type
  - `hr_policies/`
  - `technical_docs/`
  - `training_materials/`
- Use overwrite_folder: Set to `true` when regenerating to ensure a clean state
Integration with RAG Pipelines
The generated JSON files are designed for RAG evaluation:
- Ingest PDFs: Use the PDF files as source documents for your vector database
- Test retrieval: Use the `question` field to query your RAG system
- Evaluate answers: Use the `guideline` field to assess if the RAG response is correct
Example evaluation workflow:
# Load questions from JSON files
# (load_json_files, rag_system, and evaluate_response are placeholders for your own code)
questions = load_json_files(f"/Volumes/{catalog}/{schema}/{volume}/{folder}/*.json")

for q in questions:
    # Query RAG system
    response = rag_system.query(q["question"])
    # Evaluate using guideline
    is_correct = evaluate_response(response, q["guideline"])
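The workflow above leaves its helpers undefined. One possible way to fill them in is sketched below, assuming the JSON files can be read through a normal file path. `run_evaluation`, `fake_query`, and `keyword_judge` are hypothetical names: the last two are crude stand-ins for a real RAG system and a real judge (often an LLM-based one), and are here only so the example runs.

```python
import glob
import json
from typing import Callable


def load_json_files(pattern: str) -> list[dict]:
    """Load every JSON file matching a glob pattern into a list of dicts."""
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            records.append(json.load(f))
    return records


def run_evaluation(records: list[dict],
                   query: Callable[[str], str],
                   judge: Callable[[str, str], bool]) -> float:
    """Return the fraction of questions whose answer the judge accepts."""
    if not records:
        return 0.0
    passed = sum(judge(query(r["question"]), r["guideline"]) for r in records)
    return passed / len(records)


# Stand-in for a RAG system: always returns the same answer.
def fake_query(question: str) -> str:
    return "The API supports OAuth 2.0, API keys, and JWT tokens."


# Stand-in for a judge: checks a fixed set of terms and ignores the guideline text.
def keyword_judge(response: str, guideline: str) -> bool:
    return all(term in response for term in ("OAuth 2.0", "API keys", "JWT"))


records = [{
    "question": "What authentication methods are supported by the API?",
    "guideline": "Answer should mention OAuth 2.0, API keys, and JWT tokens with their use cases.",
}]
print(run_evaluation(records, fake_query, keyword_judge))
```

In practice, `records` would come from `load_json_files` pointed at the output folder, and the two stand-ins would be replaced by calls into your own pipeline.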
Environment Configuration
The tool requires LLM configuration via environment variables:
# Databricks Foundation Models (default)
LLM_PROVIDER=DATABRICKS
DATABRICKS_MODEL=databricks-meta-llama-3-3-70b-instruct

# Or Azure OpenAI
LLM_PROVIDER=AZURE
AZURE_OPENAI_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_DEPLOYMENT=gpt-4o
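A quick preflight check can catch missing variables before a long generation run. The sketch below is illustrative only: `resolve_llm_config` is a hypothetical helper, and which variables are strictly required for each provider is an assumption drawn from the example configuration above, not from the tool's source.

```python
import os
from typing import Mapping, Optional


def resolve_llm_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pick provider settings from environment variables (illustrative only)."""
    env = os.environ if env is None else env
    # Databricks is described as the default provider.
    provider = env.get("LLM_PROVIDER", "DATABRICKS").upper()
    if provider == "DATABRICKS":
        required = ["DATABRICKS_MODEL"]
    elif provider == "AZURE":
        # Assumption: all three Azure variables are needed together.
        required = ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY",
                    "AZURE_OPENAI_DEPLOYMENT"]
    else:
        raise ValueError(f"unsupported LLM_PROVIDER: {provider}")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError(f"missing environment variables: {missing}")
    return {"provider": provider, **{name: env[name] for name in required}}


# Pass a plain dict to test a configuration without touching os.environ.
config = resolve_llm_config({
    "LLM_PROVIDER": "DATABRICKS",
    "DATABRICKS_MODEL": "databricks-meta-llama-3-3-70b-instruct",
})
print(config["provider"])
```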
Common Issues
| Issue | Solution |
|---|---|
| "No LLM endpoint configured" | Set DATABRICKS_MODEL or AZURE_OPENAI_DEPLOYMENT environment variable |
| "Volume does not exist" | The tool creates volumes automatically; ensure you have CREATE VOLUME permission |
| "PDF generation timeout" | Reduce count or check LLM endpoint availability |
| Low-quality content | Provide a more detailed description with specific topics and document types |