Qdrant Chunk Retriever
This skill helps users retrieve and inspect chunks from specific PDF documents stored in the Qdrant vector database using the UTIL/retrieve_chunks_by_filename.py script.
When to Use This Skill
Activate this skill automatically when the user:
- Wants to view/inspect chunks from a specific PDF file
- Needs to debug chunk content or indexing
- Asks to "show me chunks from [filename]"
- Wants to examine how a document was chunked
- Needs to verify chunk context or metadata
- Asks about chunk content, chunk indices, or chunk details
- Uses keywords like "retrieve chunks", "show chunks", "inspect document chunks"
How to Use
Step 1: Identify the Request
Determine what the user wants to retrieve:
- All chunks from a file
- A specific chunk by index
- Chunks saved to JSON
- Full text vs. preview
Step 2: Build the Command
The script is located at UTIL/retrieve_chunks_by_filename.py and supports these options:
Basic Usage (all chunks from a file):
cd UTIL
python retrieve_chunks_by_filename.py --filename "document-name.pdf"
Retrieve specific chunk:
python retrieve_chunks_by_filename.py --filename "document.pdf" --chunk 5
Control text preview length:
python retrieve_chunks_by_filename.py --filename "document.pdf" --text-length 1000
# Or show full text:
python retrieve_chunks_by_filename.py --filename "document.pdf" --text-length -1
Save to JSON file:
python retrieve_chunks_by_filename.py --filename "document.pdf" --output chunks.json
Quiet mode (for JSON export only):
python retrieve_chunks_by_filename.py --filename "document.pdf" --output chunks.json --quiet
Step 3: Run the Command
Execute the script with the appropriate options based on the user's request.
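Steps 2 and 3 can also be driven from Python. The following is a minimal sketch, not part of the script itself: `build_command` and `run_retrieval` are hypothetical helper names, and only the flags documented above are used. It assumes the script is run from the UTIL directory, as in the basic usage example.

```python
import subprocess


def build_command(filename, chunk=None, text_length=None, output=None, quiet=False):
    """Build the argument list for retrieve_chunks_by_filename.py.

    Only flags documented in this skill are emitted; options left at
    None/False are omitted so the script's own defaults apply.
    """
    cmd = ["python", "retrieve_chunks_by_filename.py", "--filename", filename]
    if chunk is not None:  # chunk 0 is valid, so compare against None
        cmd += ["--chunk", str(chunk)]
    if text_length is not None:  # -1 means full text
        cmd += ["--text-length", str(text_length)]
    if output is not None:
        cmd += ["--output", output]
    if quiet:
        cmd.append("--quiet")
    return cmd


def run_retrieval(cmd, script_dir="UTIL"):
    """Run the command from the script directory and return its stdout.

    Assumption: the script lives in UTIL/ relative to the current directory.
    """
    result = subprocess.run(
        cmd, cwd=script_dir, capture_output=True, text=True, check=True
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with filenames that contain spaces.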
Step 4: Interpret and Present Results
The script outputs:
- Chunk index (e.g., "Chunk 5/10")
- Page number from the original PDF
- Filename and Point ID (Qdrant internal)
- Text content (truncated or full based on --text-length)
- Context fields (if available):
  - document_context: Overall document summary
  - chunk_context: Previous chunk context for continuity
Present the results to the user, highlighting:
- Number of chunks found
- Chunk content preview or full text
- Any context information available
- Suggested next steps (if debugging)
Command-Line Options Reference
| Option | Short | Description | Default |
|---|---|---|---|
| --filename | - | PDF filename to retrieve chunks from | bcy-26-income-eligibility-and-maximum-psoc-twc.pdf |
| --chunk | -c | Specific chunk index (0-indexed) | All chunks |
| --collection | - | Qdrant collection name | From config |
| --text-length | - | Max characters to display (-1 for full) | 500 |
| --output | - | Save to JSON file | None (console only) |
| --quiet | - | Suppress console output | False |
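The script's actual argument parser is not shown in this document. As an illustration only, a parser consistent with the table above could look like the sketch below; `build_parser` is a hypothetical name, and `None` stands in for "From config" and "All chunks".

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options table (not the script's real code)."""
    parser = argparse.ArgumentParser(
        description="Retrieve chunks for one PDF from Qdrant"
    )
    parser.add_argument(
        "--filename",
        default="bcy-26-income-eligibility-and-maximum-psoc-twc.pdf",
        help="PDF filename to retrieve chunks from",
    )
    parser.add_argument(
        "--chunk", "-c", type=int, default=None,
        help="Specific chunk index (0-indexed); omit for all chunks",
    )
    parser.add_argument(
        "--collection", default=None,
        help="Qdrant collection name; None here means 'take it from config'",
    )
    parser.add_argument(
        "--text-length", type=int, default=500,
        help="Max characters to display (-1 for full)",
    )
    parser.add_argument("--output", default=None, help="Save to JSON file")
    parser.add_argument(
        "--quiet", action="store_true", help="Suppress console output"
    )
    return parser
```

With a definition like this, `--text-length -1` parses as the integer -1 because no option in the parser looks like a negative number.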
Example Workflows
Example 1: View All Chunks from a Document
User: "Show me all chunks from child-care-services-guide-twc.pdf"

Claude: I'll retrieve all chunks from that document.

[Runs command]
cd UTIL
python retrieve_chunks_by_filename.py --filename "child-care-services-guide-twc.pdf"

✓ Retrieved 15 chunks from 'child-care-services-guide-twc.pdf'

The document contains 15 chunks:
- Chunk 0: Page 1 - [preview of text]
- Chunk 1: Page 2 - [preview of text]
...

Would you like to see the full text of any specific chunk?
Example 2: Inspect Specific Chunk
User: "What's in chunk 5 of the payment rates PDF?"

Claude: Let me retrieve chunk 5 from the payment rates document.

[Identifies filename, runs command]
cd UTIL
python retrieve_chunks_by_filename.py --filename "bcy25-child-care-provider-payment-rates-twc.pdf" --chunk 5 --text-length -1

Chunk 5/12 - Page 3
Filename: bcy25-child-care-provider-payment-rates-twc.pdf
Full text: [Complete chunk text shown]

This chunk contains information about [summary of content].
Example 3: Debug Chunk Context
User: "I'm getting weird results for family of 5. Can you check what's in the chunks about that?"

Claude: Let me retrieve chunks from the income eligibility document to examine the context.

[Runs command]
cd UTIL
python retrieve_chunks_by_filename.py --filename "bcy-26-income-eligibility-and-maximum-psoc-twc.pdf" --text-length -1

[Reviews output and identifies relevant chunks]

I found the "family of 5" information in chunk 3. The chunk includes:
- Main text: [relevant content]
- Chunk context: [previous chunk summary for continuity]
- Document context: [overall document summary]

The issue appears to be [explanation based on chunk analysis].
Example 4: Export Chunks to JSON
User: "Can you export all chunks from the state plan document to a JSON file?"

Claude: I'll retrieve all chunks and save them to JSON.

cd UTIL
python retrieve_chunks_by_filename.py --filename "tx-ccdf-state-plan-ffy2025-2027-approved.pdf" --output state_plan_chunks.json

✓ Saved 87 chunks to state_plan_chunks.json

The JSON file contains:
- Metadata (filename, total chunks, collection name, timestamp)
- All chunk data (text, page, index, context fields)

File saved to: UTIL/state_plan_chunks.json
Understanding the Output
Console Output Format
================================================================================
RETRIEVED CHUNKS: 10 total
================================================================================
Chunk 0/10 - Page 1
Filename: document.pdf
Point ID: 123e4567-e89b-12d3-a456-426614174000
--------------------------------------------------------------------------------
Text: [Chunk text content here...]
--------------------------------------------------------------------------------
Document Context: [Summary of the entire document]
Chunk Context: [Summary of previous chunk for continuity]
--------------------------------------------------------------------------------
[More chunks...]
JSON Output Format
{
  "metadata": {
    "filename": "document.pdf",
    "total_chunks": 10,
    "retrieved_at": "2025-01-15T10:30:00",
    "collection": "tro-child-1"
  },
  "chunks": [
    {
      "id": "point-id",
      "chunk_index": 0,
      "total_chunks": 10,
      "page": 1,
      "text": "chunk content...",
      "filename": "document.pdf",
      "source_url": "https://...",
      "has_context": true,
      "master_context": "...",
      "document_context": "...",
      "chunk_context": "..."
    }
  ]
}
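Once a file in this shape exists, it can be sanity-checked before deeper analysis. The sketch below is illustrative: `load_export` and `check_export` are hypothetical helpers, and the check assumes the export was made without --chunk, so that metadata.total_chunks should equal the number of entries in chunks.

```python
import json


def load_export(path):
    """Read a JSON export written with --output."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def check_export(export):
    """Return a list of problems found in a parsed export (empty if none)."""
    problems = []
    chunks = export.get("chunks", [])
    declared = export.get("metadata", {}).get("total_chunks")
    if declared != len(chunks):
        problems.append(
            f"metadata.total_chunks is {declared}, but {len(chunks)} chunks are present"
        )
    # chunk_index is documented as 0-indexed, so a full export should cover 0..n-1
    indices = sorted(chunk["chunk_index"] for chunk in chunks)
    if indices != list(range(len(chunks))):
        problems.append(f"chunk_index values are not contiguous from 0: {indices}")
    return problems


# Tiny hand-made export in the documented shape (not real data).
sample_export = {
    "metadata": {"filename": "document.pdf", "total_chunks": 2},
    "chunks": [
        {"chunk_index": 0, "page": 1, "text": "first"},
        {"chunk_index": 1, "page": 1, "text": "second"},
    ],
}
print(check_export(sample_export))  # prints []
```

For a real file, the same check would be `check_export(load_export("chunks.json"))`.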
Debugging Use Cases
Use Case 1: Verify Chunk Splitting
Check how a document was chunked and if chunks are appropriately sized:
python retrieve_chunks_by_filename.py --filename "doc.pdf" --text-length -1
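Reading full text on the console works for a handful of chunks; for larger documents, chunk sizes can be summarized from a JSON export instead. This is a rough sketch: `chunk_length_report` is a hypothetical helper, and the `min_chars` and `max_chars` thresholds are arbitrary placeholders, not values used by the indexing pipeline.

```python
from statistics import mean


def chunk_length_report(chunks, min_chars=200, max_chars=4000):
    """Summarize chunk text lengths and flag chunks outside the given bounds.

    `chunks` is the "chunks" list from a JSON export. The thresholds are
    illustrative defaults only.
    """
    lengths = [len(chunk.get("text", "")) for chunk in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0, "outliers": []}
    outliers = [
        chunk["chunk_index"]
        for chunk, size in zip(chunks, lengths)
        if size < min_chars or size > max_chars
    ]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "outliers": outliers,
    }
```

The `outliers` list gives chunk indices that can then be inspected one at a time with --chunk.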
Use Case 2: Investigate Missing Information
If retrieval isn't finding expected content, examine chunks to verify the text is present:
python retrieve_chunks_by_filename.py --filename "doc.pdf" --text-length -1 | grep -i "search term"
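grep shows matching lines but not which chunk they belong to. As an alternative, a JSON export can be searched so that each hit reports its chunk index and page. This is a sketch under the assumption that the export stores each chunk's complete text; `find_chunks_containing` is a hypothetical helper.

```python
def find_chunks_containing(chunks, term):
    """Return (chunk_index, page) for every chunk whose text contains `term`.

    Matching is case-insensitive, like `grep -i`. `chunks` is the "chunks"
    list from a JSON export.
    """
    needle = term.casefold()
    return [
        (chunk["chunk_index"], chunk.get("page"))
        for chunk in chunks
        if needle in chunk.get("text", "").casefold()
    ]
```

An empty result suggests the text never made it into the index for that document, which points at loading rather than retrieval.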
Use Case 3: Check Context Fields
Verify that contextual embeddings include proper context:
python retrieve_chunks_by_filename.py --filename "doc.pdf" --chunk 5 --text-length -1
Look for document_context and chunk_context fields in output.
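To run the same check across a whole document, the context fields can be scanned in a JSON export. A minimal sketch, assuming the field names shown in the JSON output format; `chunks_missing_context` is a hypothetical helper.

```python
CONTEXT_FIELDS = ("document_context", "chunk_context")


def chunks_missing_context(chunks):
    """Map chunk_index to the context fields that are absent or blank.

    `chunks` is the "chunks" list from a JSON export. Chunks with both
    fields populated are left out of the result.
    """
    missing = {}
    for chunk in chunks:
        empty = [
            field
            for field in CONTEXT_FIELDS
            if not (chunk.get(field) or "").strip()
        ]
        if empty:
            missing[chunk["chunk_index"]] = empty
    return missing
```

An empty dictionary means every chunk carries both context fields.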
Use Case 4: Export for Analysis
Save chunks to JSON for external analysis or comparison:
python retrieve_chunks_by_filename.py --filename "doc.pdf" --output analysis.json
Error Handling
Filename Not Found
If no chunks are found:
- Verify the exact filename (case-sensitive, include .pdf extension)
- Suggest running LOAD_DB/verify_qdrant.py to list all documents
- Check if the document was loaded successfully
Connection Errors
If Qdrant connection fails:
- Verify QDRANT_API_URL and QDRANT_API_KEY environment variables
- Check network connectivity
- Confirm collection name is correct
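The first of these checks is easy to script. The sketch below only tests whether the two variables named above are set and non-empty; it does not contact Qdrant, and `missing_qdrant_env` is a hypothetical helper.

```python
import os

REQUIRED_VARS = ("QDRANT_API_URL", "QDRANT_API_KEY")


def missing_qdrant_env(env=None):
    """Return the names of required variables that are unset or empty.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to exercise in isolation.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

If the returned list is empty and the connection still fails, network connectivity and the collection name are the next things to look at.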
Invalid Chunk Index
If requesting a chunk that doesn't exist:
- First retrieve all chunks to see the valid range
- Remind user that chunk indices are 0-indexed
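Because indices start at 0, a document reported as "N total" has valid indices 0 through N-1. A small illustrative check follows; `validate_chunk_index` is a hypothetical helper, not something the script provides.

```python
def validate_chunk_index(requested, total_chunks):
    """Return None if the index is valid, otherwise a message for the user."""
    if total_chunks <= 0:
        return "No chunks were found for this document."
    if 0 <= requested < total_chunks:
        return None
    return (
        f"Chunk {requested} does not exist; "
        f"valid indices are 0 to {total_chunks - 1}."
    )
```

For a 10-chunk document, requesting chunk 10 is the typical off-by-one case this catches.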
Notes
- Default filename: If no filename is specified, uses bcy-26-income-eligibility-and-maximum-psoc-twc.pdf
- Chunk ordering: Chunks are automatically sorted by chunk_index to maintain document order
- Text truncation: Default shows 500 characters; use --text-length -1 for full text
- Collection: Defaults to QDRANT_COLLECTION_NAME_CONTEXTUAL from config
- Efficient retrieval: Uses Qdrant scroll API with filtering for performance
Related Tools
- UTIL/delete_documents.py: Delete documents from Qdrant
- LOAD_DB/verify_qdrant.py: List all documents and statistics
- LOAD_DB/reload_single_pdf.py: Reload a single PDF document