DocStrange by Nanonets
DocStrange is Nanonets' document extraction API. Use it to convert documents to markdown, extract structured JSON fields, and extract tables as CSV.
What This Skill Does
DocStrange uses advanced AI models to extract structured content from documents:
- Markdown: Convert PDFs, images, Word docs, Excel to clean markdown
- JSON: Extract specific fields with confidence scores (0-100)
- CSV: Extract table data
- HTML: Formatted HTML output
- Image Analysis: OCR for scanned documents and images
- Sync/Async: Sync for <=5 pages, async for larger documents
Supported Input Formats: PDF, JPG, JPEG, PNG, TIFF, Word (.docx), Excel (.xlsx)
Base API: https://extraction-api.nanonets.com/api/v1
Setup
1. Get Your API Key
Sign up and create an API key:
# Visit the dashboard
https://docstrange.nanonets.com/app
# Or use this direct link
https://docstrange.nanonets.com/app?utm_source=openclaw
Save your API key:
export DOCSTRANGE_API_KEY="your_api_key_here"
2. OpenClaw Configuration (Optional)
Add to your ~/.openclaw/openclaw.json:
{
  skills: {
    entries: {
      "docstrange": {
        enabled: true,
        apiKey: "your_api_key_here",
        env: {
          DOCSTRANGE_API_KEY: "your_api_key_here",
        },
      },
    },
  },
}
The apiKey field is a convenience for skills that declare a primary env var. The env object injects environment variables for the agent run.
3. Process Your First Document
Quick test (sync extraction):
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
-H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
-F "file=@document.pdf" \
-F "output_format=markdown"
Response:
{
"success": true,
"record_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"result": {
"markdown": {
"content": "# Your Document\n\nExtracted content here..."
}
}
}
Inputs
Provide exactly one of:
- file (multipart upload; best for local files)
- file_url (HTTPS URL)
- file_base64 (base64 content; use only when needed)
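The exactly-one rule above can be checked on the client before any request is sent. The following is a minimal sketch: the helper name `pick_input` is hypothetical and not part of the DocStrange API, and the HTTPS check simply mirrors the "HTTPS URL" note in the list.

```python
def pick_input(file=None, file_url=None, file_base64=None):
    """Return (field_name, value) for the single input that was provided.

    Hypothetical client-side helper: raises ValueError unless exactly one
    of the three inputs is set.
    """
    candidates = {"file": file, "file_url": file_url, "file_base64": file_base64}
    provided = [(name, value) for name, value in candidates.items() if value is not None]
    if len(provided) != 1:
        raise ValueError(
            f"Provide exactly one of file, file_url, file_base64 (got {len(provided)})"
        )
    name, value = provided[0]
    # The input list describes file_url as an HTTPS URL, so reject anything else early.
    if name == "file_url" and not value.startswith("https://"):
        raise ValueError("file_url must be an HTTPS URL")
    return name, value
```

Failing fast here avoids a round trip that would likely end in a 400 response.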
Workflow 1: extract_any (Markdown)
Convert any document to clean markdown. Use sync for small docs (<=5 pages).
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
-H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
-F "file=@/path/to/document.pdf" \
-F "output_format=markdown"
Response:
{
"success": true,
"message": "Extraction completed successfully",
"record_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"result": {
"markdown": {
"content": "# Invoice\n\n**Invoice Number:** INV-2024-001\n**Date:** 2024-01-15\n\n| Item | Quantity | Price |\n|------|----------|-------|\n| Widget A | 10 | $50.00 |",
"metadata": {}
}
},
"processing_time": 1.23
}
Access the content: response["result"]["markdown"]["content"]
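For callers working in Python rather than curl, the same sync request could look roughly like the sketch below. It assumes the third-party `requests` library is installed. The function names and the 120-second timeout are illustrative choices, not values taken from the DocStrange documentation.

```python
API_BASE = "https://extraction-api.nanonets.com/api/v1"


def build_sync_request(api_key, output_format="markdown"):
    """Return (url, headers, form_fields) mirroring the curl call above."""
    url = f"{API_BASE}/extract/sync"
    headers = {"Authorization": f"Bearer {api_key}"}
    form_fields = {"output_format": output_format}
    return url, headers, form_fields


def extract_markdown(path, api_key):
    """Upload a local file and return the extracted markdown string."""
    import requests  # third-party; imported lazily so the builder works without it

    url, headers, form_fields = build_sync_request(api_key)
    with open(path, "rb") as fh:
        # data= and files= together produce a multipart/form-data body,
        # which is what curl's -F flags send.
        response = requests.post(
            url, headers=headers, data=form_fields, files={"file": fh}, timeout=120
        )
    response.raise_for_status()
    body = response.json()
    if not body.get("success"):
        raise RuntimeError(body.get("message", "Extraction failed"))
    return body["result"]["markdown"]["content"]
```

A call such as `extract_markdown("document.pdf", os.environ["DOCSTRANGE_API_KEY"])` would then return the markdown text directly.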
Workflow 2: extract_fields (JSON)
Extract specific fields using either a field list or a JSON schema.
Option A: Field List (Simple)
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
-H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
-F "file=@/path/to/invoice.pdf" \
-F "output_format=json" \
-F 'json_options=["invoice_number", "date", "total_amount", "vendor"]' \
-F "include_metadata=confidence_score"
Response with Confidence Scores:
{
"success": true,
"message": "Extraction completed successfully",
"record_id": "550e8400-e29b-41d4-a716-446655440001",
"status": "completed",
"result": {
"json": {
"content": {
"invoice_number": "INV-2024-001",
"date": "2024-01-15",
"vendor": "Acme Corp",
"total_amount": 500.00
},
"metadata": {
"confidence_score": {
"invoice_number": 98,
"date": 95,
"total_amount": 99,
"vendor": 96
}
}
}
},
"processing_time": 2.45
}
Access the content: response["result"]["json"]["content"]
Access confidence: response["result"]["json"]["metadata"]["confidence_score"]
Option B: JSON Schema (Typed)
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
-H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
-F "file=@/path/to/invoice.pdf" \
-F "output_format=json" \
-F 'json_options={"type": "object", "properties": {"invoice_number": {"type": "string", "description": "Unique invoice ID"}, "total_amount": {"type": "number", "description": "Total amount due"}}}'
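Both options pass `json_options` as JSON text inside a form field, so building that text with `json.dumps` avoids shell-quoting mistakes. This is a sketch only: `json_extraction_fields` is a hypothetical helper, and the returned dict is meant to be sent as form data alongside the file.

```python
import json


def json_extraction_fields(spec, with_confidence=False):
    """Build form fields for output_format=json.

    `spec` is either a list of field names (Option A) or a JSON Schema
    dict (Option B); both are serialized into the json_options field.
    """
    if not isinstance(spec, (list, dict)):
        raise TypeError("spec must be a list of field names or a schema dict")
    fields = {"output_format": "json", "json_options": json.dumps(spec)}
    if with_confidence:
        # Same form field the curl example for Option A sends.
        fields["include_metadata"] = "confidence_score"
    return fields


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Unique invoice ID"},
        "total_amount": {"type": "number", "description": "Total amount due"},
    },
}
typed_fields = json_extraction_fields(schema)
simple_fields = json_extraction_fields(["invoice_number", "date"], with_confidence=True)
```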
Workflow 3: pdf_to_csv (Table Extraction)
Extract tables as CSV:
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
-H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
-F "file=@/path/to/table.pdf" \
-F "output_format=csv" \
-F "csv_options=table"
Response:
{
"success": true,
"message": "Extraction completed successfully",
"record_id": "550e8400-e29b-41d4-a716-446655440002",
"status": "completed",
"result": {
"csv": {
"content": "Item,Quantity,Price,Amount\nWidget A,10,$50.00,$500.00\nWidget B,5,$30.00,$150.00",
"metadata": {}
}
},
"processing_time": 1.85
}
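The CSV arrives as a single string, so Python's standard `csv` module can turn it into rows. The sketch below assumes the content holds one table with a header row, as in the sample response. How several tables would be delimited is not shown in this document, so that case is not handled.

```python
import csv
import io


def csv_rows(response):
    """Parse result.csv.content into a list of dicts keyed by column header."""
    content = response["result"]["csv"]["content"]
    return list(csv.DictReader(io.StringIO(content)))


# Trimmed copy of the sample response above.
sample = {
    "result": {
        "csv": {
            "content": "Item,Quantity,Price,Amount\nWidget A,10,$50.00,$500.00\nWidget B,5,$30.00,$150.00"
        }
    }
}
rows = csv_rows(sample)
print(rows[0]["Item"], rows[0]["Amount"])  # Widget A $500.00
```

All values come back as strings. Converting amounts such as `$500.00` to numbers is left to the caller.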
Async Extraction (for >5 Pages)
For documents with more than 5 pages, use the async endpoint and poll for results.
Step 1: Queue the Document
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/async" \
-H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
-F "file=@/path/to/large-document.pdf" \
-F "output_format=markdown"
Response:
{
"success": true,
"message": "Extraction job queued for processing. Use the record_id to check status.",
"record_id": "12345",
"status": "processing",
"result": null,
"filename": "large-document.pdf"
}
Step 2: Poll for Results
curl -X GET "https://extraction-api.nanonets.com/api/v1/extract/results/12345" \
-H "Authorization: Bearer $DOCSTRANGE_API_KEY"
Response (still processing):
{
"success": false,
"message": "Extraction is still processing. Please check back later.",
"record_id": "12345",
"status": "processing",
"result": null
}
Response (completed):
{
"success": true,
"message": "Extraction completed successfully",
"record_id": "12345",
"status": "completed",
"result": {
"markdown": {
"content": "# Document Title\n\nExtracted content...",
"metadata": {}
}
},
"processing_time": 15.5,
"pages_processed": 100
}
Python Polling Example
import requests
import time


def poll_result(record_id, api_key, max_wait=300, interval=5):
    """Poll for async extraction result."""
    headers = {"Authorization": f"Bearer {api_key}"}
    start = time.time()
    # Keep polling until the job finishes or max_wait seconds have elapsed.
    while time.time() - start < max_wait:
        response = requests.get(
            f"https://extraction-api.nanonets.com/api/v1/extract/results/{record_id}",
            headers=headers,
        )
        result = response.json()
        if result["status"] == "completed":
            return result
        elif result["status"] == "failed":
            raise Exception(f"Extraction failed: {result['message']}")
        # Still processing: wait before the next check.
        time.sleep(interval)
    raise TimeoutError("Extraction timed out")
Advanced Features
Bounding Boxes
Get coordinate data for each element (useful for document layout analysis):
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
-H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
-F "file=@/path/to/document.pdf" \
-F "output_format=markdown" \
-F "include_metadata=bounding_boxes"
Response includes element coordinates:
{
"result": {
"markdown": {
"content": "# Invoice...",
"metadata": {
"bounding_boxes": {
"elements": [
{
"content": "## Page 1",
"bounding_box": {
"x": 0.117,
"y": 0.072,
"width": 0.002,
"height": 0.002,
"confidence": 0.98,
"page": 1,
"normalized": true
}
}
],
"coordinates_normalized": true
}
}
}
}
}
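To draw these boxes on a rendered page, the normalized values need scaling to pixels. The sketch below assumes that `x` and `width` are fractions of the page width and that `y` and `height` are fractions of the page height. The sample response does not state this, so treat it as an assumption to verify against the full API docs. `to_pixels` is a hypothetical helper.

```python
def to_pixels(bounding_box, page_width, page_height):
    """Scale a normalized bounding box to integer pixel coordinates.

    Assumes x/width are fractions of the page width and y/height are
    fractions of the page height (not confirmed by the sample response).
    """
    keys = ("x", "y", "width", "height")
    if not bounding_box.get("normalized"):
        # Already absolute: return the geometry unchanged.
        return {key: bounding_box[key] for key in keys}
    return {
        "x": round(bounding_box["x"] * page_width),
        "y": round(bounding_box["y"] * page_height),
        "width": round(bounding_box["width"] * page_width),
        "height": round(bounding_box["height"] * page_height),
    }


# Bounding box copied from the sample response above.
box = {"x": 0.117, "y": 0.072, "width": 0.002, "height": 0.002,
       "confidence": 0.98, "page": 1, "normalized": True}
pixels = to_pixels(box, page_width=1000, page_height=2000)
```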
Hierarchy Output
Extract document structure with sections, tables, and key-value pairs:
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
-H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
-F "file=@/path/to/document.pdf" \
-F "output_format=json" \
-F "json_options=hierarchy_output"
Response:
{
"result": {
"json": {
"content": {
"document": {
"title": "Invoice Document",
"sections": [
{
"id": "page_1_section_1",
"title": "Company Information",
"level": 1,
"content": "ACME CORPORATION\n123 Business Street"
}
],
"tables": [
{
"id": "page_1_table_1",
"title": "Invoice Items",
"headers": ["Item", "Quantity", "Price"],
"rows": [["Widget A", "10", "$50.00"]]
}
],
"key_value_pairs": [
{"key": "Invoice Number", "value": "INV-2024-001"}
]
}
}
}
}
}
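Each table in the hierarchy output carries its headers and rows separately, so a small helper can pair them up for downstream use. This is a sketch based only on the response shape shown above; `tables_as_records` is a hypothetical name.

```python
def tables_as_records(response):
    """Map each table id to a list of row dicts keyed by the table headers."""
    document = response["result"]["json"]["content"]["document"]
    records = {}
    for table in document.get("tables", []):
        headers = table["headers"]
        # zip pairs each cell with its column header, row by row.
        records[table["id"]] = [dict(zip(headers, row)) for row in table["rows"]]
    return records


# Trimmed copy of the sample response above.
sample = {
    "result": {"json": {"content": {"document": {
        "title": "Invoice Document",
        "tables": [{
            "id": "page_1_table_1",
            "title": "Invoice Items",
            "headers": ["Item", "Quantity", "Price"],
            "rows": [["Widget A", "10", "$50.00"]],
        }],
    }}}}
}
records = tables_as_records(sample)
```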
Financial Documents Mode
Optimized for financial documents with enhanced table and number formatting:
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
-H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
-F "file=@/path/to/financial-report.pdf" \
-F "output_format=markdown" \
-F "markdown_options=financial-docs"
Custom Instructions
Guide the extraction with custom prompts:
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
-H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
-F "file=@/path/to/document.pdf" \
-F "output_format=markdown" \
-F "custom_instructions=Focus on extracting financial data. Ignore headers and footers." \
-F "prompt_mode=append"
Multiple Output Formats
Request multiple formats in a single call:
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
-H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
-F "file=@/path/to/document.pdf" \
-F "output_format=markdown,json"
Response contains both formats:
{
"result": {
"markdown": {
"content": "## Financial Data\n\n**Total:** $500.00",
"metadata": {}
},
"json": {
"content": {"total_amount": 500.00, "currency": "USD"},
"metadata": {}
}
}
}
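Because every requested format appears as its own key under `result`, one loop can collect them all. This is a small sketch with a hypothetical helper name. It also tolerates `result` being null, as it is in the still-processing response.

```python
def contents_by_format(response):
    """Return {format_name: content} for every format present in result."""
    result = response.get("result") or {}
    return {fmt: payload["content"] for fmt, payload in result.items()}


# Copy of the sample response above.
sample = {
    "result": {
        "markdown": {"content": "## Financial Data\n\n**Total:** $500.00", "metadata": {}},
        "json": {"content": {"total_amount": 500.00, "currency": "USD"}, "metadata": {}},
    }
}
contents = contents_by_format(sample)
```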
When to Use This Skill
Use DocStrange For:
- Invoice and receipt processing
- Contract text extraction
- Bank statement parsing
- Form digitization
- Any document to markdown conversion
- Image OCR (scanned documents, photos of documents)
- Structured data extraction with validation
Don't Use For:
- Documents >5 pages with sync endpoint (use async instead)
- Video or audio transcription
- Non-document image analysis (photos, artwork)
- Real-time streaming requirements (processing takes 1-15+ seconds)
Best Practices
Choosing Sync vs Async
| Document Size | Endpoint | Notes |
|---|---|---|
| <=5 pages | /extract/sync | Immediate response |
| >5 pages | /extract/async | Poll /extract/results/{record_id} |
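The table above reduces to a one-line decision once the page count is known. The sketch below assumes the caller already has that count, for example from a PDF library, which is outside the scope of this document. `endpoint_for` and `SYNC_PAGE_LIMIT` are hypothetical names.

```python
API_BASE = "https://extraction-api.nanonets.com/api/v1"
SYNC_PAGE_LIMIT = 5  # threshold taken from the table above


def endpoint_for(page_count):
    """Pick the sync or async endpoint URL from a document's page count."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    path = "/extract/sync" if page_count <= SYNC_PAGE_LIMIT else "/extract/async"
    return API_BASE + path
```

When the page count is unknown, defaulting to the async endpoint and polling is the more conservative choice.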
JSON Extraction Methods
- Field list (simple): json_options=["invoice_number", "date", "total"]
  - Best for: Quick extractions with known fields
- JSON schema (typed): json_options={"type": "object", "properties": {...}}
  - Best for: Strict typing, nested structures, field descriptions
Confidence Scores
- Add include_metadata=confidence_score for critical extractions
- Scores are 0-100 per field (not 0-1)
- Review fields with confidence <80 manually
- Only available with json_options (field list or schema)
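The manual-review rule above is easy to automate as a filter over the confidence metadata. This is a minimal sketch, assuming the response shape shown under Workflow 2. `fields_needing_review` is a hypothetical helper, and its default threshold of 80 comes from the guideline in this list.

```python
def fields_needing_review(response, threshold=80):
    """Return {field: score} for fields scoring below threshold (0-100 scale)."""
    metadata = response["result"]["json"].get("metadata") or {}
    scores = metadata.get("confidence_score") or {}
    return {name: score for name, score in scores.items() if score < threshold}


# Confidence scores copied from the Workflow 2 sample response.
sample = {
    "result": {"json": {
        "content": {"invoice_number": "INV-2024-001", "date": "2024-01-15",
                    "vendor": "Acme Corp", "total_amount": 500.00},
        "metadata": {"confidence_score": {"invoice_number": 98, "date": 95,
                                          "total_amount": 99, "vendor": 96}},
    }}
}
flagged = fields_needing_review(sample)               # nothing falls below 80
strict = fields_needing_review(sample, threshold=97)  # a stricter cut-off flags two fields
```

If the request omitted `include_metadata=confidence_score`, the helper returns an empty dict. That result means no scores were available, not that every field passed.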
Input Methods
- Prefer file or file_url over file_base64 to save context tokens
- Use file for local files (multipart upload)
- Use file_url for publicly accessible URLs
Schema Templates
Invoice Schema
{
"type": "object",
"properties": {
"invoice_number": {"type": "string", "description": "Unique invoice ID"},
"invoice_date": {"type": "string", "description": "Invoice date in YYYY-MM-DD"},
"vendor": {"type": "string", "description": "Vendor company name"},
"total": {"type": "number", "description": "Total amount due including tax"},
"line_items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"description": {"type": "string"},
"quantity": {"type": "number"},
"price": {"type": "number"}
}
}
}
}
}
Receipt Schema
{
"type": "object",
"properties": {
"merchant": {"type": "string", "description": "Store or merchant name"},
"date": {"type": "string", "description": "Transaction date"},
"total": {"type": "number", "description": "Total amount paid"},
"items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"}
}
}
}
}
}
Contract Schema
{
"type": "object",
"properties": {
"parties": {
"type": "array",
"items": {"type": "string"},
"description": "Names of all parties in the contract"
},
"effective_date": {"type": "string", "description": "Contract start date"},
"term_length": {"type": "string", "description": "Duration of contract"},
"termination_clause": {"type": "string", "description": "Conditions for termination"}
}
}
Troubleshooting
Error: 400 Bad Request
- Provide exactly one input: file, file_url, or file_base64
- Verify API key is valid and has not expired
- Check that the file format is supported
Sync Timeout / Slow Response
- Use async endpoint for documents >5 pages
- Poll /extract/results/{record_id} until status is completed
Missing Confidence Scores
- Confidence scores require json_options (field list or schema)
- Add include_metadata=confidence_score to your request
Empty or Partial Results
- Check if the document is scanned/image-based (OCR may need more time)
- For complex layouts, try markdown_options=financial-docs
- Use custom_instructions to guide extraction
Support
- Full API Docs: https://docstrange.nanonets.com/docs
- Documentation Index: https://docstrange.nanonets.com/docs/llms.txt
- Get API Key: https://docstrange.nanonets.com/app