docstrange

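AI-powered document extraction service by Nanonets. Converts PDFs, images, and other documents to Markdown, JSON, CSV, or HTML, with confidence scores. Use this skill when the user asks to convert a PDF to Markdown, extract invoice data, OCR a document, extract specific fields from a PDF, parse a receipt, convert a table to CSV, extract information from a bank statement, or needs structured data extraction.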

SKILL.md
--- frontmatter
name: docstrange
description: AI-powered document extraction by Nanonets. Converts PDFs, images, and documents to markdown, JSON, CSV, or HTML with confidence scoring. Use when user asks to convert PDF to markdown, extract invoice data, OCR a document, extract fields from PDF, parse receipt, convert table to CSV, extract from bank statement, or needs structured data extraction.

DocStrange by Nanonets

DocStrange is Nanonets' document extraction API. Use it to convert documents to markdown, extract structured JSON fields, and extract tables as CSV.

What This Skill Does

DocStrange uses advanced AI models to extract structured content from documents:

  • Markdown: Convert PDFs, images, Word docs, Excel to clean markdown
  • JSON: Extract specific fields with confidence scores (0-100)
  • CSV: Extract table data
  • HTML: Formatted HTML output
  • Image Analysis: OCR for scanned documents and images
  • Sync/Async: Sync for <=5 pages, async for larger documents

Supported Input Formats: PDF, JPG, JPEG, PNG, TIFF, Word (.docx), Excel (.xlsx)

Base API: https://extraction-api.nanonets.com/api/v1

Setup

1. Get Your API Key

Sign up and create an API key:

bash
# Visit the dashboard
https://docstrange.nanonets.com/app

# Or use this direct link
https://docstrange.nanonets.com/app?utm_source=openclaw

Save your API key:

bash
export DOCSTRANGE_API_KEY="your_api_key_here"

2. OpenClaw Configuration (Optional)

Add to your ~/.openclaw/openclaw.json:

json5
{
  skills: {
    entries: {
      "docstrange": {
        enabled: true,
        apiKey: "your_api_key_here",
        env: {
          DOCSTRANGE_API_KEY: "your_api_key_here",
        },
      },
    },
  },
}

The apiKey field is a convenience for skills that declare a primary env var. The env object injects environment variables for the agent run.
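Once the key is exported or injected, client code can read it from the environment instead of hard-coding it. The following is a minimal sketch: the helper name `auth_headers` is hypothetical, and only the `DOCSTRANGE_API_KEY` variable and the Bearer scheme are taken from this document.

```python
import os


def auth_headers(env=None):
    """Build the Authorization header from DOCSTRANGE_API_KEY.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    api_key = env.get("DOCSTRANGE_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("DOCSTRANGE_API_KEY is not set")
    return {"Authorization": f"Bearer {api_key}"}
```

Failing early on a missing key gives a clearer error than an authentication failure returned by the API.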

3. Process Your First Document

Quick test (sync extraction):

bash
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"

Response:

json
{
  "success": true,
  "record_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "result": {
    "markdown": {
      "content": "# Your Document\n\nExtracted content here..."
    }
  }
}

Inputs

Provide exactly one of:

  • file (multipart upload; best for local files)
  • file_url (HTTPS URL)
  • file_base64 (base64 content; use only when needed)
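The "exactly one input" rule can be enforced on the client before a request is sent. The sketch below is illustrative only: `build_input_fields` and its return shape are hypothetical, while the three parameter names come from the list above.

```python
def build_input_fields(file_path=None, file_url=None, file_base64=None):
    """Return (form_fields, files) for exactly one input source.

    Raises ValueError when zero or more than one source is given.
    """
    provided = {
        "file": file_path,
        "file_url": file_url,
        "file_base64": file_base64,
    }
    chosen = {name: value for name, value in provided.items() if value is not None}
    if len(chosen) != 1:
        raise ValueError(
            f"Provide exactly one of file, file_url, file_base64 (got {len(chosen)})"
        )
    name, value = next(iter(chosen.items()))
    if name == "file":
        # Local files belong in the multipart part; the caller opens the path.
        return {}, {"file": value}
    return {name: value}, {}
```

Checking this locally avoids a round trip that would otherwise end in a 400 response.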

Workflow 1: extract_any (Markdown)

Convert any document to clean markdown. Use sync for small docs (<=5 pages).

bash
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@/path/to/document.pdf" \
  -F "output_format=markdown"

Response:

json
{
  "success": true,
  "message": "Extraction completed successfully",
  "record_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "result": {
    "markdown": {
      "content": "# Invoice\n\n**Invoice Number:** INV-2024-001\n**Date:** 2024-01-15\n\n| Item | Quantity | Price |\n|------|----------|-------|\n| Widget A | 10 | $50.00 |",
      "metadata": {}
    }
  },
  "processing_time": 1.23
}

Access the content: response["result"]["markdown"]["content"]
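A small accessor can check the `success` and `status` fields before reading the content. This is a sketch under the assumption that every completed response has the shape shown above; `get_content` is a hypothetical helper, not part of the API.

```python
def get_content(response, output_format="markdown"):
    """Return result[output_format]["content"] from a parsed JSON response.

    Raises RuntimeError if the extraction did not complete successfully.
    """
    if not response.get("success") or response.get("status") != "completed":
        raise RuntimeError(response.get("message", "Extraction did not complete"))
    return response["result"][output_format]["content"]


# Example using a trimmed copy of the response shown above
sample = {
    "success": True,
    "status": "completed",
    "result": {"markdown": {"content": "# Invoice", "metadata": {}}},
}
markdown_text = get_content(sample)
```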


Workflow 2: extract_fields (JSON)

Extract specific fields using either a field list or a JSON schema.

Option A: Field List (Simple)

bash
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@/path/to/invoice.pdf" \
  -F "output_format=json" \
  -F 'json_options=["invoice_number", "date", "total_amount", "vendor"]' \
  -F "include_metadata=confidence_score"

Response with Confidence Scores:

json
{
  "success": true,
  "message": "Extraction completed successfully",
  "record_id": "550e8400-e29b-41d4-a716-446655440001",
  "status": "completed",
  "result": {
    "json": {
      "content": {
        "invoice_number": "INV-2024-001",
        "date": "2024-01-15",
        "vendor": "Acme Corp",
        "total_amount": 500.00
      },
      "metadata": {
        "confidence_score": {
          "invoice_number": 98,
          "date": 95,
          "total_amount": 99,
          "vendor": 96
        }
      }
    }
  },
  "processing_time": 2.45
}

Access the content: response["result"]["json"]["content"]

Access confidence: response["result"]["json"]["metadata"]["confidence_score"]

Option B: JSON Schema (Typed)

bash
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@/path/to/invoice.pdf" \
  -F "output_format=json" \
  -F 'json_options={"type": "object", "properties": {"invoice_number": {"type": "string", "description": "Unique invoice ID"}, "total_amount": {"type": "number", "description": "Total amount due"}}}'
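Hand-quoting JSON inside a shell argument is easy to get wrong, so it can help to serialize `json_options` programmatically. The sketch below only produces the string values for the two forms shown above; both helper names are hypothetical.

```python
import json


def json_options_from_fields(fields):
    """Serialize a simple field list, e.g. ["invoice_number", "date"]."""
    return json.dumps(list(fields))


def json_options_from_schema(properties):
    """Wrap {name: {"type": ..., "description": ...}} in an object schema."""
    return json.dumps({"type": "object", "properties": properties})


schema_value = json_options_from_schema({
    "invoice_number": {"type": "string", "description": "Unique invoice ID"},
    "total_amount": {"type": "number", "description": "Total amount due"},
})
```

The resulting string can be sent as the `json_options` form field alongside `output_format=json`.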

Workflow 3: pdf_to_csv (Table Extraction)

Extract tables as CSV:

bash
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@/path/to/table.pdf" \
  -F "output_format=csv" \
  -F "csv_options=table"

Response:

json
{
  "success": true,
  "message": "Extraction completed successfully",
  "record_id": "550e8400-e29b-41d4-a716-446655440002",
  "status": "completed",
  "result": {
    "csv": {
      "content": "Item,Quantity,Price,Amount\nWidget A,10,$50.00,$500.00\nWidget B,5,$30.00,$150.00",
      "metadata": {}
    }
  },
  "processing_time": 1.85
}
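The CSV arrives as a single string inside the JSON response, so it still needs to be parsed before use. A minimal sketch with the standard `csv` module, assuming the first row is a header as in the sample above (`csv_rows` is a hypothetical helper):

```python
import csv
import io


def csv_rows(response):
    """Parse result["csv"]["content"] into a list of dicts keyed by header."""
    content = response["result"]["csv"]["content"]
    return list(csv.DictReader(io.StringIO(content)))


sample = {
    "result": {
        "csv": {
            "content": "Item,Quantity,Price,Amount\n"
                       "Widget A,10,$50.00,$500.00\n"
                       "Widget B,5,$30.00,$150.00",
            "metadata": {},
        }
    }
}
rows = csv_rows(sample)
```

All values come back as strings; currency symbols such as `$` need stripping before any numeric conversion.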

Async Extraction (for >5 Pages)

For documents with more than 5 pages, use the async endpoint and poll for results.
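If the page count is known up front, the endpoint choice can be made in code. This sketch assumes the caller has already determined the page count by some other means; `extraction_endpoint` and the constant names are hypothetical, while the URLs and the 5-page limit come from this document.

```python
BASE_URL = "https://extraction-api.nanonets.com/api/v1"
SYNC_PAGE_LIMIT = 5  # sync is documented for documents of up to 5 pages


def extraction_endpoint(page_count):
    """Return the sync URL for small documents and the async URL otherwise."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    mode = "sync" if page_count <= SYNC_PAGE_LIMIT else "async"
    return f"{BASE_URL}/extract/{mode}"
```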

Step 1: Queue the Document

bash
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/async" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@/path/to/large-document.pdf" \
  -F "output_format=markdown"

Response:

json
{
  "success": true,
  "message": "Extraction job queued for processing. Use the record_id to check status.",
  "record_id": "12345",
  "status": "processing",
  "result": null,
  "filename": "large-document.pdf"
}

Step 2: Poll for Results

bash
curl -X GET "https://extraction-api.nanonets.com/api/v1/extract/results/12345" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY"

Response (still processing):

json
{
  "success": false,
  "message": "Extraction is still processing. Please check back later.",
  "record_id": "12345",
  "status": "processing",
  "result": null
}

Response (completed):

json
{
  "success": true,
  "message": "Extraction completed successfully",
  "record_id": "12345",
  "status": "completed",
  "result": {
    "markdown": {
      "content": "# Document Title\n\nExtracted content...",
      "metadata": {}
    }
  },
  "processing_time": 15.5,
  "pages_processed": 100
}

Python Polling Example

python
import requests
import time

def poll_result(record_id, api_key, max_wait=300, interval=5):
    """Poll for an async extraction result until it completes, fails, or times out."""
    headers = {"Authorization": f"Bearer {api_key}"}
    url = f"https://extraction-api.nanonets.com/api/v1/extract/results/{record_id}"
    start = time.time()

    while time.time() - start < max_wait:
        # A per-request timeout keeps a stalled connection from blocking the loop
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()  # surface HTTP errors instead of parsing them
        result = response.json()

        if result["status"] == "completed":
            return result
        elif result["status"] == "failed":
            raise RuntimeError(f"Extraction failed: {result['message']}")

        # Still processing: wait before the next poll
        time.sleep(interval)

    raise TimeoutError("Extraction timed out")

Advanced Features

Bounding Boxes

Get coordinate data for each element (useful for document layout analysis):

bash
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@/path/to/document.pdf" \
  -F "output_format=markdown" \
  -F "include_metadata=bounding_boxes"

Response includes element coordinates:

json
{
  "result": {
    "markdown": {
      "content": "# Invoice...",
      "metadata": {
        "bounding_boxes": {
          "elements": [
            {
              "content": "## Page 1",
              "bounding_box": {
                "x": 0.117,
                "y": 0.072,
                "width": 0.002,
                "height": 0.002,
                "confidence": 0.98,
                "page": 1,
                "normalized": true
              }
            }
          ],
          "coordinates_normalized": true
        }
      }
    }
  }
}
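Normalized coordinates usually need scaling before they can be drawn on a rendered page. The sketch below assumes that `x` and `y` mark the top-left corner of the box and that all four values are fractions of the page width and height; the document does not state this, so verify it against real output. `to_pixels` is a hypothetical helper.

```python
def to_pixels(bounding_box, page_width, page_height):
    """Scale a normalized box to pixel coordinates (left, top, right, bottom).

    Assumes (x, y) is the top-left corner and values are fractions of the page.
    """
    if not bounding_box.get("normalized", False):
        raise ValueError("bounding box is not normalized")
    left = bounding_box["x"] * page_width
    top = bounding_box["y"] * page_height
    right = left + bounding_box["width"] * page_width
    bottom = top + bounding_box["height"] * page_height
    return tuple(round(value) for value in (left, top, right, bottom))
```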

Hierarchy Output

Extract document structure with sections, tables, and key-value pairs:

bash
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@/path/to/document.pdf" \
  -F "output_format=json" \
  -F "json_options=hierarchy_output"

Response:

json
{
  "result": {
    "json": {
      "content": {
        "document": {
          "title": "Invoice Document",
          "sections": [
            {
              "id": "page_1_section_1",
              "title": "Company Information",
              "level": 1,
              "content": "ACME CORPORATION\n123 Business Street"
            }
          ],
          "tables": [
            {
              "id": "page_1_table_1",
              "title": "Invoice Items",
              "headers": ["Item", "Quantity", "Price"],
              "rows": [["Widget A", "10", "$50.00"]]
            }
          ],
          "key_value_pairs": [
            {"key": "Invoice Number", "value": "INV-2024-001"}
          ]
        }
      }
    }
  }
}
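The hierarchy output is convenient to flatten into plain Python structures. The following sketch assumes the `document` shape shown in the sample response above; `table_records` and `key_values` are hypothetical helpers.

```python
def table_records(table):
    """Pair a table's headers with each row to get one dict per row."""
    return [dict(zip(table["headers"], row)) for row in table["rows"]]


def key_values(document):
    """Collapse key_value_pairs into one dict (later duplicates overwrite earlier ones)."""
    return {pair["key"]: pair["value"] for pair in document.get("key_value_pairs", [])}


document = {
    "tables": [
        {
            "id": "page_1_table_1",
            "title": "Invoice Items",
            "headers": ["Item", "Quantity", "Price"],
            "rows": [["Widget A", "10", "$50.00"]],
        }
    ],
    "key_value_pairs": [{"key": "Invoice Number", "value": "INV-2024-001"}],
}
records = table_records(document["tables"][0])
pairs = key_values(document)
```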

Financial Documents Mode

Optimized for financial documents with enhanced table and number formatting:

bash
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@/path/to/financial-report.pdf" \
  -F "output_format=markdown" \
  -F "markdown_options=financial-docs"

Custom Instructions

Guide the extraction with custom prompts:

bash
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@/path/to/document.pdf" \
  -F "output_format=markdown" \
  -F "custom_instructions=Focus on extracting financial data. Ignore headers and footers." \
  -F "prompt_mode=append"

Multiple Output Formats

Request multiple formats in a single call:

bash
curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@/path/to/document.pdf" \
  -F "output_format=markdown,json"

Response contains both formats:

json
{
  "result": {
    "markdown": {
      "content": "## Financial Data\n\n**Total:** $500.00",
      "metadata": {}
    },
    "json": {
      "content": {"total_amount": 500.00, "currency": "USD"},
      "metadata": {}
    }
  }
}

When to Use This Skill

Use DocStrange For:

  • Invoice and receipt processing
  • Contract text extraction
  • Bank statement parsing
  • Form digitization
  • Any document to markdown conversion
  • Image OCR (scanned documents, photos of documents)
  • Structured data extraction with validation

Don't Use For:

  • Documents >5 pages with sync endpoint (use async instead)
  • Video or audio transcription
  • Non-document image analysis (photos, artwork)
  • Real-time streaming requirements (processing takes 1-15+ seconds)

Best Practices

Choosing Sync vs Async

| Document Size | Endpoint | Notes |
|---------------|----------|-------|
| <=5 pages | /extract/sync | Immediate response |
| >5 pages | /extract/async | Poll /extract/results/{record_id} |

JSON Extraction Methods

  • Field list (simple): json_options=["invoice_number", "date", "total"]
    • Best for: Quick extractions with known fields
  • JSON schema (typed): json_options={"type": "object", "properties": {...}}
    • Best for: Strict typing, nested structures, field descriptions

Confidence Scores

  • Add include_metadata=confidence_score for critical extractions
  • Scores are 0-100 per field (not 0-1)
  • Review fields with confidence <80 manually
  • Only available with json_options (field list or schema)
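The manual-review rule above can be applied automatically to each response. A minimal sketch, assuming the response shape from Workflow 2; `fields_needing_review` is a hypothetical helper, and the default threshold of 80 mirrors the guidance above.

```python
def fields_needing_review(response, threshold=80):
    """Return {field: score} for fields whose confidence is below threshold.

    Scores are on the 0-100 scale used by the API, not 0-1.
    """
    metadata = response["result"]["json"].get("metadata", {})
    scores = metadata.get("confidence_score", {})
    return {field: score for field, score in scores.items() if score < threshold}


sample = {
    "result": {
        "json": {
            "content": {"invoice_number": "INV-2024-001", "vendor": "Acme Corp"},
            "metadata": {"confidence_score": {"invoice_number": 98, "vendor": 72}},
        }
    }
}
flagged = fields_needing_review(sample)
```

An empty result means either that every field met the threshold or that no scores were requested, so it is worth confirming that `include_metadata=confidence_score` was sent.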

Input Methods

  • Prefer file or file_url over file_base64 to save context tokens
  • Use file for local files (multipart upload)
  • Use file_url for publicly accessible URLs

Schema Templates

Invoice Schema

json
{
  "type": "object",
  "properties": {
    "invoice_number": {"type": "string", "description": "Unique invoice ID"},
    "invoice_date": {"type": "string", "description": "Invoice date in YYYY-MM-DD"},
    "vendor": {"type": "string", "description": "Vendor company name"},
    "total": {"type": "number", "description": "Total amount due including tax"},
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "quantity": {"type": "number"},
          "price": {"type": "number"}
        }
      }
    }
  }
}

Receipt Schema

json
{
  "type": "object",
  "properties": {
    "merchant": {"type": "string", "description": "Store or merchant name"},
    "date": {"type": "string", "description": "Transaction date"},
    "total": {"type": "number", "description": "Total amount paid"},
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "price": {"type": "number"}
        }
      }
    }
  }
}

Contract Schema

json
{
  "type": "object",
  "properties": {
    "parties": {
      "type": "array",
      "items": {"type": "string"},
      "description": "Names of all parties in the contract"
    },
    "effective_date": {"type": "string", "description": "Contract start date"},
    "term_length": {"type": "string", "description": "Duration of contract"},
    "termination_clause": {"type": "string", "description": "Conditions for termination"}
  }
}

Troubleshooting

Error: 400 Bad Request

  • Provide exactly one input: file, file_url, or file_base64
  • Verify API key is valid and has not expired
  • Check that the file format is supported

Sync Timeout / Slow Response

  • Use async endpoint for documents >5 pages
  • Poll /extract/results/{record_id} until status is completed

Missing Confidence Scores

  • Confidence scores require json_options (field list or schema)
  • Add include_metadata=confidence_score to your request

Empty or Partial Results

  • Check if the document is scanned/image-based (OCR may need more time)
  • For complex layouts, try markdown_options=financial-docs
  • Use custom_instructions to guide extraction

Support