docstrange

Document extraction API by Nanonets.

Installation

npx clawhub@latest install docstrange

Documentation

DocStrange by Nanonets

Document extraction API — convert PDFs, images, and documents to markdown, JSON, or CSV with per-field confidence scoring.

Get your API key:

Quick Start

curl -X POST "" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"

Response:

{
  "success": true,
  "record_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "result": {
    "markdown": {
      "content": "# Invoice\n\n**Invoice Number:** INV-2024-001..."
    }
  }
}

Setup

1. Get Your API Key

# Visit the dashboard

Save your API key:

export DOCSTRANGE_API_KEY="your_api_key_here"
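
For scripted use, the exported key can be read back from the environment and turned into the Authorization header used by the curl examples. The following is a minimal sketch; the helper name auth_headers is hypothetical and not part of DocStrange.

```python
import os


def auth_headers(env=None):
    """Build the Bearer header from DOCSTRANGE_API_KEY.

    `env` defaults to os.environ; passing a plain dict makes the
    helper easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get("DOCSTRANGE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("DOCSTRANGE_API_KEY is not set")
    return {"Authorization": f"Bearer {key}"}
```

Failing early on a missing key avoids sending a request that can only be rejected.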

2. OpenClaw Configuration (Optional)

Add to your ~/.openclaw/openclaw.json:

{
  skills: {
    entries: {
      "docstrange": {
        enabled: true,
        apiKey: "your_api_key_here",
        env: {
          DOCSTRANGE_API_KEY: "your_api_key_here",
        },
      },
    },
  },
}

Common Tasks

Extract to Markdown

curl -X POST "" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"

Access content: response["result"]["markdown"]["content"]
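
As a small illustration of that access path, the sketch below reads the markdown out of an already parsed response. It assumes the response has the shape shown under Quick Start; the function name markdown_content is hypothetical.

```python
def markdown_content(response: dict) -> str:
    """Return the markdown text from a parsed extraction response.

    Assumes the Quick Start response shape: a top-level "status"
    and the text under result -> markdown -> content.
    """
    status = response.get("status")
    if status != "completed":
        raise ValueError(f"extraction not finished (status={status!r})")
    return response["result"]["markdown"]["content"]
```

Checking the status first keeps a still-processing response from surfacing as a confusing KeyError.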

Extract JSON Fields

Simple field list:

curl -X POST "" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "output_format=json" \
  -F 'json_options=["invoice_number", "date", "total_amount", "vendor"]' \
  -F "include_metadata=confidence_score"

With JSON schema:

curl -X POST "" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "output_format=json" \
  -F 'json_options={"type": "object", "properties": {"invoice_number": {"type": "string"}, "total_amount": {"type": "number"}}}'
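
Both forms of json_options are JSON text sent as a form field, so building them with a JSON encoder avoids shell quoting mistakes. This sketch only assembles the form fields named above; the helper name json_extraction_fields is hypothetical, and sending the request is left out.

```python
import json


def json_extraction_fields(spec, with_confidence=False):
    """Build form fields for a JSON extraction request.

    `spec` is either a list of field names or a JSON schema dict,
    matching the two json_options forms shown above.
    """
    if isinstance(spec, dict):
        if spec.get("type") != "object":
            raise ValueError('schema must have "type": "object"')
    elif isinstance(spec, (list, tuple)):
        if not all(isinstance(name, str) for name in spec):
            raise ValueError("field list must contain only strings")
        spec = list(spec)
    else:
        raise TypeError("spec must be a field list or a schema dict")

    fields = {"output_format": "json", "json_options": json.dumps(spec)}
    if with_confidence:
        fields["include_metadata"] = "confidence_score"
    return fields
```

The returned dict maps one-to-one onto the -F arguments in the curl commands.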

Response with confidence scores:

{
  "result": {
    "json": {
      "content": {
        "invoice_number": "INV-2024-001",
        "total_amount": 500.00
      },
      "metadata": {
        "confidence_score": {
          "invoice_number": 98,
          "total_amount": 99
        }
      }
    }
  }
}
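
A response like this can be screened for fields that need manual review. The sketch below assumes the response shape shown above and uses the threshold of 80 recommended under Best Practices; the function name low_confidence_fields is hypothetical.

```python
def low_confidence_fields(response: dict, threshold: int = 80) -> dict:
    """Map each field scoring below `threshold` to (value, score).

    Assumes values under result -> json -> content and scores under
    result -> json -> metadata -> confidence_score.
    """
    payload = response["result"]["json"]
    content = payload.get("content", {})
    scores = payload.get("metadata", {}).get("confidence_score", {})
    return {
        name: (content.get(name), score)
        for name, score in scores.items()
        if score < threshold
    }
```

For the sample response above the result is empty, since both scores are at least 80.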

Extract Tables to CSV

curl -X POST "" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@table.pdf" \
  -F "output_format=csv" \
  -F "csv_options=table"

Async Extraction (Large Documents)

For documents longer than 5 pages, use the async endpoint and poll for the result:

Queue the document:

curl -X POST "" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@large-document.pdf" \
  -F "output_format=markdown"

# Returns: {"record_id": "12345", "status": "processing"}

Poll for results:

curl -X GET "" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY"

# Returns: {"status": "completed", "result": {...}}
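
The queue-then-poll flow can be wrapped in a small loop. In this sketch the HTTP call is injected as fetch_status, a hypothetical callable that takes a record_id and returns the parsed poll response, so the loop does not depend on any particular HTTP client. Only the "processing" and "completed" statuses shown above are assumed; any other status is treated as a failure.

```python
import time


def wait_for_result(fetch_status, record_id, interval=2.0,
                    max_attempts=30, sleep=time.sleep):
    """Poll until the record is completed, then return the response.

    fetch_status(record_id) -> dict is supplied by the caller
    (hypothetical; it should perform the GET shown above).
    """
    for attempt in range(max_attempts):
        response = fetch_status(record_id)
        status = response.get("status")
        if status == "completed":
            return response
        if status != "processing":
            # Statuses other than the two documented ones are unknown here.
            raise RuntimeError(f"unexpected status: {status!r}")
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"record {record_id} still processing")
```

The interval and attempt limit are illustrative defaults, not values taken from the API.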

Advanced Features

Bounding Boxes

Get element coordinates for layout analysis:
-F "include_metadata=bounding_boxes"

Hierarchy Output

Extract document structure (sections, tables, key-value pairs):
-F "json_options=hierarchy_output"

Financial Documents Mode

Enhanced table and number formatting:
-F "markdown_options=financial-docs"

Custom Instructions

Guide extraction with prompts:
-F "custom_instructions=Focus on financial data. Ignore headers."
-F "prompt_mode=append"

Multiple Formats

Request multiple formats in one call:
-F "output_format=markdown,json"
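
When several formats are requested, each one has to be picked out of the response. The sketch below assumes that every requested format appears under result keyed by its name with a content entry, as the single-format responses above suggest; this layout is not confirmed for multi-format calls. Both helper names are hypothetical.

```python
SUPPORTED_FORMATS = ("markdown", "json", "csv")


def output_format_value(formats):
    """Join format names into the comma-separated output_format value."""
    unknown = [f for f in formats if f not in SUPPORTED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported formats: {unknown}")
    # dict.fromkeys drops duplicates while keeping the given order.
    return ",".join(dict.fromkeys(formats))


def collect_outputs(response, formats):
    """Return {format: content} for each requested format that is present.

    Assumes result[format]["content"], which is an extrapolation
    from the single-format examples.
    """
    result = response.get("result", {})
    return {f: result[f]["content"] for f in formats if f in result}
```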

When to Use

Use DocStrange For:

  • Invoice and receipt processing
  • Contract text extraction
  • Bank statement parsing
  • Form digitization
  • Image OCR (scanned documents)

Don't Use For:

  • Documents >5 pages with sync (use async)
  • Video/audio transcription
  • Non-document images

Best Practices

Document Size    Endpoint          Notes
<=5 pages        /extract/sync     Immediate response
>5 pages         /extract/async    Poll for results

JSON Extraction:
  • Field list: ["field1", "field2"] — quick extractions
  • JSON schema: {"type": "object", ...} — strict typing, nested data

Confidence Scores:
  • Add include_metadata=confidence_score
  • Scores are 0-100 per field
  • Review fields <80 manually
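
The page-count rule above is simple enough to encode directly. This sketch returns the endpoint path for a given page count, using the two paths listed in the table; the function name choose_endpoint is hypothetical, and the base URL is deliberately left out.

```python
SYNC_PAGE_LIMIT = 5  # documents up to this many pages go to the sync endpoint


def choose_endpoint(page_count: int) -> str:
    """Return the endpoint path for a document of `page_count` pages."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count <= SYNC_PAGE_LIMIT:
        return "/extract/sync"
    return "/extract/async"
```

The caller still needs to know the page count beforehand, for example from a PDF library of their choice.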

Schema Templates

Invoice

{
  "type": "object",
  "properties": {
    "invoice_number": {"type": "string"},
    "date": {"type": "string"},
    "vendor": {"type": "string"},
    "total": {"type": "number"},
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "quantity": {"type": "number"},
          "price": {"type": "number"}
        }
      }
    }
  }
}

Receipt

{
  "type": "object",
  "properties": {
    "merchant": {"type": "string"},
    "date": {"type": "string"},
    "total": {"type": "number"},
    "items": {
      "type": "array",
      "items": {"type": "object", "properties": {"name": {"type": "string"}, "price": {"type": "number"}}}
    }
  }
}
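
Extracted content can be checked against a template before it is passed downstream. The following is a minimal sketch that covers only the schema keywords used in the templates above (object, array, string, number); it is not a full JSON Schema validator, and the function name check_types is hypothetical.

```python
def check_types(value, schema, path="$"):
    """Return a list of type mismatches between `value` and `schema`.

    Fields missing from `value` are skipped, since the templates
    above do not mark any property as required.
    """
    kind = schema.get("type")
    problems = []
    if kind == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        for name, sub in schema.get("properties", {}).items():
            if name in value:
                problems += check_types(value[name], sub, f"{path}.{name}")
    elif kind == "array":
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        for i, item in enumerate(value):
            problems += check_types(item, schema.get("items", {}), f"{path}[{i}]")
    elif kind == "string":
        if not isinstance(value, str):
            problems.append(f"{path}: expected string")
    elif kind == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{path}: expected number")
    return problems
```

An empty list means every present field matched its declared type.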

Troubleshooting

400 Bad Request:
  • Provide exactly one input: file, file_url, or file_base64
  • Verify API key is valid

Sync Timeout:
  • Use async for documents >5 pages
  • Poll /extract/results/{record_id}

Missing Confidence Scores:
  • Requires json_options (field list or schema)

