PDF Extract Text

By PdfApiHub

OpenClaw plugin to extract text from PDFs: plain text extraction, structured JSON parsing with layout blocks and tables, OCR for scanned PDFs with multi-language support, and PDF-to-Word/Excel/CSV/HTML conversion. Powered by the PDFAPIHub API. A free API key is available at pdfapihub.com.

README

# PDF Extract Text — OpenClaw Plugin

Extract text from PDFs using the [PDFAPIHub](https://pdfapihub.com) API. This OpenClaw plugin gives your AI agent 8 tools for plain text extraction, structured parsing, OCR, and format conversion.

## What It Does

Pull text content from any PDF — digital or scanned — using the best method for the job. Extract plain text, parse into structured JSON with tables and bounding boxes, OCR scanned documents in 100+ languages, or convert to Word/Excel/CSV/HTML.

### Features

- **Plain Text Extraction** — Fast text extraction from digital PDFs with page selection
- **Structured Parsing** — JSON output with layout blocks, normalized bounding boxes, tables, and image metadata
- **4 Parse Modes** — text, layout, tables, full (text + blocks + tables + images)
- **PDF OCR** — Tesseract OCR for scanned PDFs with configurable DPI (72-400)
- **Image OCR** — OCR photos of receipts, documents, and signs, with optional preprocessing
- **Multi-Language OCR** — 100+ languages (eng, hin, fra, deu, etc.); combine language codes with `+`, for example `eng+hin`
- **Word-Level Bounding Boxes** — Per-word positions and confidence scores
- **Character Whitelisting** — Restrict OCR to digits-only for invoices/meters
- **Image Preprocessing** — Grayscale, sharpen, threshold, resize for noisy inputs
- **PDF to DOCX** — Editable Word documents with formatting preserved
- **PDF to Excel** — Tables extracted into XLSX (one sheet per page)
- **PDF to CSV** — Tabular data for databases and BI tools
- **PDF to HTML** — Styled HTML for web publishing
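
The OCR options in the list above (language codes joined with `+`, DPI limited to 72-400) can be sketched as a small validation helper. This is only an illustration: the function name, the `lang` and `dpi` keys, and the default DPI of 200 are assumptions, not the plugin's documented parameters, so check the API docs for the real names.

```python
from typing import Dict, List, Union

MIN_DPI = 72   # lower bound stated in the feature list
MAX_DPI = 400  # upper bound stated in the feature list


def build_ocr_options(languages: List[str], dpi: int = 200) -> Dict[str, Union[str, int]]:
    """Build a dict of OCR options. Key names and the default DPI are hypothetical."""
    if not languages:
        raise ValueError("at least one language code is required")
    if not MIN_DPI <= dpi <= MAX_DPI:
        raise ValueError(f"dpi must be between {MIN_DPI} and {MAX_DPI}, got {dpi}")
    # Tesseract-style language strings join codes with '+', e.g. 'eng+hin'.
    codes = [code.strip().lower() for code in languages]
    return {"lang": "+".join(codes), "dpi": dpi}


if __name__ == "__main__":
    print(build_ocr_options(["eng", "hin"], dpi=300))
```

Validating on the client side like this catches an out-of-range DPI before a request is sent; the service may apply its own limits as well.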

## Tools

| Tool | Description |
|------|-------------|
| `extract_text_from_pdf` | Extract plain text from PDF pages |
| `parse_pdf` | Parse into structured JSON (text, layout blocks, tables, images) |
| `ocr_pdf` | OCR scanned PDFs with multi-language Tesseract support |
| `ocr_image` | OCR images with preprocessing |
| `pdf_to_docx` | Convert PDF to editable Word document |
| `pdf_to_excel` | Extract tables into Excel workbook |
| `pdf_to_csv` | Extract tables into CSV format |
| `pdf_to_html` | Convert PDF to styled HTML |
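
As a rough guide to how a request maps onto the tools in the table, the following sketch picks a tool name from a few properties of the input. The tool names come from the table; the `choose_tool` function and its flags are hypothetical, and in practice the agent makes this decision itself.

```python
from typing import Optional

# Output format -> conversion tool, using the tool names from the table above.
CONVERTERS = {
    "docx": "pdf_to_docx",
    "xlsx": "pdf_to_excel",
    "csv": "pdf_to_csv",
    "html": "pdf_to_html",
}


def choose_tool(is_image: bool = False, is_scanned: bool = False,
                structured: bool = False,
                target_format: Optional[str] = None) -> str:
    """Return the name of the tool that fits the request (illustrative only)."""
    if target_format is not None:
        if is_image:
            raise ValueError("the conversion tools listed take PDFs, not images")
        key = target_format.lower()
        if key not in CONVERTERS:
            raise ValueError(f"unsupported target format: {target_format!r}")
        return CONVERTERS[key]
    if is_image:
        return "ocr_image"              # photos of receipts, signs, documents
    if is_scanned:
        return "ocr_pdf"                # scanned PDFs with no text layer
    if structured:
        return "parse_pdf"              # layout blocks, tables, images as JSON
    return "extract_text_from_pdf"      # plain text from digital PDFs


if __name__ == "__main__":
    print(choose_tool(is_scanned=True))
    print(choose_tool(target_format="xlsx"))
```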

## Installation

```bash
openclaw plugins install clawhub:pdf-extract-text
```

## Configuration

Add your API key in `~/.openclaw/openclaw.json`:

```json
{
  "plugins": {
    "entries": {
      "pdf-extract-text": {
        "enabled": true,
        "env": {
          "PDFAPIHUB_API_KEY": "your-api-key-here"
        }
      }
    }
  }
}
```

Get your **free API key** at [https://pdfapihub.com](https://pdfapihub.com).
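
To confirm the configuration is in place, a quick check along these lines may help. It is a minimal sketch that assumes `~/.openclaw/openclaw.json` is plain JSON with the structure shown above and that a missing `enabled` flag means disabled; `check_plugin_config` is a hypothetical helper, not part of OpenClaw.

```python
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".openclaw" / "openclaw.json"
PLUGIN_ID = "pdf-extract-text"
PLACEHOLDER = "your-api-key-here"  # the sample value from the config above


def check_plugin_config(config: dict) -> list:
    """Return a list of problems; an empty list means the entry looks usable."""
    entry = config.get("plugins", {}).get("entries", {}).get(PLUGIN_ID)
    if entry is None:
        return [f"no entry for {PLUGIN_ID!r} under plugins.entries"]
    problems = []
    # Assumption: a missing 'enabled' flag is treated as disabled.
    if not entry.get("enabled", False):
        problems.append("plugin is not enabled")
    key = entry.get("env", {}).get("PDFAPIHUB_API_KEY", "")
    if not key or key == PLACEHOLDER:
        problems.append("PDFAPIHUB_API_KEY is missing or still the placeholder")
    return problems


if __name__ == "__main__":
    if CONFIG_PATH.exists():
        found = check_plugin_config(json.loads(CONFIG_PATH.read_text()))
        print(found if found else "config looks OK")
    else:
        print(f"{CONFIG_PATH} not found")
```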

## Usage Examples

Just ask your OpenClaw agent:

- *"Extract all text from this PDF"*
- *"Parse the tables from this invoice"*
- *"OCR this scanned document in English and Hindi"*
- *"Convert this PDF to an Excel spreadsheet"*
- *"Extract text from this receipt photo"*
- *"Convert pages 1-3 to a Word document"*
- *"Get the structured layout with bounding boxes"*

## Use Cases

- **Invoice Parsing** — Extract line items, totals, and vendor info from PDF invoices
- **Resume Parsing** — Extract name, experience, and skills from PDF resumes
- **Full-Text Search** — Extract text for indexing in search engines
- **AI/LLM Processing** — Feed PDF text into language models or chatbots
- **Financial Data** — Extract tables from bank statements into Excel/CSV
- **Receipt Scanning** — OCR receipts and invoices for expense tracking
- **Document Digitization** — Convert scanned legacy documents into searchable text
- **Content Migration** — Pull text from PDFs for migration to new systems
- **Translation Workflows** — Convert PDFs to DOCX for easier translation

## API Documentation

Full API docs: [https://pdfapihub.com/docs](https://pdfapihub.com/docs)

## License

MIT