Pdf Extract Text
OpenClaw plugin to extract text from PDFs: plain text extraction, structured JSON parsing with layout blocks and tables, OCR for scanned PDFs with multi-language support, and PDF-to-Word/Excel/CSV/HTML conversion. Powered by the PDFAPIHub API. Get a free API key at pdfapihub.com.
# PDF Extract Text — OpenClaw Plugin
Extract text from PDFs using the [PDFAPIHub](https://pdfapihub.com) API. This OpenClaw plugin gives your AI agent 8 tools for plain text extraction, structured parsing, OCR, and format conversion.
## What It Does
Pull text content from any PDF — digital or scanned — using the best method for the job. Extract plain text, parse into structured JSON with tables and bounding boxes, OCR scanned documents in 100+ languages, or convert to Word/Excel/CSV/HTML.
### Features
- **Plain Text Extraction** — Fast text extraction from digital PDFs with page selection
- **Structured Parsing** — JSON output with layout blocks, normalized bounding boxes, tables, and image metadata
- **4 Parse Modes** — text, layout, tables, full (text + blocks + tables + images)
- **PDF OCR** — Tesseract OCR for scanned PDFs with configurable DPI (72-400)
- **Image OCR** — OCR photos of receipts, documents, signs with preprocessing
- **Multi-Language OCR** — 100+ languages (eng, hin, fra, deu, etc.), combine with `+`
- **Word-Level Bounding Boxes** — Per-word positions and confidence scores
- **Character Whitelisting** — Restrict OCR to digits-only for invoices/meters
- **Image Preprocessing** — Grayscale, sharpen, threshold, resize for noisy inputs
- **PDF to DOCX** — Editable Word documents with formatting preserved
- **PDF to Excel** — Tables extracted into XLSX (one sheet per page)
- **PDF to CSV** — Tabular data for databases and BI tools
- **PDF to HTML** — Styled HTML for web publishing
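
The OCR-related features above (language codes combined with `+`, a DPI range of 72-400, digits-only whitelisting) can be illustrated with a small validation helper. This is a minimal sketch, not the plugin's actual interface: the function name, the option keys (`lang`, `dpi`, `char_whitelist`), and the default DPI of 300 are all assumptions made for illustration.

```python
def build_ocr_options(languages, dpi=300, digits_only=False):
    """Assemble a hypothetical OCR options dict from the listed features.

    The key names and the default DPI are illustrative assumptions; check the
    PDFAPIHub docs for the real parameter names.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # The feature list gives a configurable DPI range of 72-400.
    if not 72 <= dpi <= 400:
        raise ValueError(f"dpi must be between 72 and 400, got {dpi}")

    options = {
        # Language codes are combined with "+", e.g. ["eng", "hin"] -> "eng+hin".
        "lang": "+".join(languages),
        "dpi": dpi,
    }
    if digits_only:
        # Restrict recognition to digits, e.g. for invoice numbers or meters.
        options["char_whitelist"] = "0123456789"
    return options


print(build_ocr_options(["eng", "hin"]))
```

Validating the range before a request is sent keeps an out-of-range DPI from turning into a failed API call.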
## Tools
| Tool | Description |
|------|-------------|
| `extract_text_from_pdf` | Extract plain text from PDF pages |
| `parse_pdf` | Parse into structured JSON (text, layout blocks, tables, images) |
| `ocr_pdf` | OCR scanned PDFs with multi-language Tesseract support |
| `ocr_image` | OCR images with preprocessing |
| `pdf_to_docx` | Convert PDF to editable Word document |
| `pdf_to_excel` | Extract tables into Excel workbook |
| `pdf_to_csv` | Extract tables into CSV format |
| `pdf_to_html` | Convert PDF to styled HTML |
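
In practice the agent chooses a tool from your request, but the mapping in the table can be sketched as a lookup. The tool names below come from the table; the `pick_tool` function, its parameters, and the output-format keys are hypothetical and exist only to show which tool fits which job.

```python
# Tool names are taken from the table above; the keys are illustrative.
TOOLS_BY_OUTPUT = {
    "text": "extract_text_from_pdf",
    "json": "parse_pdf",
    "docx": "pdf_to_docx",
    "xlsx": "pdf_to_excel",
    "csv": "pdf_to_csv",
    "html": "pdf_to_html",
}


def pick_tool(output, scanned=False, is_image=False):
    """Return the tool name that matches the input type and desired output."""
    if is_image:
        # Photos of receipts, signs, and similar inputs go through image OCR.
        return "ocr_image"
    if scanned and output == "text":
        # Scanned PDFs have no text layer, so plain extraction needs OCR.
        return "ocr_pdf"
    try:
        return TOOLS_BY_OUTPUT[output]
    except KeyError:
        raise ValueError(f"unsupported output format: {output!r}") from None


print(pick_tool("xlsx"))
print(pick_tool("text", scanned=True))
```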
## Installation
```bash
openclaw plugins install clawhub:pdf-extract-text
```
## Configuration
Add your API key in `~/.openclaw/openclaw.json`:
```json
{
"plugins": {
"entries": {
"pdf-extract-text": {
"enabled": true,
"env": {
"PDFAPIHUB_API_KEY": "your-api-key-here"
}
}
}
}
}
```
Get your **free API key** at [https://pdfapihub.com](https://pdfapihub.com).
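
To confirm the entry is in place, a short script can inspect the config. This is a minimal sketch that assumes the file is strict JSON with the structure shown above; the helper names are hypothetical, and treating a missing `enabled` flag as "not enabled" is an assumption rather than documented OpenClaw behavior.

```python
import json
from pathlib import Path

PLUGIN_ID = "pdf-extract-text"
PLACEHOLDER = "your-api-key-here"


def api_key_problem(config):
    """Return a short description of what is wrong, or None if all is fine."""
    entry = config.get("plugins", {}).get("entries", {}).get(PLUGIN_ID)
    if entry is None:
        return f"no entry for {PLUGIN_ID} under plugins.entries"
    # Assumption: anything other than an explicit true counts as disabled.
    if entry.get("enabled") is not True:
        return "plugin entry is not enabled"
    key = entry.get("env", {}).get("PDFAPIHUB_API_KEY", "")
    if not key or key == PLACEHOLDER:
        return "PDFAPIHUB_API_KEY is missing or still the placeholder"
    return None


def check_config_file(path=None):
    """Load the config file (strict JSON assumed) and check the plugin entry."""
    path = Path(path) if path else Path.home() / ".openclaw" / "openclaw.json"
    return api_key_problem(json.loads(path.read_text(encoding="utf-8")))
```

Calling `check_config_file()` with no arguments reads `~/.openclaw/openclaw.json` and returns `None` when the key is set.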
## Usage Examples
Just ask your OpenClaw agent:
- *"Extract all text from this PDF"*
- *"Parse the tables from this invoice"*
- *"OCR this scanned document in English and Hindi"*
- *"Convert this PDF to an Excel spreadsheet"*
- *"Extract text from this receipt photo"*
- *"Convert pages 1-3 to a Word document"*
- *"Get the structured layout with bounding boxes"*
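
For the bounding-box request, the features list says structured parsing returns normalized bounding boxes. Assuming "normalized" means coordinates in the 0-1 range given as `(x0, y0, x1, y1)` relative to the page size (an assumption; the real field layout is defined in the API docs), converting a box to pixel coordinates for a rendered page might look like this:

```python
def to_pixels(bbox, page_width, page_height):
    """Scale a normalized (x0, y0, x1, y1) box to integer pixel coordinates.

    Assumes each coordinate lies in [0, 1] relative to the page dimensions.
    """
    x0, y0, x1, y1 = bbox
    for value in (x0, y0, x1, y1):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"coordinate {value} is outside the 0-1 range")
    return (
        round(x0 * page_width),
        round(y0 * page_height),
        round(x1 * page_width),
        round(y1 * page_height),
    )


# A box covering the upper-left region of a 1000 x 2000 pixel page image.
print(to_pixels((0.1, 0.2, 0.5, 0.6), 1000, 2000))
```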
## Use Cases
- **Invoice Parsing** — Extract line items, totals, and vendor info from PDF invoices
- **Resume Parsing** — Extract name, experience, and skills from PDF resumes
- **Full-Text Search** — Extract text for indexing in search engines
- **AI/LLM Processing** — Feed PDF text into language models or chatbots
- **Financial Data** — Extract tables from bank statements into Excel/CSV
- **Receipt Scanning** — OCR receipts and invoices for expense tracking
- **Document Digitization** — Convert scanned legacy documents into searchable text
- **Content Migration** — Pull text from PDFs for migration to new systems
- **Translation Workflows** — Convert PDFs to DOCX for easier translation
## API Documentation
Full API docs: [https://pdfapihub.com/docs](https://pdfapihub.com/docs)
## License
MIT