Tools
Ark Kb
๐๏ธ Ark Knowledge Base โ Local RAG plugin for OpenClaw. 75+ file formats, multi-modal embedding, LanceDB vector store. npm: njuboy11-ark-kb
Install
npm install ark-kb
README
# ๐๏ธ Ark KB
> **Ark Knowledge Base** โ Drop files in a folder. Everything else is automatic.
A personal knowledge base for AI assistants. Powered by **LanceDB** + **MinerU** + **Multimodal Embedding**. Auto-detects file types, auto-parses, auto-chunks, auto-embeds, auto-indexes. 90+ file formats, 14-file codebase, ~7,000 lines.
---
## โจ What It Does
```
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Your Knowledge Folder โ
โ โ
โ ๐ report.pdf ๐ data.xlsx ๐ฝ๏ธ deck.pptx โ
โ ๐ notes.md ๐ผ๏ธ photo.jpg ๐ฌ demo.mp4 โ
โ ๐ contract.docx ๐ง email attachments โ
โ โ
โ โ fs.watch (real-time) โ
โ โ detectFileKind (90+ formats) โ
โ โ complexity detection (Office XML) โ
โ โ MinerU / VLM / direct extract โ
โ โ chunkText (paragraph / fixed / sentence) โ
โ โ embed (vectorize) โ
โ โ upsert to LanceDB โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
You ask: "What was the Q2 revenue forecast?"
โ
I search: semantic hybrid search โ find the .xlsx chunk โ show results
```
---
## ๐ Supported Formats (90+)
### ๐ Text (46 formats)
`.md` `.txt` `.csv` `.html` `.htm` `.json` `.yaml` `.yml` `.xml` `.toml`
`.py` `.js` `.ts` `.jsx` `.tsx` `.java` `.c` `.cpp` `.h` `.go` `.rs` `.rb` `.php` `.sh` `.bash` `.sql` `.r` `.scala` `.lua`
`.css` `.scss` `.less` `.vue` `.swift` `.kt` `.dart`
`.log` `.conf` `.cfg` `.ini` `.env` `.tex` `.rst` `.org` `.adoc`
### ๐ข Office (6 formats)
| Format | Simple | Complex | Binary |
|--------|--------|---------|--------|
| **Word** | `.docx` (text only) โ `pipeline` | `.docx` (formulas/images) โ `vlm` | `.doc` โ `vlm` |
| **Excel** | `.xlsx` (data only) โ `pipeline` | `.xlsx` (charts/formulas) โ `vlm` | `.xls` โ `vlm` |
| **PPT** | `.pptx` (text only) โ `pipeline` | `.pptx` (charts/animations) โ `vlm` | `.ppt` โ `vlm` |
> **Complexity detection**: For `.docx` / `.xlsx` / `.pptx`, the plugin reads the ZIP-internal XML to detect formulas, charts, images, pivot tables, animations, etc. (10 markers each). Complex files automatically route to `vlm` model for higher accuracy.
### ๐ PDF
`.pdf` โ MinerU vlm (formulas, tables, images all extracted)
### ๐ HTML
`.html` `.htm` โ MinerU `MinerU-HTML` model (structured extraction)
### ๐ผ๏ธ Images (17 formats)
`.png` `.jpg` `.jpeg` `.jfif` `.webp` `.gif` `.bmp` `.svg` `.tiff` `.tif` `.ico` `.heic` `.heif` `.raw` `.cr2` `.nef` `.arw`
โ VLM summary โ text embedding (configurable to multimodal direct embedding)
### ๐ฌ Video (11 formats)
`.mp4` `.mov` `.avi` `.mkv` `.webm` `.wmv` `.flv` `.m4v` `.3gp` `.ogv` `.ts`
โ VLM frame analysis โ summary โ text embedding
### ๐ง Email Ingestion
IMAP-based (imapflow), auto-polls inbox, extracts `.txt` `.md` `.pdf` `.doc` `.docx` `.ppt` `.pptx` `.xls` `.xlsx` `.html` `.htm` `.png` `.jpg` `.jpeg` `.gif` `.svg` `.webp` `.bmp` attachments.
---
## ๐ง How It Works
### Ingestion Pipeline
```
file dropped / email received
โ
detectFileKind() โ FileKind
โ
โโ text / code โ direct chunking
โโ pdf โ MinerU API v4 (precision parsing)
โโ docx/pptx/xlsx โ isComplex() โ pipeline | vlm โ MinerU
โโ doc/ppt/xls โ MinerU vlm (binary)
โโ html โ MinerU MinerU-HTML
โโ image โ VLM describe โ text embed (or multimodal)
โโ video โ VLM frame โ summary โ text embed
โ
chunkText() (paragraph / fixed / sentence strategy)
โ
embedder.embed() โ vectors
โ
store.upsert() โ LanceDB
```
### Chunking Strategies
| Strategy | Behavior |
|----------|----------|
| `paragraph` | Split on blank lines, merge up to maxTokens |
| `fixed` | Fixed-size window with overlap |
| `sentence` | Split on sentence-ending punctuation |
### Hash Deduplication
SHA256-based content dedup with hash-level locking to prevent concurrent ingestion of identical files with different names.
### UID-based Email Tracking
Persists `lastProcessedUID` to avoid re-processing emails across restarts. Failed emails go to retry queue tracked by UID.
---
## ๐ ๏ธ OpenClaw Tools
| Tool | Description |
|------|-------------|
| `kb_search` | Hybrid semantic + keyword search with optional fileType filter |
| `kb_ingest` | Manually trigger indexing of a file or all files |
| `kb_remove` | Remove indexed chunks for a given source file |
| `kb_status` | Show total chunks, indexed files, config summary |
---
## ๐ Architecture
```
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Storage Layer โ
โ /path/to/knowledge/ (filesystem) โ
โ ~/.ark-kb/* (state, LanceDB) โ
โโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโ
โ Index Layer โ
โ LanceDB (embedded, zero-ops) โ
โ IVF-PQ ANN search, millisecond โ
โโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโ
โ Retrieval Layer โ
โ Hybrid: vector + BM25 โ
โ Optional: reranking (BGE-m3) โ
โ fileType filter, pagination โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
```
---
## ๐ฉ Technical Stack
| Component | Technology |
|---|---|
| Vector DB | **LanceDB** (embedded, zero-ops) |
| PDF/Office parsing | **MinerU API v4** (precision parsing + complexity routing) |
| Embedding | Configurable (MiniMax / Qwen / etc.) |
| Image understanding | **VLM** (MiniMax-VL / Qwen-VL) |
| Video summarization | **VLM** frame analysis |
| File watching | **fs.watch** + debounce |
| Email | **imapflow** (IMAP, configurable polling) |
| Search algo | **IVF-PQ ANN** + **BM25** keyword |
| Chunking | paragraph / fixed / sentence strategies |
| Dedup | SHA256 content hash + hash-level locking |
| Runtime | Node.js / TypeScript |
---
## ๐ Quick Start
```bash
# 1. Install
npm install ark-kb
# 2. Configure (plugin-config.json)
{
"knowledgePath": "/path/to/knowledge",
"embedding": { "model": "qwen/Qwen3-VL-Embedding-8B" },
"pdfParser": {
"api": "mineru",
"endpoint": "https://mineru.net/api/v4/extract/task",
"apiKey": "your-mineru-token"
}
}
# 3. Drop files into /path/to/knowledge
# โ Auto-indexed. No action needed.
# 4. Search via OpenClaw tool
kb_search(query="Q2 revenue forecast", fileType="xlsx")
```
---
## ๐ License
AGPL v3 ยฉ [njuboy11](https://github.com/njuboy11)
> **A small ark that holds your world.** ๐๏ธ
tools
Comments
Sign in to leave a comment