
Ark Kb

By njuboy11

🏛️ Ark Knowledge Base: Local RAG plugin for OpenClaw. 75+ file formats, multi-modal embedding, LanceDB vector store. npm: njuboy11-ark-kb


Install

npm install ark-kb

README

# 🏛️ Ark KB

> **Ark Knowledge Base**: Drop files in a folder. Everything else is automatic.

A personal knowledge base for AI assistants. Powered by **LanceDB** + **MinerU** + **Multimodal Embedding**. Auto-detects file types, auto-parses, auto-chunks, auto-embeds, auto-indexes. 90+ file formats, 14-file codebase, ~7,000 lines.

---

## ✨ What It Does

```
┌──────────────────────────────────────────────────────┐
│              Your Knowledge Folder                   │
│                                                      │
│  📄 report.pdf      📊 data.xlsx     📽️ deck.pptx   │
│  📝 notes.md        🖼️ photo.jpg     🎬 demo.mp4    │
│  📄 contract.docx   📧 email attachments             │
│                                                      │
│         ↓  fs.watch  (real-time)                     │
│         ↓  detectFileKind  (90+ formats)             │
│         ↓  complexity detection (Office XML)         │
│         ↓  MinerU / VLM / direct extract             │
│         ↓  chunkText  (paragraph / fixed / sentence) │
│         ↓  embed  (vectorize)                        │
│         ↓  upsert to LanceDB                         │
└──────────────────────────────────────────────────────┘

You ask: "What was the Q2 revenue forecast?"
   ↓
I search: semantic hybrid search → find the .xlsx chunk → show results
```

---

## 📂 Supported Formats (90+)

### 📝 Text (46 formats)
`.md` `.txt` `.csv` `.html` `.htm` `.json` `.yaml` `.yml` `.xml` `.toml`

`.py` `.js` `.ts` `.jsx` `.tsx` `.java` `.c` `.cpp` `.h` `.go` `.rs` `.rb` `.php` `.sh` `.bash` `.sql` `.r` `.scala` `.lua`

`.css` `.scss` `.less` `.vue` `.swift` `.kt` `.dart`

`.log` `.conf` `.cfg` `.ini` `.env` `.tex` `.rst` `.org` `.adoc`

### 🏢 Office (6 formats)

| Format | Simple | Complex | Binary |
|--------|--------|---------|--------|
| **Word** | `.docx` (text only) → `pipeline` | `.docx` (formulas/images) → `vlm` | `.doc` → `vlm` |
| **Excel** | `.xlsx` (data only) → `pipeline` | `.xlsx` (charts/formulas) → `vlm` | `.xls` → `vlm` |
| **PPT** | `.pptx` (text only) → `pipeline` | `.pptx` (charts/animations) → `vlm` | `.ppt` → `vlm` |

> **Complexity detection**: For `.docx` / `.xlsx` / `.pptx`, the plugin reads the ZIP-internal XML to detect formulas, charts, images, pivot tables, animations, etc. (10 markers each). Complex files are automatically routed to the `vlm` model for higher accuracy.
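
The routing decision described above can be sketched roughly as follows. This is an illustration under assumptions, not the plugin's code: it presumes the relevant XML part has already been read out of the ZIP container as a string, and the marker strings are a small hypothetical subset, since the actual 10 markers per format are not listed here. The function signatures are also assumptions.

```typescript
// Hypothetical marker subset per format; the plugin's real marker list is not documented.
type OfficeKind = "docx" | "xlsx" | "pptx";
type MinerUModel = "pipeline" | "vlm";

const COMPLEX_MARKERS: Record<OfficeKind, string[]> = {
  docx: ["<m:oMath", "<w:drawing", "<w:object"],          // math, drawings, embedded objects
  xlsx: ["<f>", "<drawing", "<pivotTableDefinition"],     // formulas, drawings, pivot tables
  pptx: ["<p:timing", "<c:chart", "<p:pic"],              // animations, charts, pictures
};

// True if any marker for this format occurs in the extracted XML text.
function isComplex(kind: OfficeKind, xml: string): boolean {
  return COMPLEX_MARKERS[kind].some((marker) => xml.includes(marker));
}

// Simple files go to the `pipeline` model, complex ones to `vlm`.
function chooseModel(kind: OfficeKind, xml: string): MinerUModel {
  return isComplex(kind, xml) ? "vlm" : "pipeline";
}
```

A plain substring scan keeps the check cheap compared with a full XML parse, at the cost of occasional false positives.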

### 📄 PDF
`.pdf` → MinerU vlm (formulas, tables, images all extracted)

### 🌐 HTML
`.html` `.htm` → MinerU `MinerU-HTML` model (structured extraction)

### 🖼️ Images (17 formats)
`.png` `.jpg` `.jpeg` `.jfif` `.webp` `.gif` `.bmp` `.svg` `.tiff` `.tif` `.ico` `.heic` `.heif` `.raw` `.cr2` `.nef` `.arw`

→ VLM summary → text embedding (configurable to multimodal direct embedding)

### 🎬 Video (11 formats)
`.mp4` `.mov` `.avi` `.mkv` `.webm` `.wmv` `.flv` `.m4v` `.3gp` `.ogv` `.ts`

→ VLM frame analysis → summary → text embedding

### 📧 Email Ingestion
IMAP-based (imapflow), auto-polls inbox, extracts `.txt` `.md` `.pdf` `.doc` `.docx` `.ppt` `.pptx` `.xls` `.xlsx` `.html` `.htm` `.png` `.jpg` `.jpeg` `.gif` `.svg` `.webp` `.bmp` attachments.

---

## 🔧 How It Works

### Ingestion Pipeline

```
file dropped / email received
    ↓
detectFileKind() → FileKind
    ↓
├─ text / code   → direct chunking
├─ pdf           → MinerU API v4 (precision parsing)
├─ docx/pptx/xlsx → isComplex() → pipeline | vlm → MinerU
├─ doc/ppt/xls   → MinerU vlm (binary)
├─ html          → MinerU MinerU-HTML
├─ image         → VLM describe → text embed (or multimodal)
├─ video         → VLM frame → summary → text embed
    ↓
chunkText() (paragraph / fixed / sentence strategy)
    ↓
embedder.embed() → vectors
    ↓
store.upsert() → LanceDB
```
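
The branching step of the pipeline above could look something like the sketch below. The route names, the `routeFor` function, and the abbreviated extension lists are hypothetical and chosen for illustration; the plugin's own `detectFileKind()` signature is not shown in this README. Two tie-breaks are assumptions as well: `.html` is sent to the HTML route although it also appears under Text, and `.ts` is treated as TypeScript source rather than a video transport stream.

```typescript
// Route names and extension sets are illustrative; only a few extensions are listed.
type Route =
  | "direct-chunk"
  | "mineru-pdf"
  | "mineru-office" // a complexity check then picks pipeline or vlm
  | "mineru-vlm-binary"
  | "mineru-html"
  | "vlm-image"
  | "vlm-video"
  | "unsupported";

// Checked in order, so earlier entries win when an extension is ambiguous.
const ROUTES: Array<[Route, string[]]> = [
  ["mineru-pdf", [".pdf"]],
  ["mineru-office", [".docx", ".pptx", ".xlsx"]],
  ["mineru-vlm-binary", [".doc", ".ppt", ".xls"]],
  ["mineru-html", [".html", ".htm"]],
  ["direct-chunk", [".md", ".txt", ".py", ".ts", ".json"]],
  ["vlm-image", [".png", ".jpg", ".jpeg", ".webp"]],
  ["vlm-video", [".mp4", ".mov", ".mkv"]],
];

function routeFor(fileName: string): Route {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot).toLowerCase() : "";
  for (const [route, extensions] of ROUTES) {
    if (extensions.includes(ext)) return route;
  }
  return "unsupported";
}
```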

### Chunking Strategies
| Strategy | Behavior |
|----------|----------|
| `paragraph` | Split on blank lines, merge up to maxTokens |
| `fixed` | Fixed-size window with overlap |
| `sentence` | Split on sentence-ending punctuation |
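
To make the three strategies concrete, here is a minimal sketch of each. It is not the plugin's `chunkText()` implementation: the function names are hypothetical, token counts are approximated by whitespace-separated words, and the sentence splitter only handles ASCII `.`, `!`, `?` followed by whitespace.

```typescript
// Assumption: one "token" is one whitespace-separated word.
const countTokens = (s: string): number => s.split(/\s+/).filter(Boolean).length;

// paragraph: split on blank lines, then merge neighbours while under maxTokens.
function chunkByParagraph(text: string, maxTokens: number): string[] {
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current = "";
  for (const p of paragraphs) {
    const candidate = current ? `${current}\n\n${p}` : p;
    if (current && countTokens(candidate) > maxTokens) {
      chunks.push(current);
      current = p;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// fixed: sliding window of `size` words, consecutive windows share `overlap` words.
function chunkFixed(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError("need size > 0 and 0 <= overlap < size");
  }
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += size - overlap) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // last window reached the end
  }
  return chunks;
}

// sentence: split after sentence-ending punctuation that is followed by whitespace.
function chunkBySentence(text: string): string[] {
  return text.split(/(?<=[.!?])\s+/).map((s) => s.trim()).filter(Boolean);
}
```

A single paragraph longer than `maxTokens` is kept whole in this sketch; a real implementation would likely fall back to one of the other strategies for it.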

### Hash Deduplication
SHA256-based content dedup with hash-level locking to prevent concurrent ingestion of identical files with different names.
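
One way to picture hash-level locking is the sketch below. It is an assumption-laden illustration rather than the plugin's code: the `DedupIngestor` class, its `submit` method, and the in-memory `Set`/`Map` are hypothetical, and a real implementation would persist the seen hashes.

```typescript
import { createHash } from "node:crypto";

type Content = string | Uint8Array;

// Hex-encoded SHA-256 of the file content (the name plays no part in the hash).
function sha256Hex(content: Content): string {
  return createHash("sha256").update(content).digest("hex");
}

class DedupIngestor {
  private readonly seen = new Set<string>();                        // hashes already indexed
  private readonly inFlight = new Map<string, Promise<boolean>>();  // one lock per hash
  private readonly ingest: (name: string, content: Content) => Promise<void>;

  constructor(ingest: (name: string, content: Content) => Promise<void>) {
    this.ingest = ingest;
  }

  // Resolves to true if this call indexed the content, false if it was a duplicate.
  async submit(name: string, content: Content): Promise<boolean> {
    const hash = sha256Hex(content);
    if (this.seen.has(hash)) return false;

    // Same content is being ingested right now under another name: wait, do not repeat.
    const pending = this.inFlight.get(hash);
    if (pending) return pending.then(() => false);

    const task = this.ingest(name, content)
      .then(() => {
        this.seen.add(hash);
        return true;
      })
      .finally(() => {
        this.inFlight.delete(hash);
      });
    this.inFlight.set(hash, task);
    return task;
  }
}
```

Keying the lock on the hash rather than the path is what stops `a.pdf` and `copy-of-a.pdf` from being embedded twice when both land in the folder at the same moment.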

### UID-based Email Tracking
Persists `lastProcessedUID` so that emails are not re-processed across restarts. Failed emails go to a retry queue tracked by UID.
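
The bookkeeping behind this can be sketched as two pure functions. Only `lastProcessedUID` comes from the README; the `MailState` shape, the `retryQueue` field name, and both function names are hypothetical, and reading or writing the state file is left out.

```typescript
// Assumed shape of the persisted state; only lastProcessedUID is named in the README.
interface MailState {
  lastProcessedUID: number;
  retryQueue: number[];
}

// UIDs to handle in this poll: earlier failures first, then anything newer than the watermark.
function planBatch(state: MailState, inboxUids: number[]): number[] {
  const fresh = inboxUids.filter((uid) => uid > state.lastProcessedUID);
  const merged = new Set<number>(state.retryQueue.concat(fresh));
  return Array.from(merged).sort((a, b) => a - b);
}

// Returns the next state after one email was processed (ok) or failed (!ok).
function recordResult(state: MailState, uid: number, ok: boolean): MailState {
  const retry = new Set<number>(state.retryQueue);
  if (ok) {
    retry.delete(uid);
  } else {
    retry.add(uid);
  }
  return {
    lastProcessedUID: Math.max(state.lastProcessedUID, uid),
    retryQueue: Array.from(retry).sort((a, b) => a - b),
  };
}
```

Because the watermark advances even on failure, a broken email cannot block newer ones; it is simply revisited from the retry queue on the next poll.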

---

## 🛠️ OpenClaw Tools

| Tool | Description |
|------|-------------|
| `kb_search` | Hybrid semantic + keyword search with optional fileType filter |
| `kb_ingest` | Manually trigger indexing of a file or all files |
| `kb_remove` | Remove indexed chunks for a given source file |
| `kb_status` | Show total chunks, indexed files, config summary |

---

## 📐 Architecture

```
┌──────────────────────────────────────┐
│          Storage Layer               │
│  /path/to/knowledge/ (filesystem)    │
│  ~/.ark-kb/* (state, LanceDB)        │
└──────────────┬───────────────────────┘
               │
┌──────────────▼───────────────────────┐
│          Index Layer                 │
│  LanceDB (embedded, zero-ops)        │
│  IVF-PQ ANN search, millisecond      │
└──────────────┬───────────────────────┘
               │
┌──────────────▼───────────────────────┐
│          Retrieval Layer             │
│  Hybrid: vector + BM25               │
│  Optional: reranking (BGE-m3)        │
│  fileType filter, pagination         │
└──────────────────────────────────────┘
```
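
The diagram says the retrieval layer combines vector and BM25 results but not how the two rankings are merged. One common technique is reciprocal rank fusion (RRF), sketched below as a stand-in; it should not be read as the plugin's actual method. The function name and the chunk ids are hypothetical, and `k = 60` is simply the conventional default for RRF.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item, rank starting at 1.
function reciprocalRankFusion(rankings: string[][], k = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries()).sort((a, b) => b[1] - a[1]);
}

// Hypothetical chunk ids returned by the two searches, best match first.
const vectorHits = ["c1", "c2", "c3"];
const bm25Hits = ["c2", "c4", "c1"];
const fusedIds = reciprocalRankFusion([vectorHits, bm25Hits]).map(([id]) => id);
console.log(fusedIds); // [ 'c2', 'c1', 'c4', 'c3' ]
```

RRF only needs ranks, so it sidesteps the problem that cosine distances and BM25 scores live on different scales.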

---

## 🔩 Technical Stack

| Component | Technology |
|---|---|
| Vector DB | **LanceDB** (embedded, zero-ops) |
| PDF/Office parsing | **MinerU API v4** (precision parsing + complexity routing) |
| Embedding | Configurable (MiniMax / Qwen / etc.) |
| Image understanding | **VLM** (MiniMax-VL / Qwen-VL) |
| Video summarization | **VLM** frame analysis |
| File watching | **fs.watch** + debounce |
| Email | **imapflow** (IMAP, configurable polling) |
| Search algo | **IVF-PQ ANN** + **BM25** keyword |
| Chunking | paragraph / fixed / sentence strategies |
| Dedup | SHA256 content hash + hash-level locking |
| Runtime | Node.js / TypeScript |
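
The "fs.watch + debounce" row can be illustrated with the sketch below, since a single file save often fires several change events. The `PathDebouncer` class, the `watchFolder` helper, and the 500 ms quiet period are assumptions made for this example, not the plugin's actual code or defaults.

```typescript
import { watch } from "node:fs";

// Collects change events and releases a path only after it has been quiet for quietMs.
class PathDebouncer {
  private readonly lastSeen = new Map<string, number>();
  private readonly quietMs: number;

  constructor(quietMs: number) {
    this.quietMs = quietMs;
  }

  // Record (or refresh) the time of the latest event for this path.
  touch(path: string, now: number): void {
    this.lastSeen.set(path, now);
  }

  // Return and forget every path whose last event is at least quietMs old.
  drain(now: number): string[] {
    const ready: string[] = [];
    this.lastSeen.forEach((seenAt, path) => {
      if (now - seenAt >= this.quietMs) ready.push(path);
    });
    ready.forEach((path) => this.lastSeen.delete(path));
    return ready;
  }
}

// Wires the debouncer to fs.watch; call the returned function to stop watching.
function watchFolder(dir: string, onReady: (path: string) => void, quietMs = 500): () => void {
  const debouncer = new PathDebouncer(quietMs);
  const watcher = watch(dir, (_eventType, fileName) => {
    if (fileName) debouncer.touch(String(fileName), Date.now());
  });
  const timer = setInterval(() => {
    debouncer.drain(Date.now()).forEach((path) => onReady(path));
  }, quietMs);
  return () => {
    watcher.close();
    clearInterval(timer);
  };
}
```

Keeping the timing logic in a class that takes `now` as a parameter makes it testable without real timers or a real folder.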

---

## 🚀 Quick Start

```bash
# 1. Install
npm install ark-kb

# 2. Configure: write plugin-config.json
#    (written to the current directory here; put it wherever your OpenClaw setup reads plugin config)
cat > plugin-config.json <<'EOF'
{
  "knowledgePath": "/path/to/knowledge",
  "embedding": { "model": "qwen/Qwen3-VL-Embedding-8B" },
  "pdfParser": {
    "api": "mineru",
    "endpoint": "https://mineru.net/api/v4/extract/task",
    "apiKey": "your-mineru-token"
  }
}
EOF

# 3. Drop files into /path/to/knowledge
#    → Auto-indexed. No action needed.

# 4. Search via the OpenClaw tool (a tool call, not a shell command):
#    kb_search(query="Q2 revenue forecast", fileType="xlsx")
```

---

## 📝 License

AGPL v3 © [njuboy11](https://github.com/njuboy11)

> **A small ark that holds your world.** 🏛️