Plugin Doc Engine
Semantic documentation engine plugin for OpenClaw: TF-IDF search across multi-repo docs with incremental indexing
# OpenClaw Plugin: Doc Engine
**Multi-repo semantic documentation retrieval for OpenClaw.** Indexes your Markdown documentation across multiple repositories and provides fast, offline semantic search via TF-IDF vectorization. No external APIs required.
---
## Installation
### Quick Install (Recommended)
```bash
curl -fsSL https://raw.githubusercontent.com/Muminur/openclaw-plugin-doc-engine/main/scripts/install.sh | bash
```
### Manual Installation
```bash
git clone https://github.com/Muminur/openclaw-plugin-doc-engine.git ~/.openclaw/plugins/doc-engine
cd ~/.openclaw/plugins/doc-engine
npm install && npm run build
```
### Configure OpenClaw
Add the plugin to your `~/.openclaw/openclaw.json`:
```json
{
"plugins": {
"enabled": true,
"load": {
"paths": ["~/.openclaw/plugins/doc-engine"]
},
"entries": {
"doc-engine": {
"enabled": true,
"config": {
"repositories": [
{
"name": "official-docs",
"path": "/path/to/your/openclaw-docs",
"priority": 1,
"type": "core",
"glob": "**/*.md"
}
]
}
}
}
}
}
```
You can add multiple repositories with different priorities. Lower priority numbers indicate higher authority: when two repos contain documentation on the same topic, the higher-authority source wins during conflict resolution.
### Activate
Restart the gateway to load the plugin:
```bash
openclaw gateway restart
```
The engine will scan, chunk, and index all configured repositories on startup. Subsequent restarts use persisted state and only re-index changed files.
---
## Architecture
```
Markdown Files --> Scanner --> Chunker --> TF-IDF --> Vector Store --> Search API
| | |
Secret Redact Vocabulary Fit Cosine Sim
```
### Design Principles
- **Multi-repo indexing**: Index documentation from any number of repositories simultaneously, each with configurable priority and type (`core` or `extension`).
- **TF-IDF vectorization**: Fully offline semantic search. No external embedding API calls, no network dependency. The vocabulary is fitted across all indexed documents.
- **SHA256 incremental indexing**: File hashes are tracked between runs. Only added or changed files are re-processed; unchanged files are skipped entirely.
- **Chunk persistence**: Chunks, vectors, TF-IDF model state, and file hashes are all persisted to disk (`chunks.json`, `vectors.json`, `tfidf.json`, `hashes.json`). Restarts are near-instant.
- **Secret scanning**: Configurable regex patterns detect and redact credentials, API keys, and tokens before content enters the index. Prevents accidental credential leakage.
- **OpenClaw Plugin API integration**: Registers four extension points: a background service (`doc-indexer`), a tool (`semantic_doc_search`), a CLI command group (`docsearch`), and a chat command (`/docsearch`).
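The secret-scanning principle above can be illustrated with a minimal sketch. The pattern list, the `redactSecrets` name, and the `[REDACTED]` placeholder are assumptions made for illustration; the plugin's actual default patterns and configuration keys are not shown in this README.

```typescript
// Hypothetical pattern list; the plugin's real defaults may differ.
const SECRET_PATTERNS: RegExp[] = [
  /AKIA[0-9A-Z]{16}/g, // AWS-style access key ID
  /ghp_[A-Za-z0-9]{36}/g, // GitHub personal access token
  /api[_-]?key\s*[:=]\s*[A-Za-z0-9_-]{16,}/gi, // generic key=value assignment
];

// Replace every match of every pattern before the text is chunked or indexed.
function redactSecrets(
  text: string,
  patterns: RegExp[] = SECRET_PATTERNS,
): string {
  return patterns.reduce(
    (acc, pattern) => acc.replace(pattern, "[REDACTED]"),
    text,
  );
}
```

Redacting before chunking means the secret never reaches the persisted `chunks.json` or the TF-IDF vocabulary.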
---
## Module Reference
### Entry Point
#### `src/index.ts`: Plugin Registration
The plugin entry point. Exports a default plugin object that registers all four extension points with the OpenClaw Plugin API:
| Extension | Type | Identifier |
|-----------|------|------------|
| Background service | `registerService` | `doc-indexer` |
| Tool | `registerTool` | `semantic_doc_search` |
| CLI commands | `registerCli` | `docsearch` |
| Chat command | `registerCommand` | `/docsearch` |
The service starts the engine on gateway boot and stops it on shutdown. The tool, CLI, and command all delegate to the engine's `search()` method.
---
### Core Engine
#### `src/engine.ts`: Engine Orchestrator
The central coordinator. Exposes the `DocEngine` interface:
```typescript
interface DocEngine {
start(): Promise<void>; // Load persisted state, run incremental index
stop(): Promise<void>; // Persist state to disk
search(query: string, opts?: SearchOptions): Promise<SearchResult[]>;
index(full?: boolean): Promise<IndexStats>;
getStats(): IndexStats;
}
```
The `start()` lifecycle: load saved TF-IDF model, vectors, and chunks from disk, then run an incremental index pass. The `search()` pipeline: embed query via TF-IDF, retrieve candidates from vector store (over-fetching 2x for dedup headroom), map to `SearchResult` objects, filter zero-score results, apply conflict resolution, and return the top K.
When the TF-IDF vocabulary grows (new documents introduce new terms), the engine automatically re-embeds all existing vectors to match the new dimensionality.
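The `search()` pipeline described above can be sketched as a single function with its collaborators injected. The `Hit` shape, the function signatures, and the default `topK` of 5 are assumptions for illustration, not the plugin's actual internals.

```typescript
// Simplified stand-in for SearchResult (assumed shape).
interface Hit {
  chunkId: string;
  sectionPath: string;
  score: number;
  repoPriority: number;
}

function searchPipeline(
  query: string,
  embed: (text: string) => number[],
  topKFromStore: (queryVector: number[], k: number) => Hit[],
  resolveConflicts: (hits: Hit[]) => Hit[],
  topK = 5, // hypothetical default
): Hit[] {
  const queryVector = embed(query);
  // Over-fetch 2x so deduplication still leaves enough results.
  const candidates = topKFromStore(queryVector, topK * 2);
  // Drop results that share no terms with the query.
  const nonZero = candidates.filter((hit) => hit.score > 0);
  // Deduplicate overlapping sections, then trim to the requested size.
  return resolveConflicts(nonZero).slice(0, topK);
}
```

Passing the embedder, store lookup, and resolver as parameters keeps the sketch independent of any concrete storage or model.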
#### `src/types.ts`: Type Definitions
All TypeScript interfaces used across the codebase:
| Interface | Purpose |
|-----------|---------|
| `RepoConfig` | Repository configuration (name, path, priority, type, glob) |
| `DocChunk` | A chunk of text with metadata (chunkId, repo, file, sectionPath, hash, tokenCount) |
| `SearchResult` | Extends `DocChunk` with `score`, `repoType`, and `repoPriority` |
| `SearchOptions` | Query options (`topK`, `repoFilter`) |
| `IndexStats` | Index statistics (total chunks, files, per-repo breakdown) |
| `StoredVector` | Vector with associated metadata for storage |
| `FileHashRecord` | File path, hash, repo, and last-indexed timestamp |
| `DocEngineConfig` | Full plugin configuration interface |
| `EmbeddingProvider` | Pluggable embedding interface (`embed`, `embedBatch`, `dimensions`) |
| `HashDiff` | Diff result: arrays of added, changed, removed, and unchanged paths |
---
### Configuration
#### `src/config/defaults.ts`: Default Configuration
Merges user-provided partial config with sensible defaults. Any field omitted from the plugin config falls back to the default value.
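A minimal sketch of this kind of merge is shown below. The `ConfigSketch` interface covers only two fields and is an assumption; the `chunkMaxTokens` default of 800 is the value documented under the Markdown chunker, while the full `DocEngineConfig` shape is not reproduced here.

```typescript
// Reduced, hypothetical view of the plugin configuration.
interface ConfigSketch {
  chunkMaxTokens: number;
  repositories: { name: string; path: string }[];
}

const DEFAULT_CONFIG: ConfigSketch = {
  chunkMaxTokens: 800,
  repositories: [],
};

// Fields that are omitted (or explicitly undefined) fall back to the default.
function withDefaults(user: Partial<ConfigSketch> = {}): ConfigSketch {
  return {
    chunkMaxTokens: user.chunkMaxTokens ?? DEFAULT_CONFIG.chunkMaxTokens,
    repositories: user.repositories ?? DEFAULT_CONFIG.repositories,
  };
}
```

Using `??` per field rather than an object spread avoids an explicit `undefined` in the user config overwriting a default.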
#### `src/config/schema.ts`: Schema Validation
JSON Schema definition for plugin configuration. Used for validation at load time.
---
### Embeddings
#### `src/embeddings/TfIdfEngine.ts`: TF-IDF Vectorizer
The core embedding engine. Implements term frequency-inverse document frequency vectorization.
Key operations:
- **`fit(documents: string[])`**: Build or rebuild the vocabulary from a corpus. Computes IDF weights for all terms. Called during indexing whenever new documents are added.
- **`embed(text: string): number[]`**: Convert a text string into a TF-IDF vector. Used for both document chunks (at index time) and queries (at search time).
- **`serialize()` / `deserialize()`**: Save and restore the fitted model (vocabulary + IDF weights) to/from JSON. Enables fast restarts without refitting.
- **`dimensions`**: The current vocabulary size (number of unique terms).
The vocabulary grows as new documents are indexed. When this happens, all previously stored vectors must be re-embedded to match the new dimensionality; the engine handles this automatically.
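To make the `fit`/`embed` contract concrete, here is a small sketch of a TF-IDF vectorizer. The class name `TinyTfIdf`, the tokenizer, and the smoothed IDF formula are assumptions; the plugin's `TfIdfEngine` may tokenize and weight terms differently.

```typescript
class TinyTfIdf {
  private vocab = new Map<string, number>(); // term -> vector index
  private idf: number[] = [];

  // Assumed tokenizer: lowercase alphanumeric runs.
  private tokenize(text: string): string[] {
    return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  }

  fit(documents: string[]): void {
    this.vocab.clear();
    const docFreq: number[] = [];
    for (const doc of documents) {
      // Count each term once per document.
      for (const term of Array.from(new Set(this.tokenize(doc)))) {
        let idx = this.vocab.get(term);
        if (idx === undefined) {
          idx = this.vocab.size;
          this.vocab.set(term, idx);
        }
        docFreq[idx] = (docFreq[idx] ?? 0) + 1;
      }
    }
    // One common smoothed variant: ln((1 + N) / (1 + df)) + 1.
    const n = documents.length;
    this.idf = docFreq.map((df) => Math.log((1 + n) / (1 + df)) + 1);
  }

  embed(text: string): number[] {
    const vec = new Array<number>(this.vocab.size).fill(0);
    const tokens = this.tokenize(text);
    for (const token of tokens) {
      const idx = this.vocab.get(token);
      if (idx !== undefined) {
        // Relative term frequency weighted by IDF; unknown terms are ignored.
        vec[idx] = (vec[idx] ?? 0) + (this.idf[idx] ?? 0) / tokens.length;
      }
    }
    return vec;
  }

  get dimensions(): number {
    return this.vocab.size;
  }
}
```

Because `embed` sizes its output from the current vocabulary, refitting on a larger corpus changes `dimensions`, which is why previously stored vectors have to be re-embedded.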
#### `src/embeddings/VectorStore.ts`: Vector Storage
In-memory vector store with persistence.
- **`upsert(id, entry)`**: Insert or update a vector by chunk ID.
- **`remove(id)`**: Delete a vector.
- **`topK(queryVector, k, filter?)`**: Retrieve the top K most similar vectors via cosine similarity, with optional metadata filtering (e.g., by repo name).
- **`clear()`**: Wipe all stored vectors (used during full re-index).
- **`load(path)` / `save(path)`**: Persist vectors to and from a JSON file on disk.
#### `src/embeddings/Similarity.ts`: Cosine Similarity
Implements the `cosineSimilarity(a, b)` function used by `VectorStore.topK()` to rank search results. Because TF-IDF vectors have no negative components, the result lies between 0 (no shared terms) and 1 (identical direction).
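The ranking step can be sketched as follows. The `cosineSimilarity` body is the standard formula; the `StoredEntry` shape and the free-standing `rankTopK` helper are assumptions that stand in for `VectorStore.topK()`, whose exact signature and entry type are not shown here.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const length = Math.max(a.length, b.length);
  for (let i = 0; i < length; i++) {
    const x = a[i] ?? 0; // treat missing components as zero
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // A zero vector has no direction; report no similarity.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical stored entry: a vector plus the metadata used for filtering.
interface StoredEntry {
  id: string;
  repo: string;
  vector: number[];
}

function rankTopK(
  entries: StoredEntry[],
  queryVector: number[],
  k: number,
  filter?: (entry: StoredEntry) => boolean,
): { id: string; score: number }[] {
  return entries
    .filter((entry) => (filter ? filter(entry) : true))
    .map((entry) => ({
      id: entry.id,
      score: cosineSimilarity(queryVector, entry.vector),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

This is a linear scan over all entries, which is a reasonable fit for an in-memory store of documentation-sized corpora.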
---
### Indexing
#### `src/indexing/MarkdownChunker.ts`: Markdown-Aware Chunker
Splits Markdown documents into chunks that respect the heading hierarchy. Each chunk:
- Stays within the configured `chunkMaxTokens` limit (default: 800).
- Preserves its heading path (e.g., `# Configuration > ## Models > ### Fallbacks`) as the `sectionPath` field.
- Gets a deterministic `chunkId` derived from `SHA256(repo + file + sectionPath)`.
This ensures search results carry meaningful section context, not just raw text fragments.
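The heading-path tracking can be sketched as below. This is a simplified illustration under stated assumptions: it splits only at headings, ignores the `chunkMaxTokens` limit, and does not compute the `chunkId` hash; the `SectionChunk` type and `chunkByHeadings` name are hypothetical.

```typescript
interface SectionChunk {
  sectionPath: string;
  text: string;
}

const FENCE = "`".repeat(3); // code fence marker, built to keep this example readable

function chunkByHeadings(markdown: string): SectionChunk[] {
  const chunks: SectionChunk[] = [];
  const stack: string[] = []; // stack[i] is the active heading at level i + 1
  let body: string[] = [];
  let inFence = false;

  const flush = (): void => {
    const text = body.join("\n").trim();
    if (text) {
      const sectionPath = stack.filter(Boolean).join(" > ");
      chunks.push({ sectionPath, text });
    }
    body = [];
  };

  for (const line of markdown.split("\n")) {
    if (line.trim().startsWith(FENCE)) inFence = !inFence;
    // Lines starting with '#' inside a code fence are not headings.
    const match = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      flush();
      const hashes = match[1] ?? "";
      const title = (match[2] ?? "").trim();
      stack.length = hashes.length - 1; // drop same-level and deeper headings
      stack[hashes.length - 1] = `${hashes} ${title}`;
    } else {
      body.push(line);
    }
  }
  flush();
  return chunks;
}
```

For a document with `# Configuration`, `## Models`, and `### Fallbacks` headings, the innermost section's text is emitted with the path `# Configuration > ## Models > ### Fallbacks`.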
#### `src/indexing/FileHasher.ts`: SHA256 File Hashing
Handles incremental indexing by tracking file content hashes.
- **`hashFile(path)`**: Compute the SHA256 hash of a file.
- **`loadHashes(path)` / `saveHashes(path, hashes)`**: Persist hash records between runs.
- **`diffHashes(stored, current)`**: Compare stored vs. current hashes. Returns a `HashDiff` with four arrays: `added`, `changed`, `removed`, `unchanged`. Only `added` and `changed` files need re-processing.
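The diff logic can be sketched as follows. For simplicity this version takes plain `Map<path, hash>` inputs, which is an assumption; the plugin stores richer `FileHashRecord` entries, and its actual parameter types are not shown in this README.

```typescript
interface HashDiff {
  added: string[];
  changed: string[];
  removed: string[];
  unchanged: string[];
}

function diffHashes(
  stored: Map<string, string>,
  current: Map<string, string>,
): HashDiff {
  const diff: HashDiff = { added: [], changed: [], removed: [], unchanged: [] };

  current.forEach((hash, path) => {
    const previous = stored.get(path);
    if (previous === undefined) diff.added.push(path); // new file
    else if (previous !== hash) diff.changed.push(path); // content changed
    else diff.unchanged.push(path); // safe to skip
  });

  // Anything stored but no longer on disk has been removed.
  stored.forEach((_hash, path) => {
    if (!current.has(path)) diff.removed.push(path);
  });

  return diff;
}
```

Chunks and vectors belonging to `removed` paths would also need to be dropped from the index so stale sections stop appearing in results.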
#### `src/indexing/IndexBuilder.ts`: Index Construction Helpers
Utility functions for building and managing the index data structures.
---
### Registry
#### `src/registry/RepoRegistry.ts`: Repository Registry
Manages the set of configured documentation repositories.
- **`scan()`**: Walk each repository's file tree using the configured glob pattern (default: `**/*.md`). Returns a `Map<repoName, absolutePaths[]>`.
- **`getRepo(name)`**: Look up a `RepoConfig` by name.
Supports both `core` and `extension` repository types. When two repositories share the same priority, the `core` repository outranks the `extension` repository during conflict resolution.
---
### Retrieval
#### `src/retrieval/ConflictResolver.ts`: Priority-Based Conflict Resolution
When multiple repositories contain documentation on the same topic (overlapping `sectionPath` values), this module deduplicates results:
1. Groups results by normalized section path.
2. Within each group, selects the result from the highest-priority repo (lowest `priority` number).
3. `core` type repos always outrank `extension` type repos at the same priority level.
This ensures authoritative documentation is always surfaced first.
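The three rules above can be sketched as follows. The normalization (lowercasing and collapsing whitespace), the score-based tie-break, and the final sort by score are assumptions; the README does not specify how paths are normalized or how remaining ties are broken.

```typescript
interface RankedResult {
  chunkId: string;
  sectionPath: string;
  score: number;
  repoType: "core" | "extension";
  repoPriority: number;
}

// Assumed normalization: case-insensitive, whitespace-insensitive.
function normalizeSectionPath(path: string): string {
  return path.toLowerCase().replace(/\s+/g, " ").trim();
}

function outranks(a: RankedResult, b: RankedResult): boolean {
  // Lower priority number means higher authority.
  if (a.repoPriority !== b.repoPriority) return a.repoPriority < b.repoPriority;
  // At equal priority, core beats extension.
  if (a.repoType !== b.repoType) return a.repoType === "core";
  // Assumed tie-break: prefer the better-scoring chunk.
  return a.score > b.score;
}

function resolveConflicts(results: RankedResult[]): RankedResult[] {
  const winners = new Map<string, RankedResult>();
  for (const result of results) {
    const key = normalizeSectionPath(result.sectionPath);
    const current = winners.get(key);
    if (current === undefined || outranks(result, current)) {
      winners.set(key, result);
    }
  }
  return Array.from(winners.values()).sort((a, b) => b.score - a.score);
}
```

Note that under these rules a lower-scoring chunk from a more authoritative repository replaces a higher-scoring duplicate from a less authoritative one.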
#### `src/retrieval/RetrievalEngine.ts`: Search Orchestration
Coordinates the retrieval pipeline:
... (truncated)