Plugin Doc Engine

By Muminur

Semantic documentation engine plugin for OpenClaw: TF-IDF search across multi-repo docs with incremental indexing

# OpenClaw Plugin: Doc Engine

**Multi-repo semantic documentation retrieval for OpenClaw.** Indexes your Markdown documentation across multiple repositories and provides fast, offline semantic search via TF-IDF vectorization. No external APIs required.

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Node.js](https://img.shields.io/badge/node-%3E%3D20-brightgreen.svg)](https://nodejs.org/)

---

## Installation

### Quick Install (Recommended)

```bash
curl -fsSL https://raw.githubusercontent.com/Muminur/openclaw-plugin-doc-engine/main/scripts/install.sh | bash
```

### Manual Installation

```bash
git clone https://github.com/Muminur/openclaw-plugin-doc-engine.git ~/.openclaw/plugins/doc-engine
cd ~/.openclaw/plugins/doc-engine
npm install && npm run build
```

### Configure OpenClaw

Add the plugin to your `~/.openclaw/openclaw.json`:

```json
{
  "plugins": {
    "enabled": true,
    "load": {
      "paths": ["~/.openclaw/plugins/doc-engine"]
    },
    "entries": {
      "doc-engine": {
        "enabled": true,
        "config": {
          "repositories": [
            {
              "name": "official-docs",
              "path": "/path/to/your/openclaw-docs",
              "priority": 1,
              "type": "core",
              "glob": "**/*.md"
            }
          ]
        }
      }
    }
  }
}
```

You can add multiple repositories with different priorities. Lower priority numbers indicate higher authority: when two repos contain documentation on the same topic, the higher-authority source wins during conflict resolution.

### Activate

Restart the gateway to load the plugin:

```bash
openclaw gateway restart
```

The engine will scan, chunk, and index all configured repositories on startup. Subsequent restarts use persisted state and only re-index changed files.

---

## Architecture

```
Markdown Files --> Scanner --> Chunker --> TF-IDF --> Vector Store --> Search API
                     |                      |             |
               Secret Redact         Vocabulary Fit   Cosine Sim
```

### Design Principles

- **Multi-repo indexing**: Index documentation from any number of repositories simultaneously, each with configurable priority and type (`core` or `extension`).
- **TF-IDF vectorization**: Fully offline semantic search. No external embedding API calls, no network dependency. The vocabulary is fitted across all indexed documents.
- **SHA256 incremental indexing**: File hashes are tracked between runs. Only added or changed files are re-processed; unchanged files are skipped entirely.
- **Chunk persistence**: Chunks, vectors, TF-IDF model state, and file hashes are all persisted to disk (`chunks.json`, `vectors.json`, `tfidf.json`, `hashes.json`), so restarts are near-instant.
- **Secret scanning**: Configurable regex patterns detect and redact credentials, API keys, and tokens before content enters the index, which prevents accidental credential leakage.
- **OpenClaw Plugin API integration**: Registers a background service (`doc-indexer`), a tool (`semantic_doc_search`), a CLI command group (`docsearch`), and a chat command (`/docsearch`), four extension points in total.
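
The secret-scanning principle can be illustrated with a minimal sketch. The patterns and the `redactSecrets` helper below are hypothetical stand-ins, since this README does not list the plugin's actual default patterns or its redaction function.

```typescript
// Hypothetical patterns for illustration; not the plugin's real defaults.
const SECRET_PATTERNS: RegExp[] = [
  /AKIA[0-9A-Z]{16}/g, // shape of an AWS access key ID
  /ghp_[A-Za-z0-9]{36}/g, // shape of a GitHub personal access token
  /(api[_-]?key|token|secret)\s*[:=]\s*\S+/gi, // generic key/value assignments
];

// Replace every match of every pattern before the text reaches the index.
function redactSecrets(text: string, patterns: RegExp[] = SECRET_PATTERNS): string {
  return patterns.reduce((acc, pattern) => acc.replace(pattern, "[REDACTED]"), text);
}
```

Running redaction before chunking means a leaked credential never reaches the persisted files on disk.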

---

## Module Reference

### Entry Point

#### `src/index.ts`: Plugin Registration

The plugin entry point. Exports a default plugin object that registers all four extension points with the OpenClaw Plugin API:

| Extension | Type | Identifier |
|-----------|------|------------|
| Background service | `registerService` | `doc-indexer` |
| Tool | `registerTool` | `semantic_doc_search` |
| CLI commands | `registerCli` | `docsearch` |
| Chat command | `registerCommand` | `/docsearch` |

The service starts the engine on gateway boot and stops it on shutdown. The tool, CLI, and command all delegate to the engine's `search()` method.

---

### Core Engine

#### `src/engine.ts`: Engine Orchestrator

The central coordinator. Exposes the `DocEngine` interface:

```typescript
interface DocEngine {
  start(): Promise<void>;    // Load persisted state, run incremental index
  stop(): Promise<void>;     // Persist state to disk
  search(query: string, opts?: SearchOptions): Promise<SearchResult[]>;
  index(full?: boolean): Promise<IndexStats>;
  getStats(): IndexStats;
}
```

The `start()` lifecycle: load saved TF-IDF model, vectors, and chunks from disk, then run an incremental index pass. The `search()` pipeline: embed query via TF-IDF, retrieve candidates from vector store (over-fetching 2x for dedup headroom), map to `SearchResult` objects, filter zero-score results, apply conflict resolution, and return the top K.
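
The ordering of the `search()` steps can be sketched as follows. This is an illustration under assumptions, not the engine's code: `runSearch`, the `Hit` shape, and both callbacks are hypothetical stand-ins for the vector store lookup and the conflict resolver.

```typescript
interface Hit {
  chunkId: string;
  sectionPath: string;
  score: number;
}

// `fetchCandidates` stands in for the vector store lookup and
// `resolveConflicts` for the priority-based deduplication step.
function runSearch(
  fetchCandidates: (limit: number) => Hit[],
  resolveConflicts: (hits: Hit[]) => Hit[],
  topK: number,
): Hit[] {
  const candidates = fetchCandidates(topK * 2); // over-fetch 2x for dedup headroom
  const scored = candidates.filter((hit) => hit.score > 0); // drop zero-score results
  return resolveConflicts(scored).slice(0, topK); // dedupe, then cut to top K
}
```

Over-fetching matters because deduplication can only shrink the candidate list; fetching exactly `topK` would often return fewer results than requested.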

When the TF-IDF vocabulary grows (new documents introduce new terms), the engine automatically re-embeds all existing vectors to match the new dimensionality.

#### `src/types.ts`: Type Definitions

All TypeScript interfaces used across the codebase:

| Interface | Purpose |
|-----------|---------|
| `RepoConfig` | Repository configuration (name, path, priority, type, glob) |
| `DocChunk` | A chunk of text with metadata (chunkId, repo, file, sectionPath, hash, tokenCount) |
| `SearchResult` | Extends `DocChunk` with `score`, `repoType`, and `repoPriority` |
| `SearchOptions` | Query options (`topK`, `repoFilter`) |
| `IndexStats` | Index statistics (total chunks, files, per-repo breakdown) |
| `StoredVector` | Vector with associated metadata for storage |
| `FileHashRecord` | File path, hash, repo, and last-indexed timestamp |
| `DocEngineConfig` | Full plugin configuration interface |
| `EmbeddingProvider` | Pluggable embedding interface (`embed`, `embedBatch`, `dimensions`) |
| `HashDiff` | Diff result: arrays of added, changed, removed, and unchanged paths |

---

### Configuration

#### `src/config/defaults.ts`: Default Configuration

Merges user-provided partial config with sensible defaults. Any field omitted from the plugin config falls back to the default value.
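
A minimal sketch of this kind of merge is shown below. Only `chunkMaxTokens` and its default of 800 come from this README; the `defaultTopK` field, its value, and the `withDefaults` name are assumptions made for illustration.

```typescript
interface SketchConfig {
  chunkMaxTokens: number;
  defaultTopK: number; // hypothetical field, not confirmed by the README
}

const DEFAULTS: SketchConfig = { chunkMaxTokens: 800, defaultTopK: 5 };

// Field-by-field fallback: an omitted (or undefined) value takes the default.
function withDefaults(user: Partial<SketchConfig> = {}): SketchConfig {
  return {
    chunkMaxTokens: user.chunkMaxTokens ?? DEFAULTS.chunkMaxTokens,
    defaultTopK: user.defaultTopK ?? DEFAULTS.defaultTopK,
  };
}
```

Using `??` per field, rather than a plain object spread, keeps an explicit `undefined` in the user config from overwriting a default.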

#### `src/config/schema.ts`: Schema Validation

JSON Schema definition for plugin configuration. Used for validation at load time.

---

### Embeddings

#### `src/embeddings/TfIdfEngine.ts`: TF-IDF Vectorizer

The core embedding engine. Implements term frequency-inverse document frequency vectorization.

Key operations:

- **`fit(documents: string[])`**: Build or rebuild the vocabulary from a corpus. Computes IDF weights for all terms. Called during indexing whenever new documents are added.
- **`embed(text: string): number[]`**: Convert a text string into a TF-IDF vector. Used for both document chunks (at index time) and queries (at search time).
- **`serialize()` / `deserialize()`**: Save and restore the fitted model (vocabulary + IDF weights) to/from JSON. Enables fast restarts without refitting.
- **`dimensions`**: The current vocabulary size (number of unique terms).

The vocabulary grows as new documents are indexed. When this happens, all previously stored vectors must be re-embedded to match the new dimensionality; the engine handles this automatically.
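
To make `fit`, `embed`, and `dimensions` concrete, here is a small self-contained sketch. `TinyTfIdf` and `tokenize` are hypothetical names, and the tokenizer and the smoothed IDF formula are assumptions (one common variant), not necessarily what `TfIdfEngine.ts` uses.

```typescript
// Assumed tokenizer: lowercase alphanumeric runs.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

class TinyTfIdf {
  private vocab: Map<string, number> = new Map(); // term -> vector index
  private idf: number[] = [];

  // Rebuild the vocabulary and IDF weights from the whole corpus.
  fit(documents: string[]): void {
    const docFreq = new Map<string, number>();
    for (const doc of documents) {
      new Set(tokenize(doc)).forEach((term) => {
        docFreq.set(term, (docFreq.get(term) ?? 0) + 1);
      });
    }
    this.vocab = new Map();
    this.idf = [];
    docFreq.forEach((df, term) => {
      this.vocab.set(term, this.idf.length);
      // Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1
      this.idf.push(Math.log((1 + documents.length) / (1 + df)) + 1);
    });
  }

  get dimensions(): number {
    return this.vocab.size;
  }

  // Term frequency times IDF; terms outside the vocabulary are ignored.
  embed(text: string): number[] {
    const tokens = tokenize(text);
    const vec: number[] = new Array(this.dimensions).fill(0);
    for (const token of tokens) {
      const index = this.vocab.get(token);
      if (index !== undefined) vec[index] = (vec[index] ?? 0) + 1 / tokens.length;
    }
    return vec.map((tf, i) => tf * (this.idf[i] ?? 0));
  }
}
```

In this sketch, calling `fit` again with a larger corpus changes `dimensions`, which is why vectors produced before the refit no longer line up with new ones and have to be re-embedded.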

#### `src/embeddings/VectorStore.ts`: Vector Storage

In-memory vector store with persistence.

- **`upsert(id, entry)`**: Insert or update a vector by chunk ID.
- **`remove(id)`**: Delete a vector.
- **`topK(queryVector, k, filter?)`**: Retrieve the top K most similar vectors via cosine similarity, with optional metadata filtering (e.g., by repo name).
- **`clear()`**: Wipe all stored vectors (used during full re-index).
- **`load(path)` / `save(path)`**: Persist vectors to and from a JSON file on disk.
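
The in-memory part of such a store might look like the sketch below. `TinyVectorStore`, `StoredEntry`, and the injected `Similarity` function are hypothetical; the real store uses cosine similarity and also handles `load`/`save`, which are omitted here.

```typescript
type Similarity = (a: number[], b: number[]) => number;

interface StoredEntry {
  vector: number[];
  repo: string; // metadata available to the optional filter
}

class TinyVectorStore {
  private entries = new Map<string, StoredEntry>();
  private similarity: Similarity;

  constructor(similarity: Similarity) {
    this.similarity = similarity;
  }

  upsert(id: string, entry: StoredEntry): void {
    this.entries.set(id, entry);
  }

  remove(id: string): boolean {
    return this.entries.delete(id);
  }

  clear(): void {
    this.entries.clear();
  }

  // Score every entry that passes the filter, then keep the k best.
  topK(query: number[], k: number, filter?: (entry: StoredEntry) => boolean): { id: string; score: number }[] {
    const scored: { id: string; score: number }[] = [];
    this.entries.forEach((entry, id) => {
      if (filter && !filter(entry)) return;
      scored.push({ id, score: this.similarity(query, entry.vector) });
    });
    return scored.sort((a, b) => b.score - a.score).slice(0, k);
  }
}
```

A linear scan like this is simple and adequate for documentation-sized corpora; it scores every stored vector on each query.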

#### `src/embeddings/Similarity.ts`: Cosine Similarity

Implements the `cosineSimilarity(a, b)` function used by `VectorStore.topK()` to rank search results. Because TF-IDF vectors have no negative components, the result lies between 0 (no shared terms) and 1 (identical direction).
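
For reference, a standard cosine similarity over two equal-length vectors can be written as below. Only the function name comes from the README; the dimension check and the zero-vector handling are assumptions about reasonable behavior, not confirmed details of `Similarity.ts`.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // Assumed convention: a zero vector is similar to nothing.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

The zero-vector guard covers queries made up entirely of terms outside the vocabulary, which embed to all zeros and would otherwise divide by zero.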

---

### Indexing

#### `src/indexing/MarkdownChunker.ts`: Markdown-Aware Chunker

Splits Markdown documents into chunks that respect the heading hierarchy. Each chunk:

- Stays within the configured `chunkMaxTokens` limit (default: 800).
- Preserves its heading path (e.g., `# Configuration > ## Models > ### Fallbacks`) as the `sectionPath` field.
- Gets a deterministic `chunkId` derived from `SHA256(repo + file + sectionPath)`.

This ensures search results carry meaningful section context, not just raw text fragments.
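
The heading-path tracking and the deterministic ID can be sketched as follows. This is a simplified illustration: `splitByHeadings` is a hypothetical helper that ignores the `chunkMaxTokens` limit and does not skip `#` lines inside fenced code blocks, and the separator used when joining the hash inputs is an assumption.

```typescript
import { createHash } from "node:crypto";

interface Section {
  sectionPath: string;
  text: string;
}

// Walk the document line by line, keeping a trail of the current headings.
function splitByHeadings(markdown: string): Section[] {
  const trail: string[] = []; // one slot per heading level
  const sections: Section[] = [];
  let current: Section | undefined;
  for (const line of markdown.split("\n")) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      const hashes = match[1] ?? "";
      const title = (match[2] ?? "").trim();
      trail.length = hashes.length - 1; // drop headings at this level or deeper
      trail[hashes.length - 1] = `${hashes} ${title}`;
      current = { sectionPath: trail.filter(Boolean).join(" > "), text: "" };
      sections.push(current);
    } else if (current) {
      current.text += line + "\n";
    }
  }
  return sections;
}

// SHA256 over repo, file, and section path; the "\0" separator is an assumption.
function chunkId(repo: string, file: string, sectionPath: string): string {
  return createHash("sha256").update([repo, file, sectionPath].join("\0")).digest("hex");
}
```

Because the ID depends only on the repo, file, and section path, re-indexing an edited section produces the same `chunkId`, so an upsert replaces the old vector instead of adding a duplicate.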

#### `src/indexing/FileHasher.ts`: SHA256 File Hashing

Handles incremental indexing by tracking file content hashes.

- **`hashFile(path)`**: Compute the SHA256 hash of a file.
- **`loadHashes(path)` / `saveHashes(path, hashes)`**: Persist hash records between runs.
- **`diffHashes(stored, current)`**: Compare stored vs. current hashes. Returns a `HashDiff` with four arrays: `added`, `changed`, `removed`, `unchanged`. Only `added` and `changed` files need re-processing.
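
The diff step can be sketched as below. The `HashDiff` field names follow the README, but representing both sides as a `Map` from file path to hash is an assumption; the real function may operate on `FileHashRecord` values instead.

```typescript
interface HashDiff {
  added: string[];
  changed: string[];
  removed: string[];
  unchanged: string[];
}

// Both maps are assumed to be keyed by file path with the SHA256 hex digest as value.
function diffHashes(stored: Map<string, string>, current: Map<string, string>): HashDiff {
  const diff: HashDiff = { added: [], changed: [], removed: [], unchanged: [] };
  current.forEach((hash, path) => {
    const previous = stored.get(path);
    if (previous === undefined) diff.added.push(path);
    else if (previous !== hash) diff.changed.push(path);
    else diff.unchanged.push(path);
  });
  stored.forEach((_hash, path) => {
    if (!current.has(path)) diff.removed.push(path);
  });
  return diff;
}
```

Files in `removed` need no re-processing, but their chunks and vectors still have to be deleted from the index.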

#### `src/indexing/IndexBuilder.ts`: Index Construction Helpers

Utility functions for building and managing the index data structures.

---

### Registry

#### `src/registry/RepoRegistry.ts`: Repository Registry

Manages the set of configured documentation repositories.

- **`scan()`**: Walk each repository's file tree using the configured glob pattern (default: `**/*.md`). Returns a `Map<repoName, absolutePaths[]>`.
- **`getRepo(name)`**: Look up a `RepoConfig` by name.

Supports both `core` and `extension` repository types. At equal priority, core repositories outrank extension repositories during conflict resolution.

---

### Retrieval

#### `src/retrieval/ConflictResolver.ts`: Priority-Based Conflict Resolution

When multiple repositories contain documentation on the same topic (overlapping `sectionPath` values), this module deduplicates results:

1. Groups results by normalized section path.
2. Within each group, selects the result from the highest-priority repo (lowest `priority` number).
3. `core` type repos always outrank `extension` type repos at the same priority level.

This ensures authoritative documentation is always surfaced first.
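
The three rules above can be sketched as follows. The `Ranked` shape mirrors the `SearchResult` fields named in this README, while `normalizeSectionPath`, the score tie-break, and the final sort by score are assumptions added to make the example complete.

```typescript
type RepoType = "core" | "extension";

interface Ranked {
  sectionPath: string;
  repo: string;
  repoType: RepoType;
  repoPriority: number;
  score: number;
}

// Assumed normalization: case-insensitive, whitespace-collapsed.
function normalizeSectionPath(path: string): string {
  return path.toLowerCase().replace(/\s+/g, " ").trim();
}

// True when `a` should replace `b` as the winner for a section.
function outranks(a: Ranked, b: Ranked): boolean {
  if (a.repoPriority !== b.repoPriority) return a.repoPriority < b.repoPriority;
  if (a.repoType !== b.repoType) return a.repoType === "core";
  return a.score > b.score; // tie-break on score is an assumption
}

function resolveConflicts(results: Ranked[]): Ranked[] {
  const winners = new Map<string, Ranked>();
  for (const result of results) {
    const key = normalizeSectionPath(result.sectionPath);
    const current = winners.get(key);
    if (current === undefined || outranks(result, current)) winners.set(key, result);
  }
  const deduped: Ranked[] = [];
  winners.forEach((winner) => deduped.push(winner));
  return deduped.sort((a, b) => b.score - a.score);
}
```

Note that in this sketch a lower-scoring result from a higher-priority repo displaces a better-scoring duplicate, which is the trade-off that priority-based resolution makes on purpose.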

#### `src/retrieval/RetrievalEngine.ts`: Search Orchestration

Coordinates the retrieval pipeline: 

... (truncated)