Channels
Wa Archive
OpenClaw plugin: WhatsApp message archive with FTS5 + semantic search
Install
npm install
npm
Configuration Example
{
"plugins": {
"load": {
"paths": [
"/path/to/openclaw-wa-archive" // absolute path to the cloned repo
]
},
"entries": {
"wa-archive": {
"enabled": true,
"config": {
"dataDir": "~/.openclaw/data/wa-archive",
"enableEmbeddings": true,
"mediaDownload": true,
"allowFrom": ["+972XXXXXXXXX"] // phone numbers allowed to use search tools
}
}
}
}
}
README
# openclaw-wa-archive
OpenClaw plugin that archives **all** WhatsApp messages (inbound + outbound) into a SQLite database with full-text search (FTS5) and semantic/vector search (sqlite-vec + OpenAI embeddings).
## Features
- ๐ฅ **Auto-archive** โ hooks into WhatsApp message events, stores every message automatically
- ๐ **Full-text search** โ FTS5 for fast keyword search across all conversations
- ๐ง **Semantic search** โ OpenAI embeddings + sqlite-vec for meaning-based search
- ๐ **Hybrid search** โ combines FTS + semantic for best results (default mode)
- ๐ **Stats tool** โ message counts, per-chat/sender breakdowns, busiest hours
- ๐ **Media download** โ saves images, videos, audio, documents locally
- ๐ **Backfill** โ import existing JSONL session transcripts from OpenClaw history
- ๐ **Access control** โ restrict search tools to specific phone numbers
## Installation
### 1. Clone the repo
```bash
git clone https://github.com/regevguym/openclaw-wa-archive.git
cd openclaw-wa-archive
npm install
npm run build
```
### 2. Configure OpenClaw
Add to your `openclaw.json`:
```jsonc
{
"plugins": {
"load": {
"paths": [
"/path/to/openclaw-wa-archive" // absolute path to the cloned repo
]
},
"entries": {
"wa-archive": {
"enabled": true,
"config": {
"dataDir": "~/.openclaw/data/wa-archive",
"enableEmbeddings": true,
"mediaDownload": true,
"allowFrom": ["+972XXXXXXXXX"] // phone numbers allowed to use search tools
}
}
}
}
}
```
### 3. Restart OpenClaw
```bash
openclaw gateway restart
```
The plugin will create the database and start archiving messages immediately.
## Configuration
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `dataDir` | string | `~/.openclaw/data/wa-archive` | Directory for SQLite DB and media files |
| `enableEmbeddings` | boolean | `true` | Enable semantic search via OpenAI embeddings |
| `embeddingModel` | string | `text-embedding-3-small` | OpenAI embedding model (1536 dimensions) |
| `mediaDownload` | boolean | `true` | Download and store media files locally |
| `allowFrom` | string[] | `[]` | Phone numbers allowed to use wa_search and wa_stats tools |
### Environment Variables
- `OPENAI_API_KEY` โ Required for embeddings. The plugin reads this from the environment.
## Tools
### `wa_search`
Search the WhatsApp message archive. Supports full-text, semantic, or hybrid search.
**Parameters:**
- `query` (string, required) โ Search query (natural language or keywords)
- `sender` (string) โ Filter by sender name or phone number
- `chat` (string) โ Filter by chat/group name or ID
- `chat_type` (string) โ `"group"` or `"direct"`
- `from_date` (string) โ Start date (ISO 8601 or relative: `"2 days ago"`, `"yesterday"`, `"last week"`)
- `to_date` (string) โ End date
- `mode` (string) โ `"fts"` (full-text), `"semantic"` (vector), `"hybrid"` (both, default)
- `limit` (number) โ Max results (default: 20)
### `wa_stats`
Get statistics about the WhatsApp message archive.
**Parameters:**
- `period` (string) โ `"today"`, `"week"`, `"month"`, `"all"` (default: `"week"`)
- `chat` (string) โ Filter by specific chat name or ID
- `sender` (string) โ Filter by specific sender
**Returns:** Total messages, per-chat breakdown, per-sender breakdown, busiest hours, inbound/outbound ratio.
## Backfill โ Importing Historical Messages
The plugin can import existing WhatsApp messages from OpenClaw's JSONL session transcripts.
### Automatic (via OpenClaw command)
```
/wa-backfill
```
### How backfill works
1. Scans `~/.openclaw/agents/main/sessions/*.jsonl`
2. Reads `sessions.json` to map session IDs โ session keys (which contain chat metadata)
3. Parses each JSONL file line by line:
- **User messages** (role: `user`) โ stored as inbound
- **Assistant messages** with `message` tool calls (action: `send`) โ stored as outbound
- Direct assistant text replies โ stored as outbound
- `NO_REPLY` and `HEARTBEAT_OK` messages are skipped
4. Session keys encode chat info: `agent:main:whatsapp:group:[email protected]` or `agent:main:whatsapp:direct:+972...`
5. Inserts are idempotent (`INSERT OR IGNORE`) โ safe to run multiple times
6. After import, queues embedding generation for all new messages
### Batch Embedding Script
If you need to (re-)embed all messages (e.g., after backfill or if embeddings were disabled):
```bash
OPENAI_API_KEY=sk-... node scripts/batch-embed.js
```
This script:
- Finds all messages without embeddings
- Processes in batches of 50
- Rate limits with 1s delay between batches
- Handles 429 rate limits with exponential backoff
- Strips metadata envelope from content for cleaner embeddings
- Reports progress as it goes
**Cost:** ~$0.01 for 2,500 messages with `text-embedding-3-small`.
## Database
SQLite database at `{dataDir}/messages.db` with:
- **`messages`** table โ all message data (id, chat, sender, content, timestamps, media paths, etc.)
- **`messages_fts`** โ FTS5 virtual table for full-text search (auto-synced via triggers)
- **`messages_vec`** โ sqlite-vec virtual table for vector similarity search (1536-dim float embeddings)
### Schema highlights
```sql
CREATE TABLE messages (
id TEXT PRIMARY KEY,
session_key TEXT,
chat_id TEXT NOT NULL,
chat_type TEXT NOT NULL, -- 'group' or 'direct'
chat_name TEXT,
sender_id TEXT,
sender_name TEXT,
timestamp INTEGER NOT NULL, -- Unix ms
content TEXT,
media_local_path TEXT,
media_type TEXT,
is_from_me INTEGER DEFAULT 0,
direction TEXT DEFAULT 'inbound',
channel TEXT DEFAULT 'whatsapp',
account_id TEXT,
metadata TEXT -- JSON blob
);
```
SQLite is in WAL mode for concurrent read performance.
## Architecture
```
openclaw-wa-archive/
โโโ openclaw.plugin.json # Plugin manifest
โโโ package.json
โโโ tsconfig.json
โโโ scripts/
โ โโโ batch-embed.js # Standalone batch embedding script
โโโ src/
โ โโโ index.ts # Plugin entry โ register() hooks, tools, commands
โ โโโ db.ts # SQLite setup, migrations, helpers
โ โโโ ingest.ts # Message event โ DB row
โ โโโ embeddings.ts # OpenAI embedding generation (async, batched)
โ โโโ media.ts # Media download + local storage
โ โโโ backfill.ts # JSONL transcript โ DB import
โ โโโ tools/
โ โโโ wa-search.ts # wa_search tool
โ โโโ wa-stats.ts # wa_stats tool
โโโ data/ # Created at runtime (gitignored)
โโโ messages.db # SQLite database
โโโ media/ # Downloaded media files
```
## Important Notes
- **Graceful degradation** โ If sqlite-vec fails to load, falls back to FTS-only search. If OpenAI API is unavailable, skips embeddings but still stores messages.
- **Non-blocking** โ Embedding generation and media downloads happen asynchronously, never block message processing.
- **Idempotent** โ All inserts use `INSERT OR IGNORE`. Safe to re-run backfill.
- **Performance** โ WAL mode, batched inserts, indexed on chat_id, sender_id, timestamp.
- **Access control** โ Tools check sender against `allowFrom` config. Only listed phone numbers can search.
## Troubleshooting
### Embeddings not working
- Check `OPENAI_API_KEY` is set in environment
- Verify with: `echo $OPENAI_API_KEY`
- Run `node scripts/batch-embed.js` manually to see errors
### sqlite-vec not loading
- Ensure `npm install` completed (installs `sqlite-vec` and `better-sqlite3`)
- On some platforms, `better-sqlite3` needs native compilation: `npm rebuild better-sqlite3`
### No messages appearing
- Check OpenClaw logs for `[wa-archive] Plugin loaded successfully`
- Verify the plugin path in `openclaw.json` is correct
- Ensure `wa-archive` is in `plugins.entries` with `enabled: true`
## License
MIT
channels
Comments
Sign in to leave a comment