← Back to Plugins
Channels

Wa Archive

regevguym By regevguym 👁 3 views ▲ 0 votes

OpenClaw plugin: WhatsApp message archive with FTS5 + semantic search

GitHub

Install

npm install
npm

Configuration Example

{
  "plugins": {
    "load": {
      "paths": [
        "/path/to/openclaw-wa-archive"  // absolute path to the cloned repo
      ]
    },
    "entries": {
      "wa-archive": {
        "enabled": true,
        "config": {
          "dataDir": "~/.openclaw/data/wa-archive",
          "enableEmbeddings": true,
          "mediaDownload": true,
          "allowFrom": ["+972XXXXXXXXX"]  // phone numbers allowed to use search tools
        }
      }
    }
  }
}

README

# openclaw-wa-archive

OpenClaw plugin that archives **all** WhatsApp messages (inbound + outbound) into a SQLite database with full-text search (FTS5) and semantic/vector search (sqlite-vec + OpenAI embeddings).

## Features

- ๐Ÿ“ฅ **Auto-archive** โ€” hooks into WhatsApp message events, stores every message automatically
- ๐Ÿ” **Full-text search** โ€” FTS5 for fast keyword search across all conversations
- ๐Ÿง  **Semantic search** โ€” OpenAI embeddings + sqlite-vec for meaning-based search
- ๐Ÿ”€ **Hybrid search** โ€” combines FTS + semantic for best results (default mode)
- ๐Ÿ“Š **Stats tool** โ€” message counts, per-chat/sender breakdowns, busiest hours
- ๐Ÿ“Ž **Media download** โ€” saves images, videos, audio, documents locally
- ๐Ÿ”„ **Backfill** โ€” import existing JSONL session transcripts from OpenClaw history
- ๐Ÿ” **Access control** โ€” restrict search tools to specific phone numbers

## Installation

### 1. Clone the repo

```bash
git clone https://github.com/regevguym/openclaw-wa-archive.git
cd openclaw-wa-archive
npm install
npm run build
```

### 2. Configure OpenClaw

Add to your `openclaw.json`:

```jsonc
{
  "plugins": {
    "load": {
      "paths": [
        "/path/to/openclaw-wa-archive"  // absolute path to the cloned repo
      ]
    },
    "entries": {
      "wa-archive": {
        "enabled": true,
        "config": {
          "dataDir": "~/.openclaw/data/wa-archive",
          "enableEmbeddings": true,
          "mediaDownload": true,
          "allowFrom": ["+972XXXXXXXXX"]  // phone numbers allowed to use search tools
        }
      }
    }
  }
}
```

### 3. Restart OpenClaw

```bash
openclaw gateway restart
```

The plugin will create the database and start archiving messages immediately.

## Configuration

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `dataDir` | string | `~/.openclaw/data/wa-archive` | Directory for SQLite DB and media files |
| `enableEmbeddings` | boolean | `true` | Enable semantic search via OpenAI embeddings |
| `embeddingModel` | string | `text-embedding-3-small` | OpenAI embedding model (1536 dimensions) |
| `mediaDownload` | boolean | `true` | Download and store media files locally |
| `allowFrom` | string[] | `[]` | Phone numbers allowed to use wa_search and wa_stats tools |

### Environment Variables

- `OPENAI_API_KEY` โ€” Required for embeddings. The plugin reads this from the environment.

## Tools

### `wa_search`

Search the WhatsApp message archive. Supports full-text, semantic, or hybrid search.

**Parameters:**
- `query` (string, required) โ€” Search query (natural language or keywords)
- `sender` (string) โ€” Filter by sender name or phone number
- `chat` (string) โ€” Filter by chat/group name or ID
- `chat_type` (string) โ€” `"group"` or `"direct"`
- `from_date` (string) โ€” Start date (ISO 8601 or relative: `"2 days ago"`, `"yesterday"`, `"last week"`)
- `to_date` (string) โ€” End date
- `mode` (string) โ€” `"fts"` (full-text), `"semantic"` (vector), `"hybrid"` (both, default)
- `limit` (number) โ€” Max results (default: 20)

### `wa_stats`

Get statistics about the WhatsApp message archive.

**Parameters:**
- `period` (string) โ€” `"today"`, `"week"`, `"month"`, `"all"` (default: `"week"`)
- `chat` (string) โ€” Filter by specific chat name or ID
- `sender` (string) โ€” Filter by specific sender

**Returns:** Total messages, per-chat breakdown, per-sender breakdown, busiest hours, inbound/outbound ratio.

## Backfill โ€” Importing Historical Messages

The plugin can import existing WhatsApp messages from OpenClaw's JSONL session transcripts.

### Automatic (via OpenClaw command)

```
/wa-backfill
```

### How backfill works

1. Scans `~/.openclaw/agents/main/sessions/*.jsonl`
2. Reads `sessions.json` to map session IDs โ†’ session keys (which contain chat metadata)
3. Parses each JSONL file line by line:
   - **User messages** (role: `user`) โ†’ stored as inbound
   - **Assistant messages** with `message` tool calls (action: `send`) โ†’ stored as outbound
   - Direct assistant text replies โ†’ stored as outbound
   - `NO_REPLY` and `HEARTBEAT_OK` messages are skipped
4. Session keys encode chat info: `agent:main:whatsapp:group:[email protected]` or `agent:main:whatsapp:direct:+972...`
5. Inserts are idempotent (`INSERT OR IGNORE`) โ€” safe to run multiple times
6. After import, queues embedding generation for all new messages

### Batch Embedding Script

If you need to (re-)embed all messages (e.g., after backfill or if embeddings were disabled):

```bash
OPENAI_API_KEY=sk-... node scripts/batch-embed.js
```

This script:
- Finds all messages without embeddings
- Processes in batches of 50
- Rate limits with 1s delay between batches
- Handles 429 rate limits with exponential backoff
- Strips metadata envelope from content for cleaner embeddings
- Reports progress as it goes

**Cost:** ~$0.01 for 2,500 messages with `text-embedding-3-small`.

## Database

SQLite database at `{dataDir}/messages.db` with:

- **`messages`** table โ€” all message data (id, chat, sender, content, timestamps, media paths, etc.)
- **`messages_fts`** โ€” FTS5 virtual table for full-text search (auto-synced via triggers)
- **`messages_vec`** โ€” sqlite-vec virtual table for vector similarity search (1536-dim float embeddings)

### Schema highlights

```sql
CREATE TABLE messages (
  id TEXT PRIMARY KEY,
  session_key TEXT,
  chat_id TEXT NOT NULL,
  chat_type TEXT NOT NULL,        -- 'group' or 'direct'
  chat_name TEXT,
  sender_id TEXT,
  sender_name TEXT,
  timestamp INTEGER NOT NULL,     -- Unix ms
  content TEXT,
  media_local_path TEXT,
  media_type TEXT,
  is_from_me INTEGER DEFAULT 0,
  direction TEXT DEFAULT 'inbound',
  channel TEXT DEFAULT 'whatsapp',
  account_id TEXT,
  metadata TEXT                   -- JSON blob
);
```

SQLite is in WAL mode for concurrent read performance.

## Architecture

```
openclaw-wa-archive/
โ”œโ”€โ”€ openclaw.plugin.json    # Plugin manifest
โ”œโ”€โ”€ package.json
โ”œโ”€โ”€ tsconfig.json
โ”œโ”€โ”€ scripts/
โ”‚   โ””โ”€โ”€ batch-embed.js      # Standalone batch embedding script
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ index.ts             # Plugin entry โ€” register() hooks, tools, commands
โ”‚   โ”œโ”€โ”€ db.ts                # SQLite setup, migrations, helpers
โ”‚   โ”œโ”€โ”€ ingest.ts            # Message event โ†’ DB row
โ”‚   โ”œโ”€โ”€ embeddings.ts        # OpenAI embedding generation (async, batched)
โ”‚   โ”œโ”€โ”€ media.ts             # Media download + local storage
โ”‚   โ”œโ”€โ”€ backfill.ts          # JSONL transcript โ†’ DB import
โ”‚   โ””โ”€โ”€ tools/
โ”‚       โ”œโ”€โ”€ wa-search.ts     # wa_search tool
โ”‚       โ””โ”€โ”€ wa-stats.ts      # wa_stats tool
โ””โ”€โ”€ data/                    # Created at runtime (gitignored)
    โ”œโ”€โ”€ messages.db          # SQLite database
    โ””โ”€โ”€ media/               # Downloaded media files
```

## Important Notes

- **Graceful degradation** โ€” If sqlite-vec fails to load, falls back to FTS-only search. If OpenAI API is unavailable, skips embeddings but still stores messages.
- **Non-blocking** โ€” Embedding generation and media downloads happen asynchronously, never block message processing.
- **Idempotent** โ€” All inserts use `INSERT OR IGNORE`. Safe to re-run backfill.
- **Performance** โ€” WAL mode, batched inserts, indexed on chat_id, sender_id, timestamp.
- **Access control** โ€” Tools check sender against `allowFrom` config. Only listed phone numbers can search.

## Troubleshooting

### Embeddings not working
- Check `OPENAI_API_KEY` is set in environment
- Verify with: `echo $OPENAI_API_KEY`
- Run `node scripts/batch-embed.js` manually to see errors

### sqlite-vec not loading
- Ensure `npm install` completed (installs `sqlite-vec` and `better-sqlite3`)
- On some platforms, `better-sqlite3` needs native compilation: `npm rebuild better-sqlite3`

### No messages appearing
- Check OpenClaw logs for `[wa-archive] Plugin loaded successfully`
- Verify the plugin path in `openclaw.json` is correct
- Ensure `wa-archive` is in `plugins.entries` with `enabled: true`

## License

MIT
channels

Comments

Sign in to leave a comment

Loading comments...