# openclaw-memory-bench
CLI-first benchmark toolkit for **OpenClaw memory-layer plugins**.
It provides a reproducible, non-interactive evaluation workflow for memory plugins such as:
- `openclaw-mem`
- `memory-core` (OpenClaw built-in memory)
- `memory-lancedb`
- `memu-engine-for-OpenClaw`
- future OpenClaw memory-layer plugins
## Why this exists
Public SaaS memory providers already have benchmark visibility. The OpenClaw community needs a neutral, plugin-focused benchmark harness with:
- deterministic retrieval metrics
- optional end-to-end answer/judge metrics
- reproducible run manifests
- machine-readable outputs for CI and community comparison
## Scope (v0.1)
- **Track A: Retrieval benchmark (deterministic)**
  - Hit@K, Recall@K, Precision@K, MRR, nDCG (see the formulas after this list)
  - latency (p50/p95)
- **Track B: End-to-end benchmark (optional)**
  - retrieval → answer model → judge model
  - accuracy + cost/latency metadata
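For reference, the rank-sensitive Track A metrics follow their standard definitions, where $\mathrm{rank}_i$ is the position of the first relevant result for question $i$ and $rel_p$ is the relevance of the result at position $p$:

```math
\mathrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathrm{rank}_i}
\qquad
\mathrm{nDCG@}K = \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K},\;\;
\mathrm{DCG@}K = \sum_{p=1}^{K}\frac{rel_p}{\log_2(p+1)}
```

Hit@K is the fraction of questions with at least one relevant result in the top K. How unretrieved relevant items are scored (e.g. zero contribution to MRR) is an implementation detail of the harness.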
## Principles
1. **CLI and non-interactive by default**
2. **Reproducibility first** (versioned manifests, pinned config)
3. **Plugin neutrality** (same protocol for all adapters)
4. **Transparent reporting** (JSON outputs + docs)
## Quickstart (retrieval track MVP)
```bash
uv sync
uv run openclaw-memory-bench doctor
# Generate manifest
uv run openclaw-memory-bench plan --provider openclaw-mem --benchmark locomo
# Run deterministic retrieval benchmark using example dataset
uv run openclaw-memory-bench run-retrieval \
  --provider openclaw-mem \
  --dataset examples/mini_retrieval.json \
  --top-k 5

# Run against an isolated memory-core profile (no main-system pollution)
uv run openclaw-memory-bench run-retrieval \
  --provider memory-core \
  --dataset examples/mini_retrieval.json \
  --top-k 5 \
  --memory-core-profile membench-memory-core

# Prepare a canonical dataset (download + convert)
uv run openclaw-memory-bench prepare-dataset \
  --benchmark longmemeval \
  --limit 50 \
  --out data/datasets/longmemeval-50.json
# Writes dataset + sidecar metadata:
# - data/datasets/longmemeval-50.json
# - data/datasets/longmemeval-50.json.meta.json
```
The run command writes a JSON report under `artifacts/<run-id>/retrieval-report.json` by default.
Each report embeds a reproducibility manifest (`report.manifest`) containing the toolkit version, git commit, dataset hash/metadata, sanitized provider config, and runtime metadata.
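To eyeball the manifest of a finished run, something like the following works (assuming `jq` is installed; substitute the actual run id):

```bash
# Print the embedded reproducibility manifest from a retrieval report
jq '.manifest' artifacts/<run-id>/retrieval-report.json
```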
For dataset schema, see `docs/dataset-format.md`.
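As a purely hypothetical illustration (the field names here are invented; `docs/dataset-format.md` is the authoritative schema), a minimal retrieval dataset pairs documents with questions and their relevant ids:

```bash
# Hypothetical mini dataset -- field names are illustrative only,
# not the real schema from docs/dataset-format.md
cat > /tmp/demo_retrieval.json <<'EOF'
{
  "documents": [
    {"id": "d1", "text": "Alice moved to Berlin in March 2021."}
  ],
  "questions": [
    {"id": "q1", "query": "Where did Alice move?", "relevant_ids": ["d1"]}
  ]
}
EOF
```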
## Preliminary results snapshot
See `PRELIMINARY_RESULTS.md` for the early comparison artifacts available so far and their caveats.
## Provider roles and interpretation guardrails (important)
To avoid misleading comparisons, benchmark providers should be read as follows:
- `openclaw-mem` provider in this repo = **standalone sidecar engine run** (`openclaw-mem` CLI ingest/search on benchmark-managed SQLite files).
  - It is **not automatically combined** with `memory-core` or `memory-lancedb` in current leaderboard numbers.
- `memory-core` provider = OpenClaw built-in backend (`openclaw memory index/search`) under an isolated profile.
- `memory-lancedb` provider = canonical memory tool path (`memory_store` / `memory_recall` / `memory_forget`) via Gateway invoke.
Current reports are primarily **independent-provider comparisons**. A full **combination matrix** (e.g., sidecar + backend pairings) is tracked as follow-up work.
### memory-lancedb (canonical memory tools)
```bash
uv run openclaw-memory-bench run-retrieval \
  --provider memory-lancedb \
  --dataset data/datasets/longmemeval-50.json \
  --top-k 10 \
  --session-key main
```
> This adapter uses `memory_store` + `memory_recall` + `memory_forget` via Gateway invoke.
### memu-engine (gateway mode)
```bash
uv run openclaw-memory-bench run-retrieval \
  --provider memu-engine \
  --dataset data/datasets/longmemeval-50.json \
  --top-k 10 \
  --skip-ingest \
  --gateway-url http://127.0.0.1:18789
```
> For `memu-engine`, default ingest mode is `noop` (pre-ingested search). Use `--memu-ingest-mode memory_store` only if your memory slot exposes `memory_store`.
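If your memory slot does expose `memory_store`, the ingest-enabled variant would presumably look like this (same dataset and gateway as above):

```bash
uv run openclaw-memory-bench run-retrieval \
  --provider memu-engine \
  --dataset data/datasets/longmemeval-50.json \
  --top-k 10 \
  --memu-ingest-mode memory_store \
  --gateway-url http://127.0.0.1:18789
```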
### Useful flags for `run-retrieval`
- `--limit N`: run only the first N questions
- `--sample-size N --sample-seed S`: deterministic seeded subset sampling
- `--fail-fast`: stop on the first question failure
- `--db-root <dir>`: per-container SQLite storage root for `openclaw-mem`
- `--openclaw-mem-cmd ...`: override the adapter command base when needed
- `--memory-core-profile <name>`: isolated OpenClaw profile for `memory-core`
- `--skip-ingest`: run search-only against existing memory state
- `--preindex-once`: ingest/index the selected dataset once, then run per-question search
- `--gateway-url` / `--gateway-token`: for gateway-backed providers (`memu-engine`, `memory-lancedb`)
- `--memu-ingest-mode noop|memory_store`: memu ingestion strategy
- `--lancedb-recall-limit-factor N`: candidate pool multiplier before container filtering
- `--memory-core-index-retries N` + `--memory-core-timeout-sec N`: timeout resilience
- `--memory-core-max-messages-per-session`, `--memory-core-max-message-chars`, `--memory-core-max-chars-per-session`: long-session ingest stabilization
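For example, a deterministic 20-question pilot that stops on the first failure could combine several of these flags:

```bash
uv run openclaw-memory-bench run-retrieval \
  --provider memory-core \
  --dataset data/datasets/longmemeval-50.json \
  --top-k 10 \
  --sample-size 20 --sample-seed 7 \
  --fail-fast \
  --memory-core-profile membench-memory-core
```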
## One-shot two-plugin baseline runner (Phase A)
```bash
scripts/run_two_plugin_baseline.sh \
  --profile configs/run-profiles/two-plugin-baseline.json
```
This orchestrator emits reports for both providers plus merged comparison artifacts under `artifacts/full-benchmark/<run-group>/`.
See `docs/PHASE_A_EXECUTION.md` for fallback behavior and fast pilot mode.
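The profile file pins providers, dataset, and run parameters in one place. As a rough sketch only (the key names below are hypothetical; the shipped `configs/run-profiles/two-plugin-baseline.json` is the real reference):

```bash
# Hypothetical custom profile -- key names are illustrative only
cat > configs/run-profiles/my-pilot.json <<'EOF'
{
  "providers": ["memory-core", "openclaw-mem"],
  "dataset": "examples/mini_retrieval.json",
  "top_k": 5
}
EOF
scripts/run_two_plugin_baseline.sh --profile configs/run-profiles/my-pilot.json
```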
## One-shot sidecar pilot (memory-core vs openclaw-mem)
```bash
scripts/run_memory_core_vs_openclaw_mem.sh \
  --dataset examples/dual_language_mini.json \
  --top-k 5
```
Artifacts are written under `artifacts/sidecar-compare/<run-group>/`.
This path is isolated from the main OpenClaw system via an independent memory-core profile (`membench-memory-core`) and per-run `openclaw-mem` SQLite roots.
## One-shot comprehensive triplet (memory-core, memory-lancedb, openclaw-mem)
```bash
scripts/run_memory_triplet_comprehensive.sh \
  --benchmark longmemeval \
  --dataset-limit 100 \
  --question-limit 100 \
  --top-k 10
```
Artifacts are written under `artifacts/comprehensive-triplet/<run-group>/`.
## Current implementation status
- `openclaw-mem`: retrieval-track adapter implemented (MVP, CLI-driven)
- `memory-core`: retrieval-track adapter implemented (isolated `--profile` mode)
- `memory-lancedb`: gateway-backed adapter implemented for canonical memory tools (`memory_store`/`memory_recall`/`memory_forget`)
- `memu-engine`: gateway-backed adapter implemented for `memory_search` (ingest modes: `noop` / `memory_store`)
- Canonical dataset conversion command available (`prepare-dataset`)
## Project plan and TODOs
- `docs/PROJECT_PLAN_AND_TODOS.md`
- `docs/FULL_BENCHMARK_PLAN.md` (two-plugin full report execution plan)
- `docs/PHASE_A_EXECUTION.md` (locked profile + one-shot baseline runner)
## Project layout
- `src/openclaw_memory_bench/cli.py` — main CLI
- `src/openclaw_memory_bench/protocol.py` — provider adapter protocol
- `src/openclaw_memory_bench/adapters/` — plugin adapters
- `docs/decisions/` — architecture decisions
- `docs/devlog/` — implementation progress logs
## License
MIT License. See `LICENSE`.
## Acknowledgements
This toolkit is inspired by the benchmark design ideas from:
- <https://github.com/supermemoryai/memorybench> (MIT)
When specific code-level adaptations are introduced, they will be explicitly documented in `ACKNOWLEDGEMENTS.md` with file-level references.