
Identity Plane

By intrinsec-ai

Identity Plane for OpenClaw agents that surfaces off-identity behavior and compacts identity over time.

Homepage GitHub

Install

npm install @intrinsec-ai/openclaw-identity-plane

Configuration Example

{
  "plugins": {
    "load": { "paths": ["~/.openclaw/extensions/identity"] },
    "entries": { "identity": { "enabled": true } }
  }
}

README



# 🧠 `@intrinsec-ai/openclaw-identity-plane`: Identity Plane Plugin

**An identity watchdog for your AI agents.**  
Track whether your agent stays true to its declared identity, across every message, not just the first.

[Plugin](https://docs.openclaw.ai/)
[Node](https://nodejs.org/)
[License](./LICENSE)



---

> **Demo notice.** This is a lightweight reference implementation, intentionally observational and read-only. It logs, visualises, and annotates drift but does not block or intercept anything. Features like active intervention, blocking writes, or fleet correlation can be added based on community feedback.

---

## The core insight

> *Every token entering your agent's context window is a vote on its identity.*  
> *This plugin watches for votes that shouldn't be there.*

Prompt injection, jailbreaks, and malicious skill installations (like [ClawHavoc](https://www.koi.ai/blog/clawhavoc-341-malicious-clawedbot-skills-found-by-the-bot-they-were-targeting)) all work the same way: they deflect the agent's behavioural trajectory away from its declared identity. The identity plane makes that deflection visible.

---

## What it does


|                 |                                                                                                                                                              |
| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| 🔍 **Detects**  | Prompt injection · jailbreaks · off-topic drift · malicious skill installs                                                                                   |
| ⚙️ **How**      | Scores each message against your agent's compliance model; off-identity signals accumulate in bounded memory without cancelling each other out               |
| 📊 **Signals**  | `maxDrift` · `avgDrift` · drift themes · **cognitive file change annotations** (so you know whether a spike came *before* or *after* a file changed on disk) |
| ☁️ **Stack**    | **OpenAI-only:** `text-embedding-3-small` on every inbound message, plus the same LLM you configure for calibration, compaction, and novel-theme labels      |
| 👁️ **Posture** | Read-only and observational. Nothing is blocked.                                                                                                             |


---

## How it works

### 1 · Calibration *(once, then cached)*

When you first run an agent (or run `/identity recalibrate`):

1. **Reads your cognitive files**: `SOUL.md`, `AGENTS.md`, `IDENTITY.md`, `USER.md`, `TOOLS.md`
2. **Extracts identity anchors**: the LLM distills the raw files into 5–10 named, categorised anchors (`purpose`, `value`, `boundary`, `persona`, `constraint`). These are specific, traceable aspects of identity (e.g. *[BOUNDARY] No credential access: The agent must never read or transmit API keys or secrets.*) rather than a raw policy dump.
3. **Generates labelled examples grounded in the anchors**: the same LLM produces `max(20, 3 × numAnchors)` compliant and violation sentences. The count scales with identity complexity (e.g. 8 anchors → 24 examples each) so every anchor is represented multiple times, giving the manifold better covariance coverage. The prompt asks for at least one compliant and one violation sentence per anchor.
4. **Embeds both sets** with `text-embedding-3-small` (the same model as at runtime, which keeps the geometry consistent)
5. **Fits a Gaussian** to the compliant cluster: mean and covariance via PCA, producing the **identity manifold**
6. **Sets the surprise threshold**: the 90th-percentile Mahalanobis distance of the compliant examples. With ≥20 examples the empirical distribution is stable, but the 95th percentile tends to over-fit to compliant outliers; the 90th is a tighter, more reliable gate.
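The sizing and threshold rules in steps 3 and 6 can be sketched as follows. The function names are illustrative, not the plugin's actual API, and the Mahalanobis distances of the compliant examples are assumed to be precomputed.

```typescript
// Step 3: number of labelled examples per class scales with identity
// complexity, floored at 20. (Illustrative name, not the plugin's API.)
function exampleCount(numAnchors: number): number {
  return Math.max(20, 3 * numAnchors);
}

// Step 6: the surprise threshold is the 90th-percentile Mahalanobis
// distance of the compliant examples (distances assumed precomputed).
function surpriseThreshold(compliantDistances: number[], percentile = 0.9): number {
  const sorted = [...compliantDistances].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor(percentile * sorted.length));
  return sorted[idx];
}

console.log(exampleCount(8)); // 24 examples per class for 8 anchors
```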

The identity anchors are the canonical representation of the agent's identity from this point on. Every downstream LLM call (compaction audits, novel-theme labels) references the named anchors rather than the raw file contents.

Everything is cached to `.openclaw/identity-cache.json`. Nothing is recomputed until you explicitly recalibrate.

> **Upgrading from an older build:** run `/identity recalibrate`. The cache schema now includes identity anchors, and old caches without them will trigger automatic re-calibration.

---

### 2 · Runtime *(every inbound message)*

Every piece of text entering the LLM (user messages, tool results, subagent replies) flows through:

```
  message ──► embed (OpenAI) ──► surprise = ½·Mahalanobis²(x, compliance)
                                        │
                          ┌─────────────▼─────────────┐
                          │  weight = max(0, s − τ)   │
                          └─────────────┬─────────────┘
                                        │
              ┌─────────────────────────▼────────────────────────┐
              │  weight > 0?                                     │
              │   → ingest into coreset                          │
              │   → novelty check → new mode? → LLM theme label  │
              │                                                  │
              │  weight = 0? → skip (identity-neutral)           │
              └─────────────────────────┬────────────────────────┘
                                        ▼
              drift scores from coreset ──► maxDrift / avgDrift ──► level
```

**Drift levels:** `nominal` · `warning` · `alert`

Without `OPENAI_API_KEY`, embedding fails and the hook no-ops with a logged error. Novel-theme labeling falls back to **geometric** labels if the key is missing at runtime; compaction is skipped without a key.
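A minimal sketch of the scoring path above: the ½·d² surprise and the max(0, s − τ) gate come from the diagram, while the numeric cut-offs for the drift levels are illustrative assumptions, not the plugin's actual defaults.

```typescript
type DriftLevel = "nominal" | "warning" | "alert";

// Surprise is half the squared Mahalanobis distance from the compliance manifold.
function surprise(mahalanobisDistance: number): number {
  return 0.5 * mahalanobisDistance ** 2;
}

// Only surprise above the calibrated threshold τ carries weight;
// weight 0 means the message is identity-neutral and is skipped.
function surpriseWeight(s: number, tau: number): number {
  return Math.max(0, s - tau);
}

// Hypothetical mapping from aggregate drift to a level (cut-offs assumed).
function driftLevel(maxDrift: number, warn = 1, alert = 3): DriftLevel {
  if (maxDrift >= alert) return "alert";
  if (maxDrift >= warn) return "warning";
  return "nominal";
}
```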

---

### 3 · Identity compaction *(LLM audit)*

The coreset records *where* drift lives geometrically. **Identity compaction** turns that into a human-readable story grounded in the named identity anchors.

Every `compactionInterval` off-identity messages (after ingestion, i.e. `surpriseWeight > 0`), when `OPENAI_API_KEY` is set:

1. **Clusters** coreset points via connected-component BFS
2. **Describes** each cluster using the actual stored message texts
3. **Asks the LLM** for theme labels, severities, and a 1–2 sentence narrative, with the identity anchors list in the prompt so every theme and the narrative explicitly reference named aspects of the declared identity
4. **Replaces** the theme list with the consolidated analysis

The compaction is **non-destructive**: the coreset (the geometric ground truth) is never rewritten by the LLM.

**Novel drift modes** get a one-shot LLM label (triggering message + nearest calibration violations) whenever the key is present; otherwise the geometric fallback is used.
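Step 1 of the audit, connected-component clustering via BFS, can be sketched as follows; the `eps` linking radius and the plain-array point representation are assumptions for illustration.

```typescript
// Group points into connected components: two points are linked when their
// Euclidean distance is at most `eps`, and a cluster is everything reachable
// through such links. Returns index lists, one per cluster.
function clusterByBFS(points: number[][], eps: number): number[][] {
  const dist = (a: number[], b: number[]) =>
    Math.hypot(...a.map((v, i) => v - b[i]));
  const visited = new Array(points.length).fill(false);
  const clusters: number[][] = [];
  for (let i = 0; i < points.length; i++) {
    if (visited[i]) continue;
    const component: number[] = [];
    const queue = [i];
    visited[i] = true;
    while (queue.length > 0) {
      const cur = queue.shift()!;
      component.push(cur);
      for (let j = 0; j < points.length; j++) {
        if (!visited[j] && dist(points[cur], points[j]) <= eps) {
          visited[j] = true;
          queue.push(j);
        }
      }
    }
    clusters.push(component); // indices of coreset points in one drift mode
  }
  return clusters;
}
```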

---

### 4 · The coreset: why signals don't cancel

Unlike a running mean, the **bounded weighted coreset** (128 points by default) keeps distinct drift directions separate:

- Opposing drifts cannot cancel each other out
- Old signals aren't lost to exponential forgetting
- When full, the *closest pair* merges, preserving distinct drift modes while bounding memory

Each coreset point carries the **original message texts** that contributed to it, so drift themes and audit narratives are grounded in real behaviour, not just geometry.
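A sketch of the closest-pair merge, assuming a simple point shape. The field names and capacity default mirror the description above, not the actual source: when the coreset overflows, the two nearest points collapse into their weighted mean, with summed weight and concatenated message texts.

```typescript
interface CorePoint { vec: number[]; weight: number; texts: string[]; }

function ingest(coreset: CorePoint[], p: CorePoint, capacity = 128): CorePoint[] {
  const out = [...coreset, p];
  if (out.length <= capacity) return out;
  // Find the closest pair; merging it loses the least geometric information,
  // so well-separated drift modes stay distinct.
  let bi = 0, bj = 1, best = Infinity;
  for (let i = 0; i < out.length; i++)
    for (let j = i + 1; j < out.length; j++) {
      const d = Math.hypot(...out[i].vec.map((v, k) => v - out[j].vec[k]));
      if (d < best) { best = d; bi = i; bj = j; }
    }
  const a = out[bi], b = out[bj], w = a.weight + b.weight;
  const merged: CorePoint = {
    vec: a.vec.map((v, k) => (v * a.weight + b.vec[k] * b.weight) / w),
    weight: w,
    texts: [...a.texts, ...b.texts], // keep the grounding messages
  };
  return [...out.filter((_, k) => k !== bi && k !== bj), merged];
}
```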

---

### 5 · Why distance-from-compliance works

When an agent has a well-defined goal (*"help with code"*, *"answer HR questions"*), its compliant responses form a **tight cluster** in embedding space. Violations live in the open-ended complement.

```
     Embedding space (conceptual 2D projection)
     ┌──────────────────────────────────────────┐
     │  ✗          ✗                 ✗          │
     │       ✗          ┌─────────┐      ✗      │
     │  ✗               │ ● ● ● ● │             │
     │                  │ ● ● ● ● │  ✗          │
     │  ✗               └────┬────┘             │
     │                       │                  │
     │   violations          │  compliance      │
     │   (open-ended)        │  (tight cluster) │
     └──────────────────────────────────────────┘
```

We never need to enumerate violations. We only ask: *how far is this message from the known compliance region?* That is a one-class classification problem with a clean solution: **Mahalanobis distance**.
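The one-class score can be sketched as follows; for brevity this assumes a diagonal covariance, whereas the plugin fits the full covariance via PCA.

```typescript
// Mahalanobis distance of an embedding from a Gaussian with the given mean
// and per-dimension variances: distance in units of standard deviation,
// so a large value means "far from the known compliance region".
function mahalanobis(x: number[], mean: number[], variances: number[]): number {
  let sum = 0;
  for (let i = 0; i < x.length; i++) {
    const d = x[i] - mean[i];
    sum += (d * d) / variances[i];
  }
  return Math.sqrt(sum);
}

mahalanobis([1, 2], [1, 2], [1, 1]); // 0: a point on the mean
mahalanobis([3, 2], [1, 2], [4, 1]); // 1: two raw units = one std dev
```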

---


## Installation

### Prerequisites

- [OpenClaw](https://docs.openclaw.ai/) installed and running
- Node.js 22.16+ (Node 24 recommended)
- `OPENAI_API_KEY`, for embeddings and LLM calls (calibration is cached; embeddings and audit calls are ongoing)

---

### Option A โ€” Install from npm *(recommended)*

```bash
npm install @intrinsec-ai/openclaw-identity-plane
npx openclaw plugins install -l node_modules/@intrinsec-ai/openclaw-identity-plane
```

### Option B โ€” Clone directly

```bash
git clone https://github.com/intrinsec-ai/openclaw-identity-plane \
  ~/.openclaw/extensions/identity
cd ~/.openclaw/extensions/identity
npm install
```

### Option C โ€” Link a local checkout *(for development)*

```bash
cd /path/to/openclaw-identity-plane
npm install
openclaw plugins install --link /path/to/openclaw-identity-plane
```

---

### Enable the plugin

```bash
openclaw plugins enable identity
```

Or manually in `~/.openclaw/openclaw.json`:

```json
{
  "plugins": {
    "load": { "paths": ["~/.openclaw/extensions/identity"] },
    "entries": { "identity": { "enabled": true } }
  }
}
```

Remove deprecated config keys if present: `embedModel`, `agenticAudit` (no longer used).

---

### Set your API key

```bash
export OPENAI_API_KEY=sk-...
```

Optional: route both chat and embeddings through a compatible gateway:

```bash
export OPENAI_BASE_URL=http://localhost:11434/v1
```

Set `calibrationModel` to a model your endpoint serves. The embedding client uses the **same** `OPENAI_BASE_URL`; the server must implement an OpenAI-style **`POST /v1/embeddings`** for `text-embedding-3-small` (or map that model id to a local embedder), or calibration and runtime embedding will fail.
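The embedding traffic behind that requirement follows the standard OpenAI embeddings contract. A sketch of the request shape (the helper name is hypothetical; the path and JSON body follow the `/v1/embeddings` API):

```typescript
// Build an OpenAI-style embeddings request for a given base URL.
// Only constructs the request; sending it is left to the caller.
function buildEmbeddingRequest(baseUrl: string, apiKey: string, text: string) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/embeddings`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      // Same model id at calibration and runtime keeps the geometry consistent.
      body: JSON.stringify({ model: "text-embedding-3-small", input: text }),
    },
  };
}

const req = buildEmbeddingRequest("http://localhost:11434/v1", "sk-local", "hello");
// req.url === "http://localhost:11434/v1/embeddings"
```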

... (truncated)