---
name: agent-takeover
description: How to perform a live agent takeover of the Clawfinger voice gateway — dial, inject greetings, handle turns, release, and observe handback. Covers timing, endpoints, the WebSocket protocol, and includes a human-guided test case.
metadata:
  openclaw:
    emoji: "\U0001F3AF"
    skillKey: agent-takeover
    requires:
      - plugin:clawfinger
---
# Agent Takeover — Full Lifecycle Guide
How an external agent (OpenClaw plugin, custom script, or any WebSocket client) takes control of a live phone call, handles conversation turns directly, and hands back to the local LLM.
## Architecture Overview
```
Caller <--> Phone App <--> Gateway /api/turn <--> Local LLM
                              |
                              +-- (takeover) --> Agent WS
```
Normal flow: phone sends audio to `/api/turn`, gateway runs ASR → LLM → TTS, returns audio.
Takeover flow: after `takeover`, gateway sends `turn.request` to the agent WebSocket instead of calling the local LLM. The agent replies with text, gateway runs TTS, returns audio to phone.
## Endpoints Used
### WebSocket (primary — full bidirectional control)
**`WS /api/agent/ws`**: no authentication is required on the WebSocket itself. The agent connects, receives all bus events, and sends commands over the same connection.
| Send (agent → gateway) | Fields | Description |
|-------------------------|--------|-------------|
| `dial` | `number` | Dial outbound call via ADB |
| `inject` | `text`, `session_id` | Queue TTS message for next turn poll |
| `takeover` | `session_id` | Take over LLM for this session |
| `release` | `session_id` | Hand back to local LLM |
| `hangup` | `session_id` (optional) | Force hang up call + end session |
| `get_call_state` | `session_id` | Query conversation history and state |
| `end_session` | `session_id` | Mark session ended without phone hangup |
| `inject_context` | `session_id`, `context` | Push knowledge into LLM context |
| `clear_context` | `session_id` | Remove injected knowledge |
| `ping` | — | Heartbeat |

| Receive (gateway → agent) | Fields | Description |
|----------------------------|--------|-------------|
| `dial.ack` | `ok`, `detail` | Dial result |
| `takeover.ack` | `ok`, `session_id` | Takeover confirmed |
| `release.ack` | `ok`, `session_id` | Release confirmed |
| `hangup.ack` | `ok`, `detail`, `session_id` | Hangup result |
| `turn.request` | `session_id`, `transcript`, `request_id` | **Takeover only** — caller spoke, agent must reply |
| `turn.started` | `session_id` | Turn processing began |
| `turn.transcript` | `transcript` | ASR result |
| `turn.reply` | `reply` | LLM/agent reply text |
| `turn.complete` | `metrics`, `transcript`, `reply`, `model` | Turn finished |
| `session.ended` | `session_id` | Session ended (stale sweep, hangup, or explicit end) |
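
The "Send" table above can be turned into a small command builder. This is a minimal sketch, assuming every command is a JSON object with a `type` field plus the listed fields (the shape used by the `takeover` and `dial` examples elsewhere in this guide). The helper name `command` and the `REQUIRED_FIELDS` mapping are hypothetical and not part of the gateway.

```python
import json

# Required fields per command, copied from the "Send" table above.
# `hangup` lists session_id as optional, so nothing is required for it here.
REQUIRED_FIELDS = {
    "dial": ("number",),
    "inject": ("text", "session_id"),
    "takeover": ("session_id",),
    "release": ("session_id",),
    "hangup": (),
    "get_call_state": ("session_id",),
    "end_session": ("session_id",),
    "inject_context": ("session_id", "context"),
    "clear_context": ("session_id",),
    "ping": (),
}


def command(msg_type: str, **fields: str) -> str:
    """Serialize one agent -> gateway command as a JSON text frame."""
    if msg_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown command type: {msg_type}")
    missing = [f for f in REQUIRED_FIELDS[msg_type] if f not in fields]
    if missing:
        raise ValueError(f"{msg_type} is missing fields: {missing}")
    return json.dumps({"type": msg_type, **fields})
```

Validating on the agent side catches a forgotten `session_id` before the frame is sent, rather than waiting for a failed ack.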
### REST (alternative — no persistent connection needed)
| Method | Path | Purpose |
|--------|------|---------|
| `POST` | `/api/call/dial` | `{"number": "+49..."}` — dial via ADB |
| `POST` | `/api/call/hangup` | `{"session_id": "..."}` — force hangup |
| `POST` | `/api/call/inject` | `{"text": "...", "session_id": "..."}` — inject TTS |
| `GET` | `/api/agent/sessions` | List active session IDs |
| `GET` | `/api/agent/call/{sid}` | Full call state (history, instructions, takeover) |
| `POST` | `/api/agent/context/{sid}` | `{"context": "..."}` — inject knowledge |
**REST cannot do takeover.** Takeover requires the WebSocket for real-time `turn.request` / reply exchange. REST is fine for dial, inject, hangup, and observation.
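
For the REST-only operations, the requests can be built with the standard library alone. The sketch below makes two assumptions that this guide does not confirm: that the REST API is served on the same host and port as the WebSocket URL shown in the lifecycle section (`gateway:8996`), and that the gateway expects a `Content-Type: application/json` header. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: REST shares the host/port of the WebSocket endpoint.
BASE_URL = "http://gateway:8996"


def post_json(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the gateway."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def dial(number: str) -> urllib.request.Request:
    return post_json("/api/call/dial", {"number": number})


def inject(session_id: str, text: str) -> urllib.request.Request:
    return post_json("/api/call/inject", {"text": text, "session_id": session_id})


def hangup(session_id: str) -> urllib.request.Request:
    return post_json("/api/call/hangup", {"session_id": session_id})


# To actually send one: urllib.request.urlopen(hangup("abc123"), timeout=10)
```

Building the request separately from sending it keeps the payload shapes easy to inspect and test without a running gateway.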
## Takeover Turn Protocol
During takeover, the gateway replaces the local LLM with the agent for response generation:
```
Phone → /api/turn (audio) → Gateway ASR → transcript
↓
Gateway sends to Agent WS:
{"type": "turn.request",
"session_id": "abc123",
"transcript": "what caller said",
"request_id": "unique-id"}
↓
Agent replies on same WS:
{"reply": "agent's response",
"request_id": "unique-id"}
↓
Gateway TTS → audio → Phone
```
### Critical: `request_id` correlation
The agent **must** echo back the `request_id` from the `turn.request`. Without it, the gateway cannot match the reply to the pending turn and the request times out.
```jsonc
// Gateway sends:
{"type": "turn.request", "session_id": "abc", "transcript": "hello", "request_id": "a1b2c3"}
// Agent must reply:
{"reply": "Hi there!", "request_id": "a1b2c3"}
```
No `type` field needed in the reply — just `reply` + `request_id`.
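
On the agent side, the correlation rule reduces to copying `request_id` from the incoming message into the outgoing one. A minimal sketch, with the hypothetical helper name `build_reply`:

```python
import json


def build_reply(turn_request: dict, text: str) -> str:
    """Answer a turn.request, echoing its request_id unchanged."""
    if turn_request.get("type") != "turn.request":
        raise ValueError("not a turn.request message")
    # Only `reply` and `request_id` are sent; no `type` field is needed.
    return json.dumps({"reply": text, "request_id": turn_request["request_id"]})
```

Deriving the reply from the parsed request, instead of tracking IDs in separate state, makes it hard to answer with a stale or mismatched ID.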
### Timeout and fallback
If the agent doesn't reply within the timeout (default 60s, configurable via `agent_takeover_timeout` in config), the gateway falls back to the local LLM for **that single turn**. The takeover remains active — the next turn will try the agent again.
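
One way for an agent to avoid the fallback is to enforce its own deadline that is shorter than the gateway's. The sketch below wraps a reply generator in `asyncio.wait_for` and returns a short holding phrase if generation runs long. The 10-second safety margin, the filler text, and the `generate` callable are assumptions chosen for illustration, not values from the gateway.

```python
import asyncio
from typing import Awaitable, Callable

GATEWAY_TIMEOUT_S = 60.0  # default agent_takeover_timeout from this guide
SAFETY_MARGIN_S = 10.0    # assumption: headroom for network and TTS handoff


async def reply_within_budget(
    generate: Callable[[str], Awaitable[str]],
    transcript: str,
    budget_s: float = GATEWAY_TIMEOUT_S - SAFETY_MARGIN_S,
    filler: str = "One moment, please.",
) -> str:
    """Return the generated reply, or a short filler if it is too slow."""
    try:
        return await asyncio.wait_for(generate(transcript), timeout=budget_s)
    except asyncio.TimeoutError:
        # The slow generation task is cancelled; the agent still answers
        # the turn itself instead of letting the local LLM take it.
        return filler
```

Whether a filler phrase is preferable to the local LLM's answer depends on the use case; an agent that trusts the local LLM can simply let the turn time out.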
## Timing Model
Understanding timing is critical for a smooth takeover experience.
### Phone polling cadence
The phone app polls `/api/turn` in a tight loop:
1. Record audio chunk (~2-5s of speech)
2. POST to `/api/turn`
3. Wait for response (ASR + LLM/agent + TTS)
4. Play response audio
5. Go to step 1
The phone does NOT poll on a fixed interval — it sends the next turn as soon as playback finishes and new audio is captured. Typical turn cycle: 3-8 seconds.
### Inject timing
`inject` queues a pre-synthesized TTS message. It's delivered on the **next** `/api/turn` poll, **before** ASR/LLM processing:
```
Agent injects "Hello!" at T=0
↓
Phone polls /api/turn at T=3 (next natural poll)
↓
Gateway sees pending inject → returns inject audio immediately (skips ASR/LLM)
↓
Phone plays "Hello!" → polls again
```
**Key implications:**
- Inject is NOT instant — there's a delay of up to one poll cycle (3-8s)
- During takeover, the phone is usually waiting for the agent's reply, so the next poll happens quickly after the agent responds
- Multiple injects queue up — each delivered on successive polls
- Inject skips ASR entirely — the phone's recorded audio is ignored for that poll
### Takeover timing
```
T=0 Agent sends {"type": "takeover", "session_id": "..."}
T=0 Gateway immediately routes future turns to agent
T=0 Agent gets {"type": "takeover.ack", "ok": true}
T=3-8 Phone polls /api/turn → gateway ASR → turn.request sent to agent
T=3-8 Agent replies → gateway TTS → phone plays agent's response
```
Takeover takes effect instantly on the gateway side. The first `turn.request` arrives on the next phone poll.
### Release timing
```
T=0 Agent sends {"type": "release", "session_id": "..."}
T=0 Gateway removes takeover → local LLM handles future turns
T=0 Agent gets {"type": "release.ack", "ok": true}
T=3-8 Phone polls /api/turn → local LLM responds (no more turn.requests to agent)
```
Release is also instant. The agent continues receiving bus events (turn.complete, etc.) but no longer gets `turn.request` messages.
### Inject + Takeover ordering
When you inject a greeting AND takeover in quick succession:
```
T=0.0 Agent injects greeting text
T=0.5 Agent sends takeover
T=3 Phone polls → gets inject (greeting plays) — takeover is active but no turn.request yet
T=8 Phone polls again → NOW it's a takeover turn → turn.request sent to agent
```
The inject is consumed first (it takes priority in the turn endpoint), then takeover kicks in on the subsequent poll. This is the correct order for "inject greeting then take over."
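
The ordering rules from this section can be captured in a toy model. This is not the gateway's code; it is a sketch that assumes only what the text states: pending injects are served first, one per poll, and takeover only decides who answers once the inject queue is empty. The class name `ToyGateway` is hypothetical.

```python
from collections import deque


class ToyGateway:
    """Toy model of per-session turn routing; not the real gateway code."""

    def __init__(self) -> None:
        self.pending_injects: deque = deque()
        self.takeover_active = False

    def inject(self, text: str) -> None:
        self.pending_injects.append(text)

    def takeover(self) -> None:
        self.takeover_active = True

    def release(self) -> None:
        self.takeover_active = False

    def poll(self) -> str:
        """Return which path serves the next /api/turn poll."""
        if self.pending_injects:
            # Injects win: ASR and LLM/agent are skipped for this poll.
            return "inject:" + self.pending_injects.popleft()
        return "agent" if self.takeover_active else "local_llm"


gw = ToyGateway()
gw.inject("Hello!")
gw.takeover()
print([gw.poll() for _ in range(3)])  # ['inject:Hello!', 'agent', 'agent']
```

The first poll plays the greeting even though takeover is already active, which matches the T=3 and T=8 steps in the timeline above.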
### Dial-to-first-turn latency
```
T=0 Agent sends dial command
T=0 ADB broadcast sent to phone
T=1-3 Phone initiates outbound call
T=5-30 Callee picks up (depends on the person)
T=+1 Phone detects call connected, sends first /api/turn (greeting)
T=+2 Gateway processes greeting (forced_reply → TTS only, no ASR/LLM)
T=+5 Greeting plays, phone captures first real audio
T=+8 First real turn arrives at gateway
```
Total dial-to-first-real-turn: 10-40 seconds depending on pickup time.
## Session Lifecycle
Sessions have a TTL of **60 seconds** of inactivity (configurable via `session_ttl`). The phone polls every 3-8s during a call, so active calls never hit the TTL. The session auto-ends after 60s without polls if either of the following happens:
- The phone crashes or loses its USB connection
- The caller hangs up and the phone doesn't send `/api/session/end`
**Session detection after dial:** After dialing, the agent needs the session ID. Options:
1. **Watch bus events** — a `turn.started` event with `session_id` arrives when the call connects
2. **Poll `GET /api/agent/sessions`** — check for new session IDs
3. **Send `get_call_state`** — if you know the session ID
Option 1 (events) is most reliable and fastest.
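
Option 1 can be sketched as a filter over incoming WebSocket frames. The code assumes bus events carry a `type` field in the same way `turn.request` does, and that the agent tracks which session IDs it has already seen; the function name `find_new_session` is hypothetical.

```python
import json
from typing import Iterable, Optional, Set


def find_new_session(frames: Iterable[str], known: Set[str]) -> Optional[str]:
    """Return the first unseen session_id carried by a turn.started event."""
    for raw in frames:
        event = json.loads(raw)
        sid = event.get("session_id")
        if event.get("type") == "turn.started" and sid and sid not in known:
            return sid
    return None
```

Passing the set of known IDs avoids latching onto a session that was already running before the dial.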
## Complete Takeover Lifecycle
```
1. CONNECT ws://gateway:8996/api/agent/ws
2. DIAL {"type": "dial", "number": "+49..."}
WAIT for dial.ack (ok: true)
3. DISCOVER watch events for session_id (turn.started or session.started)
4. INJECT {"type": "inject", "session_id": "...", "text": "Custom greeting"}
5. TAKEOVER {"type": "takeover", "session_id": "..."}
WAIT for takeover.ack (ok: true)
6. HANDLE receive turn.request → reply with {reply, request_id}
REPEAT for N turns
7. RELEASE {"type": "release", "session_id": "..."}
WAIT for release.ack (ok: true)
8. OBSERVE watch turn.complete events → model != "agent" confirms local LLM resumed
9. HANGUP {"type": "hangup", "session_id": "..."} (optional — end call)
```
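
The nine steps above can be sketched as one coroutine that is independent of the WebSocket library in use. It takes two hypothetical callables, `send` and `recv`, that write and read one text frame each, plus an `answer` function that produces the reply text. As a simplification, frames that arrive while the code waits for a specific message type are discarded; a production agent would dispatch them instead.

```python
import asyncio
import json
from typing import Awaitable, Callable

Send = Callable[[str], Awaitable[None]]
Recv = Callable[[], Awaitable[str]]


async def next_of_type(recv: Recv, msg_type: str) -> dict:
    """Read frames until one of the given type arrives; skip the rest."""
    while True:
        msg = json.loads(await recv())
        if msg.get("type") == msg_type:
            return msg


async def run_takeover(send: Send, recv: Recv, number: str, greeting: str,
                       answer: Callable[[str], str], turns: int) -> None:
    async def cmd(msg_type: str, **fields: str) -> None:
        await send(json.dumps({"type": msg_type, **fields}))

    await cmd("dial", number=number)                         # 2. DIAL
    ack = await next_of_type(recv, "dial.ack")
    if not ack.get("ok"):
        raise RuntimeError(f"dial failed: {ack.get('detail')}")
    started = await next_of_type(recv, "turn.started")       # 3. DISCOVER
    sid = started["session_id"]
    await cmd("inject", session_id=sid, text=greeting)       # 4. INJECT
    await cmd("takeover", session_id=sid)                    # 5. TAKEOVER
    if not (await next_of_type(recv, "takeover.ack")).get("ok"):
        raise RuntimeError("takeover rejected")
    for _ in range(turns):                                   # 6. HANDLE
        req = await next_of_type(recv, "turn.request")
        await send(json.dumps({"reply": answer(req["transcript"]),
                               "request_id": req["request_id"]}))
    await cmd("release", session_id=sid)                     # 7. RELEASE
    await next_of_type(recv, "release.ack")
    while True:                                              # 8. OBSERVE
        done = await next_of_type(recv, "turn.complete")
        if done.get("model") != "agent":
            break  # local LLM answered this turn
    await cmd("hangup", session_id=sid)                      # 9. HANGUP
```

With the third-party `websockets` package, `send` and `recv` could plausibly be the `send` and `recv` methods of a connection opened against `ws://gateway:8996/api/agent/ws`. Keeping the transport behind two callables also lets the flow be exercised against a scripted list of frames.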
## Error Handling
| Scenario | What happens |
|----------|-------------|
| Agent WS disconnects during takeover | All takeovers auto-released, local LLM resumes |
| Agent doesn't reply within 60s | That turn falls back to local LLM; ta
... (truncated)