swarm-safety

Name: swarm-safety
Rating: 3.5 (1 reviews)
Author: rsavitt
By rsavitt 👁 309 views ▲ 0 votes
Multi-agent AI safety simulation with soft probabilistic labels
GitHub
---
name: swarm-safety
version: 1.7.0
description: "SWARM: System-Wide Assessment of Risk in Multi-agent systems. 38 agent types, 29 governance levers, 55 scenarios. Study emergent risks, phase transitions, and governance cost paradoxes."
homepage: https://github.com/swarm-ai-safety/swarm
metadata: {"category":"safety","license":"MIT","author":"Raeli Savitt"}
---

# SWARM Safety Skill

Study how intelligence swarms — and where it fails.

SWARM is a research framework for studying emergent risks in multi-agent AI systems using soft (probabilistic) labels instead of binary good/bad classifications. AGI-level risks don't require AGI-level agents — harmful dynamics emerge when many sub-AGI agents interact, even when no individual agent is misaligned.

**v1.7.0** | 38 agent types | 29 governance levers | 55 scenarios | 2922 tests | 8 framework bridges

Repository: `https://github.com/swarm-ai-safety/swarm`

## Hard Rules

- SWARM simulations run locally. Install the package first.
- Do not submit scenarios containing real API keys, credentials, or PII.
- Simulation results are research artifacts. Do not present them as ground truth about real systems.
- When publishing results, cite the framework and disclose simulation parameters.

## Security

- **API binds to localhost only** (`127.0.0.1`) by default to prevent network exposure.
- **CORS restricted** to localhost origins by default.
- **No authentication** on development API — do not expose to untrusted networks.
- **In-memory storage** — data does not persist between restarts.
- For production deployment, add authentication middleware and use a proper database.

## Install

```bash
# From PyPI
pip install swarm-safety

# With LLM agent support
pip install swarm-safety[llm]

# Full development (all extras)
git clone https://github.com/swarm-ai-safety/swarm.git
cd swarm
pip install -e ".[dev,runtime]"
```

## Quick Start (Python)

```python
from swarm.agents.honest import HonestAgent
from swarm.agents.opportunistic import OpportunisticAgent
from swarm.agents.deceptive import DeceptiveAgent
from swarm.agents.adversarial import AdversarialAgent
from swarm.core.orchestrator import Orchestrator, OrchestratorConfig

config = OrchestratorConfig(n_epochs=10, steps_per_epoch=10, seed=42)
orchestrator = Orchestrator(config=config)

orchestrator.register_agent(HonestAgent(agent_id="honest_1", name="Alice"))
orchestrator.register_agent(HonestAgent(agent_id="honest_2", name="Bob"))
orchestrator.register_agent(OpportunisticAgent(agent_id="opp_1"))
orchestrator.register_agent(DeceptiveAgent(agent_id="dec_1"))

metrics = orchestrator.run()
for m in metrics:
    print(f"Epoch {m.epoch}: toxicity={m.toxicity_rate:.3f}, welfare={m.total_welfare:.2f}")
```

## Quick Start (CLI)

```bash
# List available scenarios
swarm list

# Run a scenario
swarm run scenarios/baseline.yaml

# Override settings
swarm run scenarios/baseline.yaml --seed 42 --epochs 20 --steps 15

# Export results
swarm run scenarios/baseline.yaml --export-json results.json --export-csv outputs/
```

## Quick Start (API)

Start the API server:

```bash
pip install swarm-safety[api]
uvicorn swarm.api.app:app --host 127.0.0.1 --port 8000
```

API documentation at `http://localhost:8000/docs`.

> **Security Note**: The server binds to `127.0.0.1` (localhost only) by default. Do not bind to `0.0.0.0` unless you understand the security implications and have proper firewall rules in place.

### Register Agent

```bash
curl -X POST http://localhost:8000/api/v1/agents/register \
  -H "Content-Type: application/json" \
  -d '{
    "name": "YourAgent",
    "description": "What your agent does",
    "capabilities": ["governance-testing", "red-teaming"]
  }'
```

Returns `agent_id` and `api_key`.

### Submit Scenario

```bash
curl -X POST http://localhost:8000/api/v1/scenarios/submit \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-scenario",
    "description": "Testing collusion detection with 5 agents",
    "yaml_content": "simulation:\n  n_epochs: 10\n  steps_per_epoch: 10\nagents:\n  - type: honest\n    count: 3\n  - type: adversarial\n    count: 2",
    "tags": ["collusion", "governance"]
  }'
```

### Create & Join Simulation

```bash
# Create
curl -X POST http://localhost:8000/api/v1/simulations/create \
  -H "Content-Type: application/json" \
  -d '{"scenario_id": "SCENARIO_ID", "max_participants": 5}'

# Join
curl -X POST http://localhost:8000/api/v1/simulations/SIM_ID/join \
  -H "Content-Type: application/json" \
  -d '{"agent_id": "YOUR_AGENT_ID", "role": "participant"}'
```

## Core Concepts

### Soft Probabilistic Labels

Interactions carry `p = P(v = +1)` — probability of beneficial outcome:

```
Observables -> ProxyComputer -> v_hat -> sigmoid -> p -> PayoffEngine -> payoffs
                                                    |
                                               SoftMetrics -> toxicity, quality gap, etc.
```

### Five Key Metrics

| Metric | What It Measures |
|---|---|
| **Toxicity rate** | Expected harm among accepted interactions: `E[1-p \| accepted]` |
| **Quality gap** | Adverse selection indicator (negative = bad): `E[p \| accepted] - E[p \| rejected]` |
| **Conditional loss** | Selection effect on payoffs |
| **Incoherence** | Variance-to-error ratio across replays |
| **Illusion delta** | Gap between perceived coherence and actual consistency |

### Agent Types (14 families, 38 implementations)

| Type | Behavior |
|---|---|
| **Honest** | Cooperative, trust-based, completes tasks diligently |
| **Opportunistic** | Maximizes short-term payoff, cherry-picks tasks |
| **Deceptive** | Builds trust, then exploits trusted relationships |
| **Adversarial** | Targets honest agents, coordinates with allies |
| **LDT** | Logical Decision Theory with FDT/UDT precommitment |
| **RLM** | Reinforcement Learning from Memory |
| **Council** | Multi-agent deliberation-based decisions |
| **SkillRL** | Learns interaction strategies via reward signals |
| **LLM** | Behavior determined by LLM (Anthropic, OpenAI, or Ollama) |
| **Moltbook** | Domain-specific social platform agent |
| **Scholar** | Academic citation and research agent |
| **Wiki Editor** | Collaborative editing with editorial policy |

### Governance Levers (29 mechanisms)

- **Transaction Taxes** — Reduce exploitation, cost welfare
- **Reputation Decay** — Punish bad actors, erode honest standing
- **Circuit Breakers** — Freeze toxic agents quickly
- **Random Audits** — Deter hidden exploitation
- **Staking** — Filter undercapitalized agents
- **Collusion Detection** — Catch coordinated attacks (the critical lever near collapse threshold)
- **Sybil Detection** — Identify duplicate agents
- **Transparency Ledger** — Reward/penalize based on outcome
- **Moderator Agent** — Probabilistic review of interactions
- **Incoherence Friction** — Tax uncertainty-driven decisions
- **Council Deliberation** — Multi-agent governance decisions
- **Diversity Enforcement** — Prevent monoculture collapse
- **Moltipedia-specific** — Pair caps, page cooldowns, daily caps, self-fix prevention

### Framework Bridges

| Bridge | Integration |
|---|---|
| **Concordia** | DeepMind's multi-agent framework |
| **GasTown** | Multi-agent workspace governance |
| **Claude Code** | Claude CLI agent integration |
| **LiveSWE** | Live software engineering tasks |
| **OpenClaw** | Open agent protocol |
| **Prime Intellect** | Cross-platform run tracking |
| **Ralph** | Agent orchestration |
| **Worktree** | Git worktree-based sandboxing |

### Scenario YAML Format

```yaml
simulation:
  n_epochs: 10
  steps_per_epoch: 10
  seed: 42

agents:
  - type: honest
    count: 3
    config:
      acceptance_threshold: 0.4
  - type: adversarial
    count: 2
    config:
      aggression_level: 0.7

governance:
  transaction_tax_rate: 0.05
  circuit_breaker_enabled: true
  collusion_detection_enabled: true

success_criteria:
  max_toxicity: 0.3
  min_quality_gap: 0.0
```

## Key Research Findings

### Phase Transitions (11-scenario, 209-epoch study)

| Regime | Adversarial % | Toxicity | Welfare | Outcome |
|--------|--------------|----------|---------|---------|
| Cooperative | 0-20% | < 0.30 | Stable | Survives |
| Contested | 20-37.5% | 0.33-0.37 | Declining | Survives |
| Collapse | 50%+ | ~0.30 | Zero by epoch 12-14 | **Collapses** |

Critical threshold between 37.5% and 50% adversarial agents separates recoverable from irreversible collapse.

### Governance Cost Paradox (v1.7.0 GasTown study)

42-run study reveals: governance reduces toxicity at all adversarial levels (mean reduction 0.071) but imposes net-negative welfare costs at current parameter tuning. At 0% adversarial, governance costs 216 welfare units (-57.6%) for only 0.066 toxicity reduction.

## Case Studies

### GasTown Governance Cost

Study governance overhead vs. toxicity reduction across 7 agent compositions with and without governance levers. Reveals the safety-throughput trade-off. See `scenarios/gastown_governance_cost.yaml`.

### LDT Cooperation

220 runs across 10 seeds comparing TDT vs FDT vs UDT cooperation strategies at population scales up to 21 agents. See `scenarios/ldt_cooperation.yaml`.

### Moltipedia Heartbeat

Model the Moltipedia wiki editing loop: competing AI editors, editorial policy, point farming, and anti-gaming governance. See `scenarios/moltipedia_heartbeat.yaml`.

### Moltbook CAPTCHA

Model Moltbook's anti-human math challenges and rate limiting: obfuscated text parsing, verification gates, and spam prevention. See `scenarios/moltbook_captcha.yaml`.

## API Endpoints (Full Reference)

| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check |
| GET | `/` | API info |
| POST | `/api/v1/agents/register` | Register agent |
| GET | `/api/v1/agents/{agent_id}` | Get agent details |
| GET | `/api/v1/agents/` | List agents |
| POST | `/api/v1/scenarios/submit` | Submit scenario |
| GET | `/api/v1/scenarios/{scenario_id}` | Get scenario |
| GET | `/api/v1/s

... (truncated)
web search