---
name: swarm-safety
version: 1.7.0
description: "SWARM: System-Wide Assessment of Risk in Multi-agent systems. 38 agent types, 29 governance levers, 55 scenarios. Study emergent risks, phase transitions, and governance cost paradoxes."
homepage: https://github.com/swarm-ai-safety/swarm
metadata: {"category":"safety","license":"MIT","author":"Raeli Savitt"}
---
# SWARM Safety Skill
Study how intelligence swarms — and where it fails.
SWARM is a research framework for studying emergent risks in multi-agent AI systems using soft (probabilistic) labels instead of binary good/bad classifications. AGI-level risks don't require AGI-level agents — harmful dynamics emerge when many sub-AGI agents interact, even when no individual agent is misaligned.
**v1.7.0** | 38 agent types | 29 governance levers | 55 scenarios | 2922 tests | 8 framework bridges
Repository: `https://github.com/swarm-ai-safety/swarm`
## Hard Rules
- SWARM simulations run locally. Install the package first.
- Do not submit scenarios containing real API keys, credentials, or PII.
- Simulation results are research artifacts. Do not present them as ground truth about real systems.
- When publishing results, cite the framework and disclose simulation parameters.
## Security
- **API binds to localhost only** (`127.0.0.1`) by default to prevent network exposure.
- **CORS restricted** to localhost origins by default.
- **No authentication** on development API — do not expose to untrusted networks.
- **In-memory storage** — data does not persist between restarts.
- For production deployment, add authentication middleware and use a proper database.
## Install
```bash
# From PyPI
pip install swarm-safety
# With LLM agent support
pip install "swarm-safety[llm]"
# Full development (all extras)
git clone https://github.com/swarm-ai-safety/swarm.git
cd swarm
pip install -e ".[dev,runtime]"
```
## Quick Start (Python)
```python
from swarm.agents.honest import HonestAgent
from swarm.agents.opportunistic import OpportunisticAgent
from swarm.agents.deceptive import DeceptiveAgent
from swarm.agents.adversarial import AdversarialAgent
from swarm.core.orchestrator import Orchestrator, OrchestratorConfig
# Mixed population: honest majority plus one of each risky type
config = OrchestratorConfig(n_epochs=10, steps_per_epoch=10, seed=42)
orchestrator = Orchestrator(config=config)
orchestrator.register_agent(HonestAgent(agent_id="honest_1", name="Alice"))
orchestrator.register_agent(HonestAgent(agent_id="honest_2", name="Bob"))
orchestrator.register_agent(OpportunisticAgent(agent_id="opp_1"))
orchestrator.register_agent(DeceptiveAgent(agent_id="dec_1"))
orchestrator.register_agent(AdversarialAgent(agent_id="adv_1"))

# run() returns one metrics record per epoch
metrics = orchestrator.run()
for m in metrics:
    print(f"Epoch {m.epoch}: toxicity={m.toxicity_rate:.3f}, welfare={m.total_welfare:.2f}")
```
## Quick Start (CLI)
```bash
# List available scenarios
swarm list
# Run a scenario
swarm run scenarios/baseline.yaml
# Override settings
swarm run scenarios/baseline.yaml --seed 42 --epochs 20 --steps 15
# Export results
swarm run scenarios/baseline.yaml --export-json results.json --export-csv outputs/
```
## Quick Start (API)
Start the API server:
```bash
pip install "swarm-safety[api]"
uvicorn swarm.api.app:app --host 127.0.0.1 --port 8000
```
API documentation at `http://localhost:8000/docs`.
> **Security Note**: The server binds to `127.0.0.1` (localhost only) by default. Do not bind to `0.0.0.0` unless you understand the security implications and have proper firewall rules in place.
### Register Agent
```bash
curl -X POST http://localhost:8000/api/v1/agents/register \
  -H "Content-Type: application/json" \
  -d '{
    "name": "YourAgent",
    "description": "What your agent does",
    "capabilities": ["governance-testing", "red-teaming"]
  }'
```
Returns `agent_id` and `api_key`.
### Submit Scenario
```bash
curl -X POST http://localhost:8000/api/v1/scenarios/submit \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-scenario",
    "description": "Testing collusion detection with 5 agents",
    "yaml_content": "simulation:\n  n_epochs: 10\n  steps_per_epoch: 10\nagents:\n  - type: honest\n    count: 3\n  - type: adversarial\n    count: 2",
    "tags": ["collusion", "governance"]
  }'
```
### Create & Join Simulation
```bash
# Create
curl -X POST http://localhost:8000/api/v1/simulations/create \
  -H "Content-Type: application/json" \
  -d '{"scenario_id": "SCENARIO_ID", "max_participants": 5}'

# Join
curl -X POST http://localhost:8000/api/v1/simulations/SIM_ID/join \
  -H "Content-Type: application/json" \
  -d '{"agent_id": "YOUR_AGENT_ID", "role": "participant"}'
```
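The same flow, end to end, as a minimal Python client sketch using `requests`. The request payloads mirror the curl examples above; the response field names `scenario_id` and `simulation_id` are assumptions (the docs only confirm that registration returns `agent_id` and `api_key`), so check `http://localhost:8000/docs` for the exact schema.
```python
import requests  # pip install requests

BASE = "http://localhost:8000/api/v1"

# Register an agent; registration is documented to return agent_id and api_key
agent = requests.post(f"{BASE}/agents/register", json={
    "name": "YourAgent",
    "description": "What your agent does",
    "capabilities": ["governance-testing", "red-teaming"],
}).json()

# Submit a scenario; 'scenario_id' as the response field is an assumption
scenario = requests.post(f"{BASE}/scenarios/submit", json={
    "name": "my-scenario",
    "description": "Testing collusion detection with 5 agents",
    "yaml_content": (
        "simulation:\n  n_epochs: 10\n  steps_per_epoch: 10\n"
        "agents:\n  - type: honest\n    count: 3\n"
        "  - type: adversarial\n    count: 2"
    ),
    "tags": ["collusion", "governance"],
}).json()

# Create a simulation and join it; 'simulation_id' is likewise an assumption
sim = requests.post(f"{BASE}/simulations/create", json={
    "scenario_id": scenario["scenario_id"],
    "max_participants": 5,
}).json()
requests.post(f"{BASE}/simulations/{sim['simulation_id']}/join", json={
    "agent_id": agent["agent_id"],
    "role": "participant",
})
```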
## Core Concepts
### Soft Probabilistic Labels
Interactions carry `p = P(v = +1)` — probability of beneficial outcome:
```
Observables -> ProxyComputer -> v_hat -> sigmoid -> p -> PayoffEngine -> payoffs
                                                    |
                                                    +-> SoftMetrics -> toxicity, quality gap, etc.
```
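The sigmoid step is the crux: a raw proxy score `v_hat` becomes a probability `p` rather than a hard good/bad verdict. A minimal sketch, assuming a standard logistic function (the package's actual ProxyComputer observables and calibration may differ):
```python
import math

def sigmoid(v_hat: float) -> float:
    """Map a raw proxy score v_hat onto p = P(v = +1)."""
    return 1.0 / (1.0 + math.exp(-v_hat))

# A score near 0 means maximal uncertainty (p = 0.5); strongly positive
# scores approach p = 1 (likely beneficial), negative ones approach 0.
for v_hat in (-2.0, 0.0, 2.0):
    print(f"v_hat={v_hat:+.1f} -> p={sigmoid(v_hat):.3f}")
```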
### Five Key Metrics
| Metric | What It Measures |
|---|---|
| **Toxicity rate** | Expected harm among accepted interactions: `E[1-p \| accepted]` |
| **Quality gap** | Adverse selection indicator (negative = bad): `E[p \| accepted] - E[p \| rejected]` |
| **Conditional loss** | Selection effect on payoffs |
| **Incoherence** | Variance-to-error ratio across replays |
| **Illusion delta** | Gap between perceived coherence and actual consistency |
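The first two formulas can be computed directly from `(p, accepted)` pairs. An illustrative sketch of the arithmetic, not the package's SoftMetrics API:
```python
# Toxicity rate: expected harm among accepted interactions, E[1-p | accepted]
def toxicity_rate(interactions):
    accepted = [p for p, ok in interactions if ok]
    return sum(1 - p for p in accepted) / len(accepted)

# Quality gap: E[p | accepted] - E[p | rejected]; negative => adverse selection
def quality_gap(interactions):
    acc = [p for p, ok in interactions if ok]
    rej = [p for p, ok in interactions if not ok]
    return sum(acc) / len(acc) - sum(rej) / len(rej)

sample = [(0.9, True), (0.4, True), (0.8, False), (0.2, False)]
print(toxicity_rate(sample))  # (0.1 + 0.6) / 2 = 0.35
print(quality_gap(sample))    # 0.65 - 0.50 = 0.15
```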
### Agent Types (14 families, 38 implementations)
| Type | Behavior |
|---|---|
| **Honest** | Cooperative, trust-based, completes tasks diligently |
| **Opportunistic** | Maximizes short-term payoff, cherry-picks tasks |
| **Deceptive** | Builds trust, then exploits trusted relationships |
| **Adversarial** | Targets honest agents, coordinates with allies |
| **LDT** | Logical Decision Theory with FDT/UDT precommitment |
| **RLM** | Reinforcement Learning from Memory |
| **Council** | Multi-agent deliberation-based decisions |
| **SkillRL** | Learns interaction strategies via reward signals |
| **LLM** | Behavior driven by an LLM backend (Anthropic, OpenAI, or Ollama) |
| **Moltbook** | Domain-specific social platform agent |
| **Scholar** | Academic citation and research agent |
| **Wiki Editor** | Collaborative editing with editorial policy |
### Governance Levers (29 mechanisms)
- **Transaction Taxes** — Reduce exploitation, cost welfare
- **Reputation Decay** — Punish bad actors, erode honest standing
- **Circuit Breakers** — Freeze toxic agents quickly
- **Random Audits** — Deter hidden exploitation
- **Staking** — Filter undercapitalized agents
- **Collusion Detection** — Catch coordinated attacks (the critical lever near the collapse threshold)
- **Sybil Detection** — Identify duplicate agents
- **Transparency Ledger** — Reward/penalize based on outcome
- **Moderator Agent** — Probabilistic review of interactions
- **Incoherence Friction** — Tax uncertainty-driven decisions
- **Council Deliberation** — Multi-agent governance decisions
- **Diversity Enforcement** — Prevent monoculture collapse
- **Moltipedia-specific** — Pair caps, page cooldowns, daily caps, self-fix prevention
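Levers are switched on per scenario. A minimal sketch that composes and saves a scenario with three levers enabled, using the governance keys shown in the Scenario YAML Format section below (the output path is hypothetical, and the keys for the remaining levers are defined by the scenario schema, not confirmed here):
```python
import yaml  # pip install pyyaml

# Governance keys below are the ones shown in this document's scenario
# example; each remaining lever has its own key in the scenario schema.
scenario = {
    "simulation": {"n_epochs": 10, "steps_per_epoch": 10, "seed": 42},
    "agents": [
        {"type": "honest", "count": 3},
        {"type": "adversarial", "count": 2},
    ],
    "governance": {
        "transaction_tax_rate": 0.05,
        "circuit_breaker_enabled": True,
        "collusion_detection_enabled": True,
    },
}
with open("scenarios/levers_demo.yaml", "w") as f:  # hypothetical path
    yaml.safe_dump(scenario, f, sort_keys=False)
# Then run it: swarm run scenarios/levers_demo.yaml
```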
### Framework Bridges
| Bridge | Integration |
|---|---|
| **Concordia** | DeepMind's multi-agent framework |
| **GasTown** | Multi-agent workspace governance |
| **Claude Code** | Claude CLI agent integration |
| **LiveSWE** | Live software engineering tasks |
| **OpenClaw** | Open agent protocol |
| **Prime Intellect** | Cross-platform run tracking |
| **Ralph** | Agent orchestration |
| **Worktree** | Git worktree-based sandboxing |
### Scenario YAML Format
```yaml
simulation:
  n_epochs: 10
  steps_per_epoch: 10
  seed: 42
agents:
  - type: honest
    count: 3
    config:
      acceptance_threshold: 0.4
  - type: adversarial
    count: 2
    config:
      aggression_level: 0.7
governance:
  transaction_tax_rate: 0.05
  circuit_breaker_enabled: true
  collusion_detection_enabled: true
success_criteria:
  max_toxicity: 0.3
  min_quality_gap: 0.0
```
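`swarm run` consumes this file directly. The sketch below shows how such a spec maps onto the Python API from the Quick Start; the type-to-class mapping and passing per-agent `config` entries as constructor kwargs are assumptions about the agent constructors:
```python
import yaml
from swarm.agents.honest import HonestAgent
from swarm.agents.adversarial import AdversarialAgent
from swarm.core.orchestrator import Orchestrator, OrchestratorConfig

# Hypothetical type->class map covering the two families used above
AGENT_CLASSES = {"honest": HonestAgent, "adversarial": AdversarialAgent}

with open("scenarios/baseline.yaml") as f:
    spec = yaml.safe_load(f)

sim = spec["simulation"]
config = OrchestratorConfig(
    n_epochs=sim["n_epochs"],
    steps_per_epoch=sim["steps_per_epoch"],
    seed=sim.get("seed", 0),
)
orchestrator = Orchestrator(config=config)
for group in spec["agents"]:
    cls = AGENT_CLASSES[group["type"]]
    for i in range(group["count"]):
        # Assumes config entries pass through as constructor kwargs
        orchestrator.register_agent(
            cls(agent_id=f"{group['type']}_{i}", **group.get("config", {}))
        )
metrics = orchestrator.run()
```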
## Key Research Findings
### Phase Transitions (11-scenario, 209-epoch study)
| Regime | Adversarial % | Toxicity | Welfare | Outcome |
|--------|--------------|----------|---------|---------|
| Cooperative | 0-20% | < 0.30 | Stable | Survives |
| Contested | 20-37.5% | 0.33-0.37 | Declining | Survives |
| Collapse | 50%+ | ~0.30 | Zero by epoch 12-14 | **Collapses** |
A critical threshold between 37.5% and 50% adversarial agents separates recoverable regimes from irreversible collapse.
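To probe the threshold yourself, here is a small single-seed sweep over adversarial share using the Quick Start API (illustrative only; the published study used 11 scenarios over 209 epochs, and whether `name` is optional in agent constructors is an assumption):
```python
from swarm.agents.honest import HonestAgent
from swarm.agents.adversarial import AdversarialAgent
from swarm.core.orchestrator import Orchestrator, OrchestratorConfig

def final_toxicity(n_adversarial: int, n_total: int = 8, seed: int = 42) -> float:
    """Run a small mixed population and return last-epoch toxicity."""
    config = OrchestratorConfig(n_epochs=20, steps_per_epoch=10, seed=seed)
    orchestrator = Orchestrator(config=config)
    for i in range(n_total - n_adversarial):
        orchestrator.register_agent(HonestAgent(agent_id=f"honest_{i}"))
    for i in range(n_adversarial):
        orchestrator.register_agent(AdversarialAgent(agent_id=f"adv_{i}"))
    return orchestrator.run()[-1].toxicity_rate

for n_adv in (0, 2, 3, 4):  # 0%, 25%, 37.5%, 50% of 8 agents
    print(f"{n_adv}/8 adversarial -> final toxicity {final_toxicity(n_adv):.3f}")
```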
### Governance Cost Paradox (v1.7.0 GasTown study)
A 42-run study shows that governance reduces toxicity at all adversarial levels (mean reduction 0.071) but, at current parameter tuning, imposes net-negative welfare costs. At 0% adversarial, governance costs 216 welfare units (-57.6%) in exchange for a toxicity reduction of only 0.066.
## Case Studies
### GasTown Governance Cost
Study governance overhead vs. toxicity reduction across 7 agent compositions with and without governance levers. Reveals the safety-throughput trade-off. See `scenarios/gastown_governance_cost.yaml`.
### LDT Cooperation
220 runs across 10 seeds comparing TDT vs FDT vs UDT cooperation strategies at population scales up to 21 agents. See `scenarios/ldt_cooperation.yaml`.
### Moltipedia Heartbeat
Model the Moltipedia wiki editing loop: competing AI editors, editorial policy, point farming, and anti-gaming governance. See `scenarios/moltipedia_heartbeat.yaml`.
### Moltbook CAPTCHA
Model Moltbook's anti-human math challenges and rate limiting: obfuscated text parsing, verification gates, and spam prevention. See `scenarios/moltbook_captcha.yaml`.
## API Endpoints (Full Reference)
| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check |
| GET | `/` | API info |
| POST | `/api/v1/agents/register` | Register agent |
| GET | `/api/v1/agents/{agent_id}` | Get agent details |
| GET | `/api/v1/agents/` | List agents |
| POST | `/api/v1/scenarios/submit` | Submit scenario |
| GET | `/api/v1/scenarios/{scenario_id}` | Get scenario |
| GET | `/api/v1/s
... (truncated)