---
name: evaluate-presets
description: Use when testing Ralph's hat collection presets, validating preset configurations, or auditing the preset library for bugs and UX issues.
---

# Evaluate Presets

## Overview

Systematically test all hat collection presets using shell scripts that invoke the CLI directly, with no meta-orchestration layer in between.

## When to Use

- Testing preset configurations after changes
- Auditing the preset library for quality
- Validating new presets work correctly
- After modifying hat routing logic

## Quick Start

**Evaluate a single preset:**
```bash
./tools/evaluate-preset.sh tdd-red-green claude
```

**Evaluate all presets:**
```bash
./tools/evaluate-all-presets.sh claude
```

**Arguments:**
- First arg: preset name (without `.yml` extension)
- Second arg: backend (`claude` or `kiro`, defaults to `claude`)
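
For example, to run a single preset against the `kiro` backend instead of the default:

```bash
# Same script, explicit backend; omit the second argument to fall back to claude
./tools/evaluate-preset.sh adversarial-review kiro
```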

## Bash Tool Configuration

**IMPORTANT:** When invoking these scripts via the Bash tool, use these settings:

- **Both single-preset and full-suite evaluations:** Use `timeout: 600000` (10 minutes) and `run_in_background: true`

Since preset evaluations can run for hours (especially the full suite), **always run in background mode** and use the `TaskOutput` tool to check progress periodically.

**Example invocation pattern:**
```
Bash tool with:
  command: "./tools/evaluate-preset.sh tdd-red-green claude"
  timeout: 600000
  run_in_background: true
```

After launching, use `TaskOutput` with `block: false` to check status without waiting for completion.
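
A polling call in the same pattern might look like the sketch below; only `block: false` comes from this skill, and the field carrying the task identifier is an assumption about the tool's interface:

```
TaskOutput tool with:
  task_id: <id returned by the background Bash launch>   # field name assumed
  block: false
```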

## What the Scripts Do

### `evaluate-preset.sh`

1. Loads test task from `tools/preset-test-tasks.yml` (if `yq` available)
2. Creates merged config with evaluation settings
3. Runs Ralph with `--record-session` for metrics capture
4. Captures output logs, exit codes, and timing
5. Extracts metrics: iterations, hats activated, events published

**Output structure:**
```
.eval/
├── logs/<preset>/<timestamp>/
│   ├── output.log          # Full stdout/stderr
│   ├── session.jsonl       # Recorded session
│   ├── metrics.json        # Extracted metrics
│   ├── environment.json    # Runtime environment
│   └── merged-config.yml   # Config used
└── logs/<preset>/latest -> <timestamp>
```
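
Because `latest` always points at the most recent run, a quick way to inspect a finished run is to read its metrics file (field names as listed under "Interpreting Results" below) and skim the log:

```bash
# Show recorded metrics for the most recent tdd-red-green run
jq '{iterations, hats_activated, events_published, completed}' \
  .eval/logs/tdd-red-green/latest/metrics.json

# Skim the end of its output log
tail -n 40 .eval/logs/tdd-red-green/latest/output.log
```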

### `evaluate-all-presets.sh`

Runs all 12 presets sequentially and generates a summary:

```
.eval/results/<suite-id>/
├── SUMMARY.md              # Markdown report
├── <preset>.json           # Per-preset metrics
└── latest -> <suite-id>
```
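
To review a finished suite, everything sits under the `latest` symlink:

```bash
# Read the Markdown report for the most recent suite
cat .eval/results/latest/SUMMARY.md

# List the per-preset metrics files it captured
ls .eval/results/latest/*.json
```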

## Presets Under Evaluation

| Preset | Test Task |
|--------|-----------|
| `tdd-red-green` | Add `is_palindrome()` function |
| `adversarial-review` | Review user input handler for security |
| `socratic-learning` | Understand `HatRegistry` |
| `spec-driven` | Specify and implement `StringUtils::truncate()` |
| `mob-programming` | Implement a `Stack` data structure |
| `scientific-method` | Debug failing mock test assertion |
| `code-archaeology` | Understand history of `config.rs` |
| `performance-optimization` | Profile hat matching |
| `api-design` | Design a `Cache` trait |
| `documentation-first` | Document `RateLimiter` |
| `incident-response` | Respond to "tests failing in CI" |
| `migration-safety` | Plan v1 to v2 config migration |
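
When a change only touches a few presets, re-running just those is faster than the full suite; a minimal loop over the single-preset script (preset names taken from the table above) is enough:

```bash
# Re-run a hand-picked subset against the claude backend
for preset in tdd-red-green adversarial-review spec-driven; do
  ./tools/evaluate-preset.sh "$preset" claude
done
```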

## Interpreting Results

**Exit codes from `evaluate-preset.sh`:**
- `0`: Success (LOOP_COMPLETE reached)
- `124`: Timeout (preset hung or took too long)
- Other: Failure (check `output.log`)

**Metrics in `metrics.json`** (a quick pass/fail check using these fields is sketched below):
- `iterations`: How many event loop cycles ran
- `hats_activated`: Which hats were triggered
- `events_published`: Total events emitted
- `completed`: Whether the completion promise was reached
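
A minimal pass/fail check built on those fields, assuming `completed` is a boolean as the name suggests:

```bash
# Flag runs that never reached their completion promise
jq -r 'if .completed then "PASS" else "CHECK output.log" end' \
  .eval/logs/<preset>/latest/metrics.json
```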

## Hat Routing Performance

**Critical:** Validate that hats get fresh context per Tenet #1 ("Fresh Context Is Reliability").

### What Good Looks Like

Each hat should execute in its **own iteration**:
```
Iter 1: Ralph → publishes starting event → STOPS
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
Iter 4: Hat C → does work → LOOP_COMPLETE
```

### Red Flags (Same-Iteration Hat Switching)

**BAD:** Multiple hat personas in one iteration:
```
Iter 2: Ralph does Blue Team + Red Team + Fixer work
        ^^^ All in one bloated context!
```

### How to Check

**1. Compare the iteration count (`output.log`) with the events published (`session.jsonl`):**
```bash
# Count iterations
grep -c "_meta.loop_start\|ITERATION" .eval/logs/<preset>/latest/output.log

# Count events published
grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl
```

**Expected:** iterations ≈ events published (one event per iteration).

**Bad sign:** 2-3 iterations but 5+ events, meaning all the work happened in a single iteration. A small wrapper that prints both counts side by side is sketched below.
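
One way to compare the two numbers is a small wrapper around the same greps:

```bash
# Print both counts for the latest run of a preset (pass the preset name as $1)
preset="$1"
iters=$(grep -c "_meta.loop_start\|ITERATION" ".eval/logs/$preset/latest/output.log")
events=$(grep -c "bus.publish" ".eval/logs/$preset/latest/session.jsonl")
echo "iterations=$iters events_published=$events"
```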

**2. Check for same-iteration hat switching in `output.log`:**
```bash
grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
    .eval/logs/<preset>/latest/output.log
```

**Red flag:** Hat-switching phrases WITHOUT an ITERATION separator between them.

**3. Check event timestamps in `session.jsonl`:**
```bash
jq -r '.ts' .eval/logs/<preset>/latest/session.jsonl
```

**Red flag:** Multiple events with identical timestamps (published in same iteration).
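
Duplicate timestamps can be surfaced directly rather than scanned by eye:

```bash
# Print any timestamp that appears more than once in the recorded session
jq -r '.ts' .eval/logs/<preset>/latest/session.jsonl | sort | uniq -d
```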

### Routing Performance Triage

| Pattern | Diagnosis | Action |
|---------|-----------|--------|
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |

### Root Cause Checklist

If hat routing is broken:

1. **Check workflow prompt** in `hatless_ralph.rs` (a quick grep for this is sketched after this list):
   - Does it say "CRITICAL: STOP after publishing"?
   - Is the DELEGATE section clear about yielding control?

2. **Check hat instructions** propagation:
   - Does `HatInfo` include `instructions` field?
   - Are instructions rendered in the `## HATS` section?

3. **Check events context**:
   - Is `build_prompt(context)` using the context parameter?
   - Does prompt include `## PENDING EVENTS` section?
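
For item 1, a quick grep confirms the STOP instruction is still present; the phrase comes from the checklist above, and the search is recursive because the exact location of `hatless_ralph.rs` is not assumed here:

```bash
# Look for the stop-after-publishing instruction in the workflow prompt source
grep -rn "STOP after publishing" --include="hatless_ralph.rs" .
```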

## Autonomous Fix Workflow

After evaluation, delegate fixes to subagents:

### Step 1: Triage Results

Read `.eval/results/latest/SUMMARY.md` and identify:
- `❌ FAIL` → Create code tasks for fixes
- `⏱️ TIMEOUT` → Investigate infinite loops
- `⚠️ PARTIAL` → Check for edge cases
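
A quick filter over the summary pulls out the presets that need attention:

```bash
# List presets whose latest suite run did not cleanly pass
grep -E "FAIL|TIMEOUT|PARTIAL" .eval/results/latest/SUMMARY.md
```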

### Step 2: Dispatch Task Creation

For each issue, spawn a Task agent:

```
"Use /code-task-generator to create a task for fixing: [issue from evaluation]
Output to: tasks/preset-fixes/"
```

### Step 3: Dispatch Implementation

For each created task:

```
"Use /code-assist to implement: tasks/preset-fixes/[task-file].code-task.md
Mode: auto"
```

### Step 4: Re-evaluate

```bash
./tools/evaluate-preset.sh <fixed-preset> claude
```

## Prerequisites

- **yq** (optional): For loading test tasks from YAML. Install: `brew install yq`
- **Cargo**: Must be able to build Ralph
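
A quick preflight before kicking off a long run:

```bash
# yq is optional; it only enables loading test tasks from YAML
command -v yq >/dev/null || echo "yq not found (optional; brew install yq)"

# Ralph must build cleanly before the evaluation scripts can run it
cargo build
```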

## Related Files

- `tools/evaluate-preset.sh`: Single preset evaluation
- `tools/evaluate-all-presets.sh`: Full suite evaluation
- `tools/preset-test-tasks.yml`: Test task definitions
- `tools/preset-evaluation-findings.md`: Manual findings doc
- `presets/`: The preset collection being evaluated