---
name: skill-reviewer
description: Review and audit agent skills (SKILL.md files) for quality, correctness, and effectiveness. Use when evaluating a skill before publishing, reviewing someone else's skill, scoring skill quality, identifying defects in skill content, or improving an existing skill.
metadata: {"clawdbot":{"emoji":"🔍","requires":{"anyBins":["npx"]},"os":["linux","darwin","win32"]}}
---
# Skill Reviewer
Audit agent skills (SKILL.md files) for quality, correctness, and completeness. It provides a structured review framework with a scoring rubric, defect checklists, and improvement recommendations.
## When to Use
- Reviewing a skill before publishing to the registry
- Evaluating a skill you downloaded from the registry
- Auditing your own skills for quality improvements
- Comparing skills in the same category
- Deciding whether a skill is worth installing
## Review Process
### Step 1: Structural Check
Verify the skill has the required structure. Read the file and check each item:
```
STRUCTURAL CHECKLIST:
[ ] Valid YAML frontmatter (opens and closes with ---)
[ ] `name` field present and is a valid slug (lowercase, hyphenated)
[ ] `description` field present and non-empty
[ ] `metadata` field present with valid JSON
[ ] `metadata.clawdbot.emoji` is a single emoji
[ ] `metadata.clawdbot.requires.anyBins` lists real CLI tools
[ ] Title heading (# Title) immediately after frontmatter
[ ] Summary paragraph after title
[ ] "When to Use" section present
[ ] At least 3 main content sections
[ ] "Tips" section present at the end
```
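Most of these items can be probed mechanically before reading in depth. A minimal grep-based sketch (assumes the path shown; it is not a substitute for a real YAML parse):
```bash
# Rough structural probe; a sketch, not a full YAML/Markdown validator
f=skills/my-skill/SKILL.md
head -1 "$f" | grep -qx -- '---'   && echo "OK   frontmatter opens"    || echo "FAIL missing opening ---"
grep -q '^name: '         "$f"     && echo "OK   name present"         || echo "FAIL missing name"
grep -q '^description: '  "$f"     && echo "OK   description present"  || echo "FAIL missing description"
grep -q '^metadata: '     "$f"     && echo "OK   metadata present"     || echo "FAIL missing metadata"
grep -q '^## When to Use' "$f"     && echo "OK   When to Use section"  || echo "FAIL missing When to Use"
grep -q '^## Tips'        "$f"     && echo "OK   Tips section"         || echo "FAIL missing Tips"
```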
### Step 2: Frontmatter Quality
#### Description field audit
The description is the most impactful field. Evaluate it against these criteria:
```
DESCRIPTION SCORING:
[2] Starts with what the skill does (active verb)
GOOD: "Write Makefiles for any project type."
BAD: "This skill covers Makefiles."
BAD: "A comprehensive guide to Make."
[2] Includes trigger phrases ("Use when...")
GOOD: "Use when setting up build automation, defining multi-target builds"
BAD: No trigger phrases at all
[2] Specific scope (mentions concrete tools, languages, or operations)
GOOD: "SQLite/PostgreSQL/MySQL — schema design, queries, CTEs, window functions"
BAD: "Database stuff"
[1] Reasonable length (50-200 characters)
TOO SHORT: "Make things" (no search surface)
TOO LONG: 300+ characters (gets truncated)
[1] Contains searchable keywords naturally
GOOD: "cron jobs, systemd timers, scheduling"
BAD: Keywords stuffed unnaturally
Score: __/8
```
#### Metadata audit
```
METADATA SCORING:
[1] emoji is relevant to the skill topic
[1] requires.anyBins lists tools the skill actually uses (not generic tools like "bash")
[1] os array is accurate (don't claim win32 if commands are Linux-only)
[1] JSON is valid (test with a JSON parser)
Score: __/4
```
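The JSON validity item can be checked automatically. A sketch that pulls out the metadata line and parses it (assumes Node is available alongside the required npx, and that the metadata sits on one line as in the example above):
```bash
# Parse the metadata JSON; a parse error exits non-zero and prints the reason
grep '^metadata:' skills/my-skill/SKILL.md \
  | sed 's/^metadata: *//' \
  | node -e 'JSON.parse(require("fs").readFileSync(0, "utf8")); console.log("metadata JSON is valid")'
```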
### Step 3: Content Quality
#### Example density
Count code blocks and total lines:
```
EXAMPLE DENSITY:
Lines: ___
Code blocks: ___
Ratio: 1 code block per ___ lines
TARGET: 1 code block per 8-15 lines
< 8 lines per block: possibly over-fragmented
> 20 lines per block: needs more examples
```
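A quick way to fill in these numbers (a sketch; fences come in open/close pairs, so the count is halved):
```bash
f=skills/my-skill/SKILL.md
lines=$(wc -l < "$f" | tr -d ' ')
blocks=$(( $(grep -c '^```' "$f") / 2 ))
echo "$lines lines, $blocks code blocks"
[ "$blocks" -gt 0 ] && echo "~1 code block per $(( lines / blocks )) lines (target: 8-15)"
```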
#### Example quality
For each code block, check:
```
EXAMPLE QUALITY CHECKLIST:
[ ] Language tag specified (```bash, ```python, etc.)
[ ] Command is syntactically correct
[ ] Output shown in comments where helpful
[ ] Uses realistic values (not foo/bar/baz)
[ ] No placeholder values left (TODO, FIXME, xxx)
[ ] Self-contained (doesn't depend on undefined variables)
OR setup is shown/referenced
[ ] Covers the common case (not just edge cases)
```
Score each example 0-3:
- **0**: Broken or misleading
- **1**: Works but minimal (no output, no context)
- **2**: Good (correct, has output or explanation)
- **3**: Excellent (copy-pasteable, realistic, covers edge case)
#### Section organization
```
ORGANIZATION SCORING:
[2] Organized by task/scenario (not by abstract concept)
GOOD: "## Encode and Decode" → "## Inspect Characters" → "## Convert Formats"
BAD: "## Theory" → "## Types" → "## Advanced"
[2] Most common operations come first
GOOD: Basic usage → Variations → Advanced → Edge cases
BAD: Configuration → Theory → Finally the basic usage
[1] Sections are self-contained (can be used independently)
[1] Consistent depth (not mixing h2 with h4 randomly)
Score: __/6
```
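Listing the headings makes ordering and depth problems visible at a glance (a sketch):
```bash
# Print every heading with its line number; scan for abstract names and depth jumps
grep -n '^#\{1,6\} ' skills/my-skill/SKILL.md
```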
#### Cross-platform accuracy
```
PLATFORM CHECKLIST:
[ ] macOS differences noted where relevant
(sed -i '' vs sed -i, brew vs apt, BSD vs GNU flags)
[ ] Linux distro variations noted (apt vs yum vs pacman)
[ ] Windows compatibility addressed if os includes "win32"
[ ] Tool version assumptions stated (Docker v2 syntax, Python 3.x)
```
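A grep for commonly platform-sensitive commands helps surface the spots that need a macOS/Linux note (a sketch, not an exhaustive list):
```bash
# These flags and tools typically differ between GNU/Linux and BSD/macOS
grep -nE 'sed -i|readlink -f|date -d|stat -c|xargs -d|apt(-get)?|yum|pacman|brew' skills/my-skill/SKILL.md
```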
### Step 4: Actionability Assessment
The core question: can an agent follow these instructions to produce correct results?
```
ACTIONABILITY SCORING:
[3] Instructions are imperative ("Run X", "Create Y")
NOT: "You might consider..." or "It's recommended to..."
[3] Steps are ordered logically (prerequisites before actions)
[2] Error cases addressed (what to do when something fails)
[2] Output/result described (how to verify it worked)
Score: __/10
```
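Non-imperative phrasing can be flagged mechanically; a sketch that surfaces the hedges called out above (expect some false positives):
```bash
grep -inE "you might|you could|you may want|consider|it is recommended|it's recommended" skills/my-skill/SKILL.md
```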
### Step 5: Tips Section Quality
```
TIPS SCORING:
[2] 5-10 tips present
[2] Tips are non-obvious (not "read the documentation")
GOOD: "The number one Makefile bug: spaces instead of tabs"
BAD: "Make sure to test your code"
[2] Tips are specific and actionable
GOOD: "Use flock to prevent overlapping cron runs"
BAD: "Be careful with concurrent execution"
[1] No tips contradict the main content
[1] Tips cover gotchas/footguns specific to this topic
Score: __/8
```
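Counting the tips is a one-liner; a sketch that assumes a "## Tips" heading and "- " bullets:
```bash
awk '/^## Tips/{t=1; next} /^## /{t=0} t && /^- /{n++} END{print n+0, "tips found (target: 5-10)"}' skills/my-skill/SKILL.md
```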
## Scoring Summary
```
SKILL REVIEW SCORECARD
═══════════════════════════════════════
Skill: [name]
Reviewer: [agent/human]
Date: [date]
Category              Score    Max
─────────────────────────────────────
Structure               __      11
Description             __       8
Metadata                __       4
Example density         __       3*
Example quality         __       3*
Organization            __       6
Actionability           __      10
Tips                    __       8
─────────────────────────────────────
TOTAL                   __      53
* Example density and quality are per-sample,
not summed. Use the average across all examples.
RATING:
45+ Excellent — publish-ready
35-44 Good — minor improvements needed
25-34 Fair — significant gaps to address
< 25 Poor — needs major rework
VERDICT: [PUBLISH / REVISE / REWORK]
```
## Common Defects
### Critical (blocks publishing)
```
DEFECT: Invalid frontmatter
DETECT: YAML parse error, missing required fields
FIX: Validate YAML, ensure name/description/metadata all present
DEFECT: Broken code examples
DETECT: Syntax errors, undefined variables, wrong flags
FIX: Test every command in a clean environment
DEFECT: Wrong tool requirements
DETECT: metadata.requires lists tools not used in content, or omits tools that are used
FIX: Grep content for command names, update requires to match
DEFECT: Misleading description
DETECT: Description promises coverage the content doesn't deliver
FIX: Align description with actual content, or add missing content
```
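For the wrong-tool-requirements defect, comparing the declared bins against commands that actually appear in bash examples narrows the search. A rough sketch (it only looks at the first word of each non-comment line inside bash fences):
```bash
# Declared requirements
grep -oE '"anyBins": *\[[^]]*\]' skills/my-skill/SKILL.md
# Commands used in bash examples
awk '/^```bash/{b=1; next} /^```/{b=0} b && !/^#/ && NF {print $1}' skills/my-skill/SKILL.md | sort -u
```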
### Major (should fix before publishing)
```
DEFECT: No "When to Use" section
IMPACT: Agent doesn't know when to activate the skill
FIX: Add 4-8 bullet points describing trigger scenarios
DEFECT: Text walls without examples
DETECT: Any section > 10 lines with no code block
FIX: Add concrete examples for every concept described
DEFECT: Examples missing language tags
DETECT: ``` without language identifier
FIX: Add bash, python, javascript, yaml, etc. to every code fence
DEFECT: No Tips section
IMPACT: Missing the distilled expertise that makes a skill valuable
FIX: Add 5-10 non-obvious, actionable tips
DEFECT: Abstract organization
DETECT: Sections named "Theory", "Overview", "Background", "Introduction"
FIX: Reorganize by task/operation: what the user is trying to DO
```
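The missing-language-tag defect is a one-line check; a sketch (closing fences also match, so inspect each hit in context: roughly every second bare fence is an untagged opening):
```bash
grep -n '^```$' skills/my-skill/SKILL.md
```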
### Minor (nice to fix)
```
DEFECT: Placeholder values
DETECT: foo, bar, baz, example.com, 1.2.3.4, TODO, FIXME
FIX: Replace with realistic values (myapp, 192.168.1.100, plausible paths and hostnames)
DEFECT: Inconsistent formatting
DETECT: Mixed heading levels, inconsistent code block style
FIX: Standardize heading hierarchy and formatting
DEFECT: Missing cross-references
DETECT: Mentions tools/concepts covered by other skills without referencing them
FIX: Add "See the X skill for more on Y" notes
DEFECT: Outdated commands
DETECT: docker-compose (v1), python (not python3), npm -g without npx alternative
FIX: Update to current tool versions and syntax
```
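A combined scan for the placeholder and outdated-command patterns above (a sketch; review each hit by hand):
```bash
# Placeholder values (word-boundary match to limit false positives)
grep -nwE 'foo|bar|baz|xxx|TODO|FIXME' skills/my-skill/SKILL.md
# v1 docker-compose invocations; check bare "python" vs "python3" manually
grep -n 'docker-compose ' skills/my-skill/SKILL.md
```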
## Comparative Review
When comparing skills in the same category:
```
COMPARATIVE CRITERIA:
1. Coverage breadth
Which skill covers more use cases?
2. Example quality
Which has more runnable, realistic examples?
3. Depth on common operations
Which handles the 80% case better?
4. Edge case coverage
Which addresses more gotchas and failure modes?
5. Cross-platform support
Which works across more environments?
6. Freshness
Which uses current tool versions and syntax?
WINNER: [skill A / skill B / tie]
REASON: [1-2 sentence justification]
```
## Quick Review Template
For a fast review when you don't need full scoring:
```markdown
## Quick Review: [skill-name]
**Structure**: [OK / Issues: ...]
**Description**: [Strong / Weak: reason]
**Examples**: [X code blocks across Y lines — density OK/low/high]
**Actionability**: [Agent can/cannot follow these instructions because...]
**Top defect**: [The single most impactful thing to fix]
**Verdict**: [PUBLISH / REVISE / REWORK]
```
## Review Workflow
### Reviewing your own skill before publishing
```bash
# 1. Validate frontmatter
head -20 skills/my-skill/SKILL.md
# Visually confirm YAML is valid
# 2. Count code blocks (this counts fence lines, so halve it for the block count)
grep -c '^```' skills/my-skill/SKILL.md
# Divide total lines by the block count to get example density
... (truncated)