← Back to Skills
Media

auto-whisper-safe

neal-collab By neal-collab 👁 27 views ▲ 0 votes

RAM-safe voice transcription with auto-chunking โ€” works on 16GB machines without crashes.

GitHub
---
name: auto-whisper-safe
version: 1.0.0
description: RAM-safe voice transcription with auto-chunking โ€” works on 16GB machines without crashes
emoji: ๐ŸŽ™๏ธ
tags:
  - whisper
  - transcription
  - voice
  - audio
  - ram-safe
requires:
  bins:
    - whisper
    - ffmpeg
---

# Auto-Whisper Safe โ€” RAM-Friendly Voice Transcription

Transcribe voice messages and long audio files using OpenAI Whisper **without crashing your machine**. Designed for 16GB RAM systems running other processes (like OpenClaw agents).

## The Problem

Whisper's `turbo` and `large` models use 6-10GB RAM. On a 16GB machine running OpenClaw + Ollama + other services, this causes OOM crashes. Existing Whisper skills don't handle this.

## The Solution

1. **Auto-detects audio length** via ffprobe
2. **Splits long audio** (>10min) into 10-min chunks automatically
3. **Uses `base` model** by default (~1.5GB RAM โ€” safe on any 16GB machine)
4. **Merges transcripts** seamlessly โ€” no gaps, no duplicates
5. **Cleans up** temp files automatically

## Usage

```bash
# Basic usage
./transcribe.sh /path/to/audio.ogg

# Custom model (if you have more RAM)
WHISPER_MODEL=small ./transcribe.sh /path/to/audio.ogg

# Custom language
WHISPER_LANG=en ./transcribe.sh /path/to/audio.ogg

# Custom output directory
./transcribe.sh /path/to/audio.ogg /path/to/output/
```

## RAM Usage by Model

| Model | RAM | Speed | Accuracy | Recommended For |
|-------|-----|-------|----------|-----------------|
| `tiny` | ~1GB | โšกโšกโšก | โ˜…โ˜… | Quick previews, low-RAM systems |
| `base` | ~1.5GB | โšกโšก | โ˜…โ˜…โ˜… | **Default โ€” best balance** โœ… |
| `small` | ~2.5GB | โšก | โ˜…โ˜…โ˜…โ˜… | When accuracy matters more |
| `medium` | ~5GB | ๐Ÿข | โ˜…โ˜…โ˜…โ˜…โ˜… | 32GB+ RAM only |
| `turbo` | ~6GB | ๐Ÿข๐Ÿข | โ˜…โ˜…โ˜…โ˜…โ˜… | Dedicated transcription machines |

## OpenClaw Integration

Add to your agent's `BOOTSTRAP.md`:

```markdown
## Voice Message Handling

When you receive `<media:audio>`, ALWAYS transcribe first:

1. Run: `./skills/auto-whisper-safe/transcribe.sh <audio-path>`
2. Read the output transcript file
3. Respond based on the transcribed content

Do this automatically โ€” voice messages are meant to be transcribed.
```

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `WHISPER_MODEL` | `base` | Whisper model size |
| `WHISPER_LANG` | `en` | Audio language (ISO code) |

## How Chunking Works

- Audio โ‰ค10min โ†’ transcribed directly (no splitting)
- Audio >10min โ†’ split into 10-min segments via ffmpeg
- Each segment transcribed independently
- Transcripts concatenated in order
- Temp files cleaned up on exit (even on errors)

## Installation

```bash
# macOS
brew install openai-whisper ffmpeg

# Ubuntu/Debian
pip install openai-whisper
apt install ffmpeg

# Verify
whisper --help && ffmpeg -version
```

## Why This Over Other Whisper Skills

- โœ… **RAM-safe**: Won't crash your 16GB machine
- โœ… **Auto-chunking**: Handles 1-hour podcasts without issues
- โœ… **Cleanup**: No temp files left behind
- โœ… **Progress**: Shows chunk-by-chunk progress
- โœ… **Configurable**: Model + language via env vars
- โœ… **OpenClaw-native**: Drop-in for any agent's BOOTSTRAP.md

## Real-World Performance

Tested on Ubuntu 22.04, 16GB RAM, running OpenClaw (10 agents) + Ollama simultaneously:

| Audio Length | Model | RAM Peak | Time | Result |
|-------------|-------|----------|------|--------|
| 2 min voice memo | base | 1.4GB | ~15s | โœ… Perfect |
| 12 min podcast clip | base | 1.5GB (chunked) | ~90s | โœ… 2 chunks, seamless |
| 45 min interview | base | 1.5GB (chunked) | ~6min | โœ… 5 chunks, seamless |
| 2 min voice memo | tiny | 0.9GB | ~8s | โœ… Good enough for quick reads |

## Supported Audio Formats

ffmpeg handles the conversion, so virtually any format works:
- โœ… `.ogg` (Telegram voice messages)
- โœ… `.mp3`, `.m4a`, `.wav`, `.flac`
- โœ… `.webm` (browser recordings)
- โœ… `.opus` (WhatsApp voice messages)

## Changelog

### v1.0.0
- Initial release
- Auto-chunking for long audio (>10min)
- RAM-safe defaults (base model, 1.5GB)
- Progress tracking per chunk
- Automatic temp file cleanup
- Configurable model and language
media

Comments

Sign in to leave a comment

Loading comments...