Signal Hunter

By fellis

Signal Hunter - AI/ML market intelligence plugin for OpenClaw. Monitors GitHub, Reddit, HN, StackOverflow. Semantic search via Qdrant + bge-m3.


README

# Signal Hunter - OpenClaw Plugin

Market intelligence for AI/ML builders. Monitors GitHub, Reddit, Hacker News and Stack Overflow for signals: developer pain points, feature requests, tool comparisons. Manages everything through a chat interface via [OpenClaw](https://github.com/openclaw).

---

## What it does

You type a keyword ("RAG", "ollama", "LangChain") in chat. Signal Hunter:

1. **Discovers** where the topic is discussed (repos, subreddits, SO tags) via real API calls
2. **Proposes a collection plan** - which repos/queries to monitor
3. **Collects incrementally** (GitHub issues, Reddit posts, HN threads, SO questions) using cursors
4. **Classifies** every signal with a local LLM using your extraction rules (pain points, feature requests, comparisons, adoption...)
5. **Embeds** relevant signals into Qdrant with `bge-m3`
6. **Answers questions** in natural language: "what are the top complaints about RAG retrieval this month?"
7. **Generates change reports** - weekly/monthly deltas with what's new and what grew

Everything runs on your own VPS or machine; the only outbound traffic is the API calls to GitHub/Reddit/HN/Stack Overflow and the LLM providers you configure.

---

## Stack

| Component | Role |
|---|---|
| **Python 3.11+** | Core skill logic |
| **TypeScript** | OpenClaw plugin adapter (thin wrapper) |
| **PostgreSQL 16** | Structured storage: signals, profiles, cursors, LLM cost log |
| **Qdrant** | Vector search (cosine, 1024 dims) |
| **BAAI/bge-m3** | Cross-lingual embeddings via `sentence-transformers` |
| **Local LLM** | Classification, rule suggestions (OpenAI-compatible endpoint) |
| **Claude (Anthropic)** | Queries, resolution strategy (configurable) |
| **Docker Compose** | PostgreSQL + Qdrant services |

---

## Architecture

```
OpenClaw chat
     │
     ▼
src/index.ts          ← OpenClaw plugin entry (register tools + /sh command)
src/tools.ts          ← 22 tool definitions (thin TS wrappers)
src/runner.ts         ← spawns: python -m skill <command> [args]
     │
     ▼  JSON via stdout
skill/main.py         ← CLI dispatcher (22 commands)
     │
     ├── core/resolver.py      ← keyword discovery + LLM enrichment
     ├── core/orchestrator.py  ← collect → process → embed pipeline
     ├── core/processor.py     ← LLM classification (token-aware batching)
     ├── core/embedder.py      ← bge-m3 → Qdrant (Outbox pattern)
     ├── core/llm_router.py    ← routes ops to local/Claude by config
     │
     ├── collectors/
     │   ├── github.py         ← GitHub Issues (repo-scoped, cursor on updated_at)
     │   ├── reddit.py         ← Reddit JSON API (no auth for public subs)
     │   ├── hackernews.py     ← Algolia HN API (no auth)
     │   └── stackoverflow.py  ← Stack Exchange API v2.3
     │
     └── storage/
         ├── postgres.py       ← all SQL (raw_signals, processed_signals, cursors...)
         ├── vector.py         ← Qdrant wrapper
         └── config_manager.py ← atomic config.json writes (temp file + rename)
```
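The atomic `config.json` write in `config_manager.py` is the classic temp-file-plus-rename trick. A minimal sketch of the idea (function name hypothetical, not the plugin's actual code):

```python
import json
import os
import tempfile

def save_config_atomic(path: str, config: dict) -> None:
    """Write a JSON config atomically: readers see either the old file
    or the new one, never a half-written state."""
    dir_ = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the SAME directory, so the final rename
    # stays on one filesystem (os.replace is only atomic there).
    fd, tmp = tempfile.mkstemp(dir=dir_, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(config, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp)
        raise
```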

**Design principles:**
- Each collector is a self-contained module implementing `BaseCollector`
- Business logic stays in Python; TypeScript only handles IPC
- Discovery-first: LLM enriches only facts confirmed by API calls, never guesses
- Token-aware batching for LLM classification (validated: ~20K tokens per batch)
- Outbox pattern for embedding queue (PostgreSQL → Qdrant, crash-safe)
- Anti-hallucination gate on query answers: URLs not in source data are stripped
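
The anti-hallucination gate can be illustrated in a few lines (a sketch, not the plugin's actual code): any URL in the LLM's answer that does not appear verbatim in the retrieved signals is dropped.

```python
import re

URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

def strip_unknown_urls(answer: str, source_texts: list[str]) -> str:
    """Remove URLs from an LLM answer unless they appear verbatim in the
    source signals -- a simple guard against hallucinated links."""
    allowed = {u for text in source_texts for u in URL_RE.findall(text)}
    return URL_RE.sub(
        lambda m: m.group(0) if m.group(0) in allowed else "[link removed]",
        answer,
    )
```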

---

## Prerequisites

- VPS or local machine with Python 3.11+
- Docker + Docker Compose (for PostgreSQL and Qdrant)
- [OpenClaw](https://github.com/openclaw) installed
- Local LLM with OpenAI-compatible API (e.g. [Ollama](https://ollama.com) with Devstral, Mistral, etc.)
- Anthropic API key (for queries and keyword resolution strategy)

---

## Installation

### 1. Clone and place the plugin

```bash
git clone https://github.com/fellis/openclaw-signal-hunter.git
cd openclaw-signal-hunter
```

Place (or symlink) the directory into your OpenClaw extensions folder:

```bash
ln -s $(pwd) ~/.openclaw/extensions/signal-hunter
```

### 2. Configure environment

```bash
cp .env.example .env
```

Edit `.env`:

```env
# PostgreSQL (docker-compose exposes 5433 on host)
DATABASE_URL=postgresql://signal:signal@localhost:5433/signal_hunter

# Qdrant
QDRANT_URL=http://localhost:6333

# GitHub (optional - public rate limit 60 req/hr works for testing)
GITHUB_TOKEN=ghp_your_token_here

# Local LLM (OpenAI-compatible endpoint)
LOCAL_LLM_BASE_URL=http://localhost:11434/v1
LOCAL_LLM_API_KEY=local
LOCAL_LLM_MODEL=devstral

# Anthropic Claude
ANTHROPIC_API_KEY=sk-ant-your_key_here
```

### 3. Start infrastructure

```bash
docker compose up -d
```

This starts PostgreSQL 16 (port 5433) and Qdrant (port 6333). The schema is applied automatically on first start.

### 4. Install Python dependencies

```bash
pip install -r requirements.txt
```

### 5. Register the plugin in OpenClaw

Add to your `~/.openclaw/openclaw.json` under `plugins`:

```json
{
  "id": "signal-hunter",
  "path": "~/.openclaw/extensions/signal-hunter",
  "config": {
    "pythonBin": "python3",
    "skillDir": null
  }
}
```

Reload OpenClaw. The plugin registers 22 tools and the `/sh` slash command.

---

## Usage walkthrough

Here is a complete scenario: tracking developer pain points around **RAG** and **ollama**.

### Step 1 - Check source readiness

```
You: source status
```

```
ClawBot:
✓ hackernews  - no auth required
✗ github      - no GITHUB_TOKEN
✗ reddit      - public API ready, ~10 req/min
✗ stackoverflow - 300 req/day without key
```

```
You: how do I set up GitHub?
```

```
ClawBot:
1. Go to github.com → Settings → Developer settings → Personal access tokens → Tokens (classic)
2. Click "Generate new token (classic)"
3. Name: signal-hunter, Scopes: public_repo
4. Copy the token and type: github token=<your_token>
```

```
You: github token=ghp_xxxxxxxxxxxxxxxxxxx
```

```
ClawBot: ✓ GitHub ready. Limit: 5000 req/hr
```

---

### Step 2 - Configure LLM providers

```
You: show providers
```

```
ClawBot:
Provider | Type           | Model             | Operations
---------|----------------|-------------------|---------------------------
local    | openai_compat  | devstral          | process, suggest_rules, resolve_enrich
claude   | anthropic      | claude-sonnet-4-5 | resolve_strategy, query
```

To route an operation to a different provider:

```
You: use claude for classification
```

```
ClawBot: ✓ llm_routing.process → claude
```

---

### Step 3 - Add keywords and discover

```
You: start monitoring RAG, ollama, LangChain
```

Signal Hunter calls each source API (no LLM guessing) and returns discovered resources:

```
ClawBot:
DISCOVERY: RAG
  GitHub: langchain-ai/langchain (88k stars), run-llama/llama_index (35k)
  Reddit: r/LocalLLaMA (245k subs), r/LangChain (42k)
  HN: 89 threads in last 90 days, avg score 124
  SO: 1820 questions with tag [rag]

DISCOVERY: ollama
  GitHub: ollama/ollama (85k stars, 2340 open issues)
  Reddit: r/ollama (52k subs), r/LocalLLaMA

Proposed collection plan:
  RAG: github(langchain-ai/langchain, run-llama/llama_index), reddit(r/LocalLLaMA, r/LangChain), hn(search "RAG"), so(tag:rag)
  ollama: github(ollama/ollama), reddit(r/ollama, r/LocalLLaMA), hn, so

Approve all? Or adjust targets?
```

```
You: for RAG remove MachineLearning - too broad. Approve the rest.
```

```
ClawBot: ✓ Plans saved for 3 keywords.
```

---

### Step 4 - Collect

```
You: collect
```

```
ClawBot: Running in background... (first run: 10-20 min for 90 days of history)
```

```
ClawBot:
✓ Collection done:

Keyword    | GitHub | Reddit | HN  | SO  | Total
-----------|--------|--------|-----|-----|------
RAG        | 400    | 312    | 89  | 200 | 1001
ollama     | 200    | 287    | 43  | 98  | 628
LangChain  | 200    | 198    | 67  | 200 | 665

2294 raw signals in database.
```

---

### Step 5 - Suggest and approve classification rules

```
You: suggest rules for RAG
```

The LLM analyzes actual posts from the database and proposes rules grounded in real content:

```
ClawBot:
Suggested extraction_rules for "RAG" (from 1001 real posts):

1. retrieval_failure (priority 5)
   When retrieval returns irrelevant docs or causes hallucinations
   Real examples: "RAG keeps returning wrong documents even with good embeddings"
                  "chunks are too large and context gets diluted"

2. chunking_strategy (priority 4)
   Questions and pain around document splitting strategies
   Real examples: "what's the optimal chunk size for technical docs?"
                  "sliding window chunking loses context between chunks"

3. embedding_model_choice (priority 4)
   Comparisons and selection of embedding models
   Real examples: "OpenAI ada vs bge-m3 for multilingual RAG"

4. reranking (priority 3)
   Requests and pain around the reranking step
   Real examples: "cohere reranker doubles cost but worth it?"

5. evaluation_difficulty (priority 3)
   Difficulty evaluating RAG pipeline quality
   Real examples: "how do you even measure RAG quality without ground truth?"

Approve all?
```

```
You: approve, but rename evaluation_difficulty to rag_eval
```

```
ClawBot: ✓ 5 rules saved to config.json
```

---

### Step 6 - Process (LLM classification)

```
You: process
```

```
ClawBot: Classifying 2294 signals with local LLM... (token-aware batching, ~20K tokens/batch)
✓ Done. 1847 classified (relevant: 934, irrelevant: 913)
```
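
Token-aware batching greedily packs signals into batches that stay under a token budget. Roughly (a sketch; the crude 4-characters-per-token estimate is a placeholder for the model's real tokenizer):

```python
def batch_by_tokens(texts: list[str], budget: int = 20_000) -> list[list[str]]:
    """Greedily pack texts into batches whose estimated token count stays
    within `budget`, so each LLM call fits comfortably in context."""
    estimate = lambda t: max(1, len(t) // 4)  # rough chars-to-tokens ratio
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for text in texts:
        cost = estimate(text)
        if current and used + cost > budget:
            batches.append(current)  # budget exceeded: start a new batch
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches
```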

Then embed for semantic search:

```
You: embed
```

```
ClawBot: ✓ 934 vectors indexed into Qdrant (bge-m3, 1024 dims)
```
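
Schematically, the embedder pairs each relevant signal with its vector and upserts it as a point (the point-id scheme and payload fields below are hypothetical; the commented lines show the assumed sentence-transformers + qdrant-client pairing):

```python
import uuid

def to_points(signals: list[dict], vectors: list[list[float]]) -> list[dict]:
    """Pair classified signals with their embedding vectors as Qdrant-style
    points. Ids are derived from the source URL, so re-embedding the same
    signal overwrites the old point instead of duplicating it."""
    assert len(signals) == len(vectors)
    return [
        {
            "id": str(uuid.uuid5(uuid.NAMESPACE_URL, sig["url"])),
            "vector": vec,
            "payload": {"url": sig["url"], "keyword": sig["keyword"]},
        }
        for sig, vec in zip(signals, vectors)
    ]

# With the real stack, roughly:
#   model = SentenceTransformer("BAAI/bge-m3")            # 1024-dim output
#   vectors = model.encode([s["text"] for s in signals])
#   client.upsert("signals", points=[models.PointStruct(**p)
#                                    for p in to_points(signals, vectors)])
```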

---

### Step 7 - Query the database

```
You: what are the top pain points with RAG retrieval this month?
```

Signal Hunter embeds the question, searches Qdrant for the most similar signals, re-ranks by `rank_score` (engagement + quality + time decay), then uses Claude to synthesize:
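
A re-ranking score blending those three factors could look like this (a hypothetical weighting, not the plugin's actual formula):

```python
import math

def rank_score(similarity: float, engagement: int, quality: float,
               age_days: float, half_life_days: float = 30.0) -> float:
    """Blend vector similarity with engagement, quality, and recency.
    Engagement is log-damped so one viral thread can't dominate;
    recency decays exponentially with a configurable half-life."""
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * (1.0 + math.log1p(engagement)) * (0.5 + 0.5 * quality) * decay
```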

```
ClawBot:
Top RAG retrieval pain points (from 23 signals, last 30 days):

1. **Context dilution with large chunks** (14 signals)
   Developers re

... (truncated)