# AnyCrawl Skill

AnyCrawl API integration for OpenClaw - Scrape, Crawl, and Search web content with high-performance multi-threaded crawling.

## Setup

### Method 1: Environment variable (Recommended)

```bash
export ANYCRAWL_API_KEY="your-api-key"
```

To make it permanent, add it to `~/.bashrc` or `~/.zshrc`:
```bash
echo 'export ANYCRAWL_API_KEY="your-api-key"' >> ~/.bashrc
source ~/.bashrc
```

Get your API key at: https://anycrawl.dev

### Method 2: OpenClaw gateway config

```bash
openclaw config.patch --set ANYCRAWL_API_KEY="your-api-key"
```

## Functions

### 1. anycrawl_scrape

Scrape a single URL and convert it to LLM-ready structured data.

**Parameters:**
- `url` (string, required): URL to scrape
- `engine` (string, optional): Scraping engine - `"cheerio"` (default), `"playwright"`, `"puppeteer"`
- `formats` (array, optional): Output formats; any of `"markdown"`, `"html"`, `"text"`, `"json"`, `"screenshot"` (e.g., `["markdown", "screenshot"]`)
- `timeout` (number, optional): Timeout in milliseconds (default: 30000)
- `wait_for` (number, optional): Delay before extraction in ms (browser engines only)
- `wait_for_selector` (string/object/array, optional): CSS selector(s) to wait for before extracting content
- `include_tags` (array, optional): Include only these HTML tags (e.g., `["h1", "p", "article"]`)
- `exclude_tags` (array, optional): Exclude these HTML tags
- `proxy` (string, optional): Proxy URL (e.g., `"http://proxy:port"`)
- `json_options` (object, optional): JSON extraction with schema/prompt
- `extract_source` (string, optional): `"markdown"` (default) or `"html"`

**Examples:**

```javascript
// Basic scrape with default cheerio
anycrawl_scrape({ url: "https://example.com" })

// Scrape SPA with Playwright
anycrawl_scrape({ 
  url: "https://spa-example.com",
  engine: "playwright",
  formats: ["markdown", "screenshot"]
})

// Extract structured JSON
anycrawl_scrape({
  url: "https://product-page.com",
  engine: "cheerio",
  json_options: {
    schema: {
      type: "object",
      properties: {
        product_name: { type: "string" },
        price: { type: "number" },
        description: { type: "string" }
      },
      required: ["product_name", "price"]
    },
    user_prompt: "Extract product details from this page"
  }
})
```
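
Pages that render content late can combine `wait_for_selector` with tag filtering to keep the output focused. The URL, selector, and tag lists below are illustrative, not prescribed values:

```javascript
// Wait for late-rendered content, then keep only the article body
anycrawl_scrape({
  url: "https://news-example.com/article",
  engine: "playwright",
  wait_for_selector: "article",
  include_tags: ["h1", "p", "article"],
  exclude_tags: ["nav", "footer"],
  formats: ["markdown"]
})
```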

### 2. anycrawl_search

Search Google and return structured results.

**Parameters:**
- `query` (string, required): Search query
- `engine` (string, optional): Search engine - `"google"` (default)
- `limit` (number, optional): Max results per page (default: 10)
- `offset` (number, optional): Number of results to skip (default: 0)
- `pages` (number, optional): Number of pages to retrieve (default: 1, max: 20)
- `lang` (string, optional): Language locale (e.g., `"en"`, `"zh"`, `"vi"`)
- `safe_search` (number, optional): 0 (off), 1 (medium), 2 (high)
- `scrape_options` (object, optional): Scrape each result URL with these options

**Examples:**

```javascript
// Basic search
anycrawl_search({ query: "OpenAI ChatGPT" })

// Multi-page search in Vietnamese
anycrawl_search({ 
  query: "hướng dẫn Node.js",
  pages: 3,
  lang: "vi"
})

// Search and auto-scrape results
anycrawl_search({
  query: "best AI tools 2026",
  limit: 5,
  scrape_options: {
    engine: "cheerio",
    formats: ["markdown"]
  }
})
```
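
The paging parameters compose. For example, fetching the second page of ten results with strict filtering (all values taken from the parameter list above):

```javascript
// Second page of 10 results, strict safe search
anycrawl_search({
  query: "machine learning tutorials",
  limit: 10,
  offset: 10,
  safe_search: 2
})
```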

### 3. anycrawl_crawl_start

Start crawling an entire website (async job).

**Parameters:**
- `url` (string, required): Seed URL to start crawling
- `engine` (string, optional): `"cheerio"` (default), `"playwright"`, `"puppeteer"`
- `strategy` (string, optional): `"all"`, `"same-domain"` (default), `"same-hostname"`, `"same-origin"`
- `max_depth` (number, optional): Max depth from seed URL (default: 10)
- `limit` (number, optional): Max pages to crawl (default: 100)
- `include_paths` (array, optional): Path patterns to include (e.g., `["/blog/*"]`)
- `exclude_paths` (array, optional): Path patterns to exclude (e.g., `["/admin/*"]`)
- `scrape_paths` (array, optional): Only scrape URLs matching these patterns
- `scrape_options` (object, optional): Per-page scrape options

**Examples:**

```javascript
// Crawl entire website
anycrawl_crawl_start({ 
  url: "https://docs.example.com",
  engine: "cheerio",
  max_depth: 5,
  limit: 50
})

// Crawl only blog posts
anycrawl_crawl_start({
  url: "https://example.com",
  strategy: "same-domain",
  include_paths: ["/blog/*"],
  exclude_paths: ["/blog/tags/*"],
  scrape_options: {
    formats: ["markdown"]
  }
})

// Crawl product pages only
anycrawl_crawl_start({
  url: "https://shop.example.com",
  strategy: "same-domain",
  scrape_paths: ["/products/*"],
  limit: 200
})
```

### 4. anycrawl_crawl_status

Check crawl job status.

**Parameters:**
- `job_id` (string, required): Crawl job ID

**Example:**
```javascript
anycrawl_crawl_status({ job_id: "7a2e165d-8f81-4be6-9ef7-23222330a396" })
```
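
Because a crawl is an async job, a typical workflow polls the status until the job finishes, then fetches results. The sketch below assumes these functions return the JSON responses described under Response Format and that the status payload exposes a `status` field that reaches `"completed"`; verify the exact field names in the API docs.

```javascript
// Hypothetical polling loop: `data.status` and the "completed"
// value are assumptions; check the API docs for the exact names.
async function waitForCrawl(jobId, intervalMs = 5000) {
  while (true) {
    const res = await anycrawl_crawl_status({ job_id: jobId });
    if (res.data.status === "completed") return res.data;
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
}
```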

### 5. anycrawl_crawl_results

Get crawl results (paginated).

**Parameters:**
- `job_id` (string, required): Crawl job ID
- `skip` (number, optional): Number of results to skip (default: 0)

**Example:**
```javascript
// Get first 100 results
anycrawl_crawl_results({ job_id: "xxx", skip: 0 })

// Get next 100 results
anycrawl_crawl_results({ job_id: "xxx", skip: 100 })
```
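
To collect a full crawl, page through with `skip` until a page comes back empty. The 100-item page size and the array-valued `data` field are assumptions inferred from the example above:

```javascript
// Page through all crawl results; the 100-item page size and the
// array shape of `data` are assumptions, not documented facts.
async function getAllResults(jobId) {
  const all = [];
  for (let skip = 0; ; skip += 100) {
    const res = await anycrawl_crawl_results({ job_id: jobId, skip });
    if (!res.data || res.data.length === 0) break;
    all.push(...res.data);
  }
  return all;
}
```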

### 6. anycrawl_crawl_cancel

Cancel a running crawl job.

**Parameters:**
- `job_id` (string, required): Crawl job ID
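
**Example:**
```javascript
anycrawl_crawl_cancel({ job_id: "xxx" })
```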

### 7. anycrawl_search_and_scrape

Quick helper: search Google, then scrape the top results in one call.

**Parameters:**
- `query` (string, required): Search query
- `max_results` (number, optional): Max results to scrape (default: 3)
- `scrape_engine` (string, optional): Engine for scraping (default: `"cheerio"`)
- `formats` (array, optional): Output formats (default: `["markdown"]`)
- `lang` (string, optional): Search language

**Example:**
```javascript
anycrawl_search_and_scrape({
  query: "latest AI news",
  max_results: 5,
  formats: ["markdown"]
})
```

## Engine Selection Guide

| Engine | Best For | Speed | JS Rendering |
|--------|----------|-------|--------------|
| `cheerio` | Static HTML, news, blogs | ⚡ Fastest | ❌ No |
| `playwright` | SPAs, complex web apps | 🐢 Slower | ✅ Yes |
| `puppeteer` | Chrome-specific, metrics | 🐢 Slower | ✅ Yes |
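
A practical pattern is to try `cheerio` first and fall back to a browser engine only when the static HTML comes back empty. This sketch assumes the response shape documented under Response Format below; `data.markdown` as the field holding markdown output is an assumption:

```javascript
// Try the fast static engine first; fall back to Playwright when
// the page appears to need JS rendering. `data.markdown` is an
// assumed field name; adjust to the actual response shape.
async function scrapeWithFallback(url) {
  const fast = await anycrawl_scrape({ url, engine: "cheerio" });
  if (fast.success && fast.data.markdown?.trim()) return fast;
  return anycrawl_scrape({ url, engine: "playwright" });
}
```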

## Response Format

All responses follow this structure:

```json
{
  "success": true,
  "data": { ... },
  "message": "Optional message"
}
```

Error response:
```json
{
  "success": false,
  "error": "Error type",
  "message": "Human-readable message"
}
```

## Common Error Codes

- `400` - Bad Request (validation errors)
- `401` - Unauthorized (invalid API key)
- `402` - Payment Required (insufficient credits)
- `404` - Not Found
- `429` - Rate limit exceeded
- `500` - Internal server error
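
Since errors come back in the `{ success: false, ... }` shape above, a caller can branch on `success` and back off on rate limits. A minimal sketch; detecting `429` via the `error` text is an assumption, as only the field names above are documented:

```javascript
// Retry with exponential backoff on rate limits. Matching 429 in
// the `error` text is an assumption; only `error` and `message`
// are documented fields.
async function scrapeWithRetry(url, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    const res = await anycrawl_scrape({ url });
    if (res.success) return res;
    if (!/429|rate limit/i.test(res.error)) throw new Error(res.message);
    await new Promise(r => setTimeout(r, 1000 * 2 ** i)); // back off
  }
  throw new Error("rate limited after retries");
}
```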

## API Limits

- Rate limits apply based on your plan
- Crawl jobs expire after 24 hours
- Maximum crawl size depends on your available credits

## Links

- API Docs: https://docs.anycrawl.dev
- Website: https://anycrawl.dev
- Playground: https://anycrawl.dev/playground
