
atl-mobile

By jordancoin

Mobile browser and native app automation via ATL (iOS Simulator).

---
name: atl-browser
description: Mobile browser and native app automation via ATL (iOS Simulator). Navigate, click, screenshot, and automate web and native app tasks on iPhone/iPad simulators.
metadata:
  openclaw:
    emoji: "📱"
    requires:
      bins: ["xcrun", "xcodebuild", "curl"]
    install:
      - id: "atl-clone"
        kind: "shell"
        command: "git clone https://github.com/JordanCoin/Atl ~/Atl"
        label: "Clone ATL repository"
      - id: "atl-setup"
        kind: "shell" 
        command: "~/.openclaw/skills/atl-browser/scripts/setup.sh"
        label: "Build and install ATL to simulator"
---

# ATL — Agent Touch Layer

> The automation layer between AI agents and iOS

ATL provides HTTP-based automation for iOS Simulator — both **browser** (mobile Safari) and **native apps**. Think Playwright, but for mobile.

## 🔀 Two Servers: Browser & Native

ATL uses **two separate servers** for browser and native app automation:

| Server | Port | Use Case | Key Commands |
|--------|------|----------|--------------|
| **Browser** | `9222` | Web automation in mobile Safari | `goto`, `markElements`, `clickMark`, `evaluate` |
| **Native** | `9223` | iOS app automation (Settings, Contacts, any app) | `openApp`, `snapshot`, `tapRef`, `find` |

```
┌─────────────────────────────────────────────────────────────┐
│  BROWSER SERVER (9222)     │     NATIVE SERVER (9223)      │
│  (mobile Safari/WebView)   │     (iOS apps via XCTest)     │
│                            │                                │
│  markElements + clickMark  │     snapshot + tapRef         │
│  CSS selectors             │     accessibility tree        │
│  DOM evaluation            │     element references        │
│  tap, swipe, screenshot    │     tap, swipe, screenshot    │
└─────────────────────────────────────────────────────────────┘
```

**Why two ports?** Native app automation requires the XCTest UI APIs (`XCUIApplication`, `XCUIElement`), which are available only inside UI test bundles. The native server therefore runs as a UI test that exposes an HTTP API on its own port.

### Starting the Servers

```bash
# Browser server (starts automatically with AtlBrowser app)
xcrun simctl launch booted com.atl.browser
curl http://localhost:9222/ping  # → {"status":"ok"}

# Native server (run as UI Test)
cd ~/Atl/core/AtlBrowser
xcodebuild test -workspace AtlBrowser.xcworkspace \
  -scheme AtlBrowser \
  -destination 'id=<SIMULATOR_UDID>' \
  -only-testing:AtlBrowserUITests/NativeServer/testNativeServer &
  
# Wait for it to start, then:
curl http://localhost:9223/ping  # → {"status":"ok","mode":"native"}
```
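
The "wait for it to start" step can be made explicit with a small polling loop. The following is a minimal sketch, not part of ATL: `retry_until` is a hypothetical helper name, and the attempt count and delay are arbitrary choices.

```bash
# retry_until ATTEMPTS DELAY CMD [ARGS...]
# Hypothetical helper (not part of ATL): runs CMD until it succeeds,
# sleeping DELAY seconds between tries, for at most ATTEMPTS tries.
retry_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only if another attempt is coming.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

With this in place, something like `retry_until 60 1 curl -sf http://localhost:9223/ping` would block until the native server answers or 60 attempts have passed. The same call with port `9222` would cover the browser server.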

### Quick Port Reference

| Task | Port | Example |
|------|------|---------|
| Browse websites | 9222 | `curl localhost:9222/command -d '{"method":"goto",...}'` |
| Open native app | 9223 | `curl localhost:9223/command -d '{"method":"openApp",...}'` |
| Screenshot (browser) | 9222 | `curl localhost:9222/command -d '{"method":"screenshot"}'` |
| Screenshot (native) | 9223 | `curl localhost:9223/command -d '{"method":"screenshot"}'` |
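
Since the only difference between the two servers at the HTTP level is the port, the table can be captured in one lookup function. This is a sketch under the assumption that both servers run on `localhost`; `atl_endpoint` is a hypothetical name, not an ATL command.

```bash
# Hypothetical helper (not part of ATL): map a server name to its
# command endpoint, using the ports from the table above.
atl_endpoint() {
  case "$1" in
    browser) echo "http://localhost:9222/command" ;;
    native)  echo "http://localhost:9223/command" ;;
    *)
      echo "unknown server: $1 (expected 'browser' or 'native')" >&2
      return 1
      ;;
  esac
}
```

A call would then look like `curl -s -X POST "$(atl_endpoint native)" -d '{"method":"screenshot"}'`, which keeps port numbers out of the individual commands.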

---

## 📱 Native App Automation (Port 9223)

Native automation uses **port 9223** and automates **any iOS app** using the accessibility tree — no DOM, no JavaScript, just direct element interaction.

### Opening & Closing Apps

```bash
# Open an app by bundle ID
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"openApp","params":{"bundleId":"com.apple.Preferences"}}'
# → {"success":true,"result":{"bundleId":"com.apple.Preferences","mode":"native","state":"running"}}

# Check current app state
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"appState"}'
# → {"success":true,"result":{"mode":"native","bundleId":"com.apple.Preferences","state":"running"}}

# Close current app
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"closeApp"}'
# → {"success":true,"result":{"closed":true}}
```

### Common Bundle IDs

| App | Bundle ID |
|-----|-----------|
| Settings | `com.apple.Preferences` |
| Contacts | `com.apple.MobileAddressBook` |
| Calculator | `com.apple.calculator` |
| Calendar | `com.apple.mobilecal` |
| Photos | `com.apple.mobileslideshow` |
| Notes | `com.apple.mobilenotes` |
| Reminders | `com.apple.reminders` |
| Clock | `com.apple.mobiletimer` |
| Maps | `com.apple.Maps` |
| Safari | `com.apple.mobilesafari` |
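
For scripts that refer to apps by name, the table above can be wrapped in a lookup. A minimal sketch follows; `atl_bundle_id` is a hypothetical helper, and it only knows the ten apps listed here.

```bash
# Hypothetical helper (not part of ATL): look up a bundle ID from the
# table above by a case-insensitive app name.
atl_bundle_id() {
  local name
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    settings)   echo "com.apple.Preferences" ;;
    contacts)   echo "com.apple.MobileAddressBook" ;;
    calculator) echo "com.apple.calculator" ;;
    calendar)   echo "com.apple.mobilecal" ;;
    photos)     echo "com.apple.mobileslideshow" ;;
    notes)      echo "com.apple.mobilenotes" ;;
    reminders)  echo "com.apple.reminders" ;;
    clock)      echo "com.apple.mobiletimer" ;;
    maps)       echo "com.apple.Maps" ;;
    safari)     echo "com.apple.mobilesafari" ;;
    *)          echo "unknown app: $1" >&2; return 1 ;;
  esac
}
```

The result can be substituted into an `openApp` body, for example `printf '{"method":"openApp","params":{"bundleId":"%s"}}' "$(atl_bundle_id Settings)"`.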

### The `snapshot` Command

`snapshot` returns the accessibility tree: all visible elements with their properties and a tappable reference (`ref`) for each.

```bash
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"snapshot","params":{"interactiveOnly":true}}' | jq '.result'
```

**Example output:**
```json
{
  "count": 12,
  "elements": [
    {
      "ref": "e0",
      "type": "cell",
      "label": "Wi-Fi",
      "value": "MyNetwork",
      "identifier": "",
      "x": 0,
      "y": 142,
      "width": 393,
      "height": 44,
      "isHittable": true,
      "isEnabled": true
    },
    {
      "ref": "e1",
      "type": "cell",
      "label": "Bluetooth",
      "value": "On",
      "identifier": "",
      "x": 0,
      "y": 186,
      "width": 393,
      "height": 44,
      "isHittable": true,
      "isEnabled": true
    },
    {
      "ref": "e2",
      "type": "button",
      "label": "Back",
      "value": null,
      "identifier": "Back",
      "x": 0,
      "y": 44,
      "width": 80,
      "height": 44,
      "isHittable": true,
      "isEnabled": true
    }
  ]
}
```

**Parameters:**
- `interactiveOnly` (bool, default: `false`) — Only return hittable elements
- `maxDepth` (int, optional) — Limit tree traversal depth
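
To show how the two parameters combine in a request body, here is a small sketch of a payload builder. `atl_snapshot_payload` is a hypothetical name, and the input validation is an addition for illustration rather than something the server is known to require.

```bash
# Hypothetical helper (not part of ATL): build a snapshot request body.
# $1: "true" or "false" for interactiveOnly (defaults to false)
# $2: optional maxDepth (non-negative integer)
atl_snapshot_payload() {
  local interactive="${1:-false}" depth="${2:-}"
  case "$interactive" in
    true|false) ;;
    *) echo "interactiveOnly must be true or false" >&2; return 1 ;;
  esac
  if [ -z "$depth" ]; then
    # maxDepth omitted: leave it out of the body entirely.
    printf '{"method":"snapshot","params":{"interactiveOnly":%s}}\n' "$interactive"
    return 0
  fi
  case "$depth" in
    *[!0-9]*) echo "maxDepth must be a non-negative integer" >&2; return 1 ;;
  esac
  printf '{"method":"snapshot","params":{"interactiveOnly":%s,"maxDepth":%s}}\n' \
    "$interactive" "$depth"
}
```

The output is meant to be passed to `curl` via `-d "$(atl_snapshot_payload true 3)"`.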

### The `tapRef` Command

Tap an element by its reference from the last `snapshot`:

```bash
# Take snapshot first
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"snapshot","params":{"interactiveOnly":true}}'

# Tap element e0 (Wi-Fi cell from example above)
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"tapRef","params":{"ref":"e0"}}'
# → {"success":true}
```
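
Because refs come from the last snapshot, it is probably safest to take a fresh snapshot whenever the screen changes before tapping again. The sketch below guards against one easy mistake, passing a label where a ref is expected. It assumes refs always have the `e<N>` form seen in the example output, which the source does not state explicitly; `atl_tapref_payload` is a hypothetical name.

```bash
# Hypothetical helper (not part of ATL): build a tapRef request body,
# rejecting anything that does not look like the "e<N>" refs shown in
# the snapshot example. The ref format is an assumption.
atl_tapref_payload() {
  local ref="$1"
  if ! printf '%s' "$ref" | grep -Eq '^e[0-9]+$'; then
    echo "invalid ref: '$ref' (expected e0, e1, ...)" >&2
    return 1
  fi
  printf '{"method":"tapRef","params":{"ref":"%s"}}\n' "$ref"
}
```

For example, `atl_tapref_payload e0` prints the same body used in the `tapRef` call above, while `atl_tapref_payload "Wi-Fi"` fails before any request is sent.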

### The `find` Command

Find and interact with elements by text, with no need to parse the snapshot manually:

```bash
# Find and tap "Wi-Fi"
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"find","params":{"text":"Wi-Fi","action":"tap"}}'
# → {"success":true,"result":{"found":true,"ref":"e0"}}

# Check if an element exists
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"find","params":{"text":"Bluetooth","action":"exists"}}'
# → {"success":true,"result":{"found":true,"ref":"e1"}}

# Find and fill a text field
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"find","params":{"text":"First name","action":"fill","value":"John"}}'

# Get element info without interacting
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"find","params":{"text":"Cancel","action":"get"}}'
# → {"success":true,"result":{"found":true,"ref":"e5","element":{...}}}
```

**Parameters:**
- `text` (string) — Text to search for (matches label, value, or identifier)
- `action` (string) — One of: `tap`, `fill`, `exists`, `get`
- `value` (string, optional) — Text to fill (required for `action:"fill"`)
- `by` (string, optional) — Narrow search: `label`, `value`, `identifier`, `type`, or `any` (default)
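
The rule that `value` is required only for `fill` is easy to get wrong when building requests by hand. The following sketch encodes it in a hypothetical `atl_find_payload` helper. It does no JSON escaping, so it assumes the text and value contain no double quotes or backslashes, and it leaves out the optional `by` parameter.

```bash
# Hypothetical helper (not part of ATL): build a find request body.
# Usage: atl_find_payload TEXT ACTION [VALUE]
# TEXT and VALUE are inserted as-is (no JSON escaping).
atl_find_payload() {
  local text="$1" action="$2" value="${3:-}"
  case "$action" in
    tap|exists|get)
      printf '{"method":"find","params":{"text":"%s","action":"%s"}}\n' \
        "$text" "$action"
      ;;
    fill)
      # fill is the only action that needs a value.
      if [ -z "$value" ]; then
        echo "action 'fill' requires a value" >&2
        return 1
      fi
      printf '{"method":"find","params":{"text":"%s","action":"fill","value":"%s"}}\n' \
        "$text" "$value"
      ;;
    *)
      echo "unknown action: $action (expected tap, fill, exists, or get)" >&2
      return 1
      ;;
  esac
}
```

Running `atl_find_payload "First name" fill John` reproduces the body from the fill example above, and omitting `John` fails locally instead of sending an incomplete request.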

---

## 🔄 Native App Workflow Example

Here's a complete flow that opens Settings, navigates to Wi-Fi, takes a screenshot, navigates back, and closes the app:

```bash
# 1. Open Settings app
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"openApp","params":{"bundleId":"com.apple.Preferences"}}'

# 2. Wait for app to launch
sleep 1

# 3. Take snapshot to see available elements
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"snapshot","params":{"interactiveOnly":true}}' | jq '.result.elements[:5]'

# 4. Find and tap Wi-Fi
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"find","params":{"text":"Wi-Fi","action":"tap"}}'

# 5. Wait for navigation
sleep 0.5

# 6. Take screenshot of Wi-Fi settings
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"screenshot"}' | jq -r '.result.data' | base64 --decode > /tmp/wifi-settings.png

# 7. Navigate back (swipe right from left edge)
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"swipe","params":{"direction":"right"}}'

# 8. Close the app
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"closeApp"}'
```
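
The flow above ignores every response, so a failed step (for example, a `find` that matches nothing) would go unnoticed. One way to stop at the first failure is sketched below. `atl_check` is a hypothetical helper, and it relies on a plain substring match against the compact `"success":true` seen in the example responses; a JSON parser such as `jq` would be more robust.

```bash
# Hypothetical helper (not part of ATL): read one ATL response on stdin
# and fail unless it contains "success":true. On success the response
# is passed through unchanged so it can still be piped onward.
atl_check() {
  local response
  response=$(cat)
  case "$response" in
    *'"success":true'*)
      printf '%s\n' "$response"
      ;;
    *)
      echo "ATL command failed: $response" >&2
      return 1
      ;;
  esac
}
```

Each step can then be written as `curl -s -X POST http://localhost:9223/command -d '{"method":"closeApp"}' | atl_check || exit 1`, so the script aborts as soon as a command does not report success.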

### Helper Script Version

```bash
source ~/.openclaw/skills/atl-browser/scripts/atl-helper.sh

atl_openapp "com.apple.Preferences"
sleep 1
atl_find "Wi-Fi" tap
sleep 0.5
atl_screenshot /tmp/wifi-settings.png
atl_swipe right
atl_closeapp
```

---

## 💡 Core Insight: Vision-Free Automation

ATL's killer feature is **spatial understanding without vision models**:

```
┌─────────────────────────────────────────────────────────────┐
│  markElements + captureForVision = COMPLETE PAGE KNOWLEDGE  │
└─────────────────────────────────────────────────────────────┘

1. markElements  → Numbers every interactive element [1] [2] [3]
2. captureForVision → PDF with text layer + element coordinates
3. tap x=234 y=567 → Pixel-perfect touch at exact position
```

**Why this matters:**
- **No vision API calls** — zero token cost for "seeing" the page
- **Faster** — no round-trip to GPT-4V/Claude Vision
- **Deterministic** — same page = same coordinates, every time
- **Reliable** — pixel-perfect coordinates vs. vision interpretation

### The Vision-Free Workflow

```bash
# 1. Mark elements (adds numbered labels + stores coordinates)
curl -s -X POST http://localhost:9222/command \
  -d '{"id":"1","method":"markElements","params":{}}'

# 2. Capture PDF with text layer (machine-readable, has coordinates)
curl -s -X POST http://localhost:9222/command \
  -d '{"id":"2","method":"captureForVision","params":{"savePath":"/tmp","name":"page"}}' \
  | jq -r '.result.path'
# → /tmp/page.pdf (text-selectable, contains element positions)

# 3. Get specific element's position by mark label
curl -s -X POST http://localhost:9222/command \
  -d '{"id":"3","method":"getMarkInfo","params":{"label":5}}' | jq '
```

... (truncated)