---
name: atl-browser
description: Mobile browser and native app automation via ATL (iOS Simulator). Navigate, click, screenshot, and automate web and native app tasks on iPhone/iPad simulators.
metadata:
  openclaw:
    emoji: "📱"
    requires:
      bins: ["xcrun", "xcodebuild", "curl"]
    install:
      - id: "atl-clone"
        kind: "shell"
        command: "git clone https://github.com/JordanCoin/Atl ~/Atl"
        label: "Clone ATL repository"
      - id: "atl-setup"
        kind: "shell"
        command: "~/.openclaw/skills/atl-browser/scripts/setup.sh"
        label: "Build and install ATL to simulator"
---
# ATL — Agent Touch Layer
> The automation layer between AI agents and iOS
ATL provides HTTP-based automation for iOS Simulator — both **browser** (mobile Safari) and **native apps**. Think Playwright, but for mobile.
## 🔀 Two Servers: Browser & Native
ATL uses **two separate servers** for browser and native app automation:
| Server | Port | Use Case | Key Commands |
|--------|------|----------|--------------|
| **Browser** | `9222` | Web automation in mobile Safari | `goto`, `markElements`, `clickMark`, `evaluate` |
| **Native** | `9223` | iOS app automation (Settings, Contacts, any app) | `openApp`, `snapshot`, `tapRef`, `find` |
```
┌─────────────────────────────────────────────────────────────┐
│    BROWSER SERVER (9222)     │     NATIVE SERVER (9223)     │
│   (mobile Safari/WebView)    │    (iOS apps via XCTest)     │
│                              │                              │
│   markElements + clickMark   │      snapshot + tapRef       │
│        CSS selectors         │      accessibility tree      │
│        DOM evaluation        │      element references      │
│    tap, swipe, screenshot    │    tap, swipe, screenshot    │
└─────────────────────────────────────────────────────────────┘
```
**Why two ports?** Native app automation requires XCTest APIs (XCUIApplication, XCUIElement), which are only available in UI Test bundles. The native server therefore runs as a UI Test that exposes an HTTP API.
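Both servers expose the same `/ping` health check, so a quick probe shows which halves are running before you pick a port. A minimal sketch (the `atl_status` helper name is ours, not part of ATL):

```shell
# Report which ATL servers are reachable on this machine.
atl_status() {
  local port
  for port in 9222 9223; do
    if curl -sf "http://localhost:${port}/ping" >/dev/null 2>&1; then
      echo "port ${port}: up"
    else
      echo "port ${port}: down"
    fi
  done
}
```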
### Starting the Servers
```bash
# Browser server (starts automatically with AtlBrowser app)
xcrun simctl launch booted com.atl.browser
curl http://localhost:9222/ping   # → {"status":"ok"}

# Native server (run as UI Test)
cd ~/Atl/core/AtlBrowser
xcodebuild test -workspace AtlBrowser.xcworkspace \
  -scheme AtlBrowser \
  -destination 'id=<SIMULATOR_UDID>' \
  -only-testing:AtlBrowserUITests/NativeServer/testNativeServer &

# Wait for it to start, then:
curl http://localhost:9223/ping   # → {"status":"ok","mode":"native"}
```
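The native server can take a while to come up, since `xcodebuild` has to build and launch the UI Test bundle first. Instead of guessing at a `sleep`, a small retry loop can poll `/ping` until it answers. A sketch (the `wait_until` helper is ours, not part of ATL):

```shell
# Retry a command until it succeeds or `timeout` seconds have elapsed.
wait_until() {
  local timeout=$1; shift
  local start=$SECONDS
  until "$@" >/dev/null 2>&1; do
    if (( SECONDS - start >= timeout )); then
      return 1
    fi
    sleep 1
  done
}

# Usage with the servers above:
#   wait_until 30 curl -sf http://localhost:9222/ping    # browser server
#   wait_until 180 curl -sf http://localhost:9223/ping   # native server (xcodebuild is slow)
```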
### Quick Port Reference
| Task | Port | Example |
|------|------|---------|
| Browse websites | 9222 | `curl localhost:9222/command -d '{"method":"goto",...}'` |
| Open native app | 9223 | `curl localhost:9223/command -d '{"method":"openApp",...}'` |
| Screenshot (browser) | 9222 | `curl localhost:9222/command -d '{"method":"screenshot"}'` |
| Screenshot (native) | 9223 | `curl localhost:9223/command -d '{"method":"screenshot"}'` |
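Every `/command` call has the same shape — POST a `{"method": ..., "params": ...}` JSON body to the appropriate port — so the boilerplate is easy to factor out. A sketch (the `atl` and `atl_payload` helper names are ours, not part of ATL):

```shell
# Build the JSON body for /command; params defaults to an empty object.
atl_payload() {
  local params=${2:-'{}'}
  printf '{"method":"%s","params":%s}' "$1" "$params"
}

# POST a command to the ATL server on the given port.
atl() {
  local port=$1; shift
  curl -s -X POST "http://localhost:${port}/command" -d "$(atl_payload "$@")"
}

# Examples (ports as in the table above):
#   atl 9223 openApp '{"bundleId":"com.apple.Preferences"}'
#   atl 9223 screenshot
```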
---
## 📱 Native App Automation (Port 9223)
Native automation uses **port 9223** and automates **any iOS app** using the accessibility tree — no DOM, no JavaScript, just direct element interaction.
### Opening & Closing Apps
```bash
# Open an app by bundle ID
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"openApp","params":{"bundleId":"com.apple.Preferences"}}'
# → {"success":true,"result":{"bundleId":"com.apple.Preferences","mode":"native","state":"running"}}

# Check current app state
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"appState"}'
# → {"success":true,"result":{"mode":"native","bundleId":"com.apple.Preferences","state":"running"}}

# Close current app
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"closeApp"}'
# → {"success":true,"result":{"closed":true}}
```
### Common Bundle IDs
| App | Bundle ID |
|-----|-----------|
| Settings | `com.apple.Preferences` |
| Contacts | `com.apple.MobileAddressBook` |
| Calculator | `com.apple.calculator` |
| Calendar | `com.apple.mobilecal` |
| Photos | `com.apple.mobileslideshow` |
| Notes | `com.apple.mobilenotes` |
| Reminders | `com.apple.reminders` |
| Clock | `com.apple.mobiletimer` |
| Maps | `com.apple.Maps` |
| Safari | `com.apple.mobilesafari` |
### The `snapshot` Command
`snapshot` returns the accessibility tree — all visible elements with their properties and tappable references.
```bash
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"snapshot","params":{"interactiveOnly":true}}' | jq '.result'
```
**Example output:**
```json
{
  "count": 12,
  "elements": [
    {
      "ref": "e0",
      "type": "cell",
      "label": "Wi-Fi",
      "value": "MyNetwork",
      "identifier": "",
      "x": 0,
      "y": 142,
      "width": 393,
      "height": 44,
      "isHittable": true,
      "isEnabled": true
    },
    {
      "ref": "e1",
      "type": "cell",
      "label": "Bluetooth",
      "value": "On",
      "identifier": "",
      "x": 0,
      "y": 186,
      "width": 393,
      "height": 44,
      "isHittable": true,
      "isEnabled": true
    },
    {
      "ref": "e2",
      "type": "button",
      "label": "Back",
      "value": null,
      "identifier": "Back",
      "x": 0,
      "y": 44,
      "width": 80,
      "height": 44,
      "isHittable": true,
      "isEnabled": true
    }
  ]
}
```
**Parameters:**
- `interactiveOnly` (bool, default: `false`) — Only return hittable elements
- `maxDepth` (int, optional) — Limit tree traversal depth
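Downstream commands want an element's `ref`, so it helps to script the lookup instead of reading the JSON by eye. Assuming the response shape above and `jq` (which these examples already use), a sketch (the `ref_for_label` helper is ours, not part of ATL):

```shell
# Print the ref of the first snapshot element whose label matches exactly.
# Reads snapshot JSON on stdin.
ref_for_label() {
  jq -r --arg label "$1" \
    '[.result.elements[] | select(.label == $label)][0].ref // empty'
}

# Usage:
#   curl -s -X POST http://localhost:9223/command \
#     -d '{"method":"snapshot","params":{"interactiveOnly":true}}' | ref_for_label "Wi-Fi"
```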
### The `tapRef` Command
Tap an element by its reference from the last `snapshot`:
```bash
# Take snapshot first
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"snapshot","params":{"interactiveOnly":true}}'

# Tap element e0 (Wi-Fi cell from example above)
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"tapRef","params":{"ref":"e0"}}'
# → {"success":true}
```
### The `find` Command
Find and interact with elements by text — no need to parse the snapshot manually:
```bash
# Find and tap "Wi-Fi"
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"find","params":{"text":"Wi-Fi","action":"tap"}}'
# → {"success":true,"result":{"found":true,"ref":"e0"}}

# Check if an element exists
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"find","params":{"text":"Bluetooth","action":"exists"}}'
# → {"success":true,"result":{"found":true,"ref":"e1"}}

# Find and fill a text field
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"find","params":{"text":"First name","action":"fill","value":"John"}}'

# Get element info without interacting
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"find","params":{"text":"Cancel","action":"get"}}'
# → {"success":true,"result":{"found":true,"ref":"e5","element":{...}}}
```
**Parameters:**
- `text` (string) — Text to search for (matches label, value, or identifier)
- `action` (string) — One of: `tap`, `fill`, `exists`, `get`
- `value` (string, optional) — Text to fill (required for `action:"fill"`)
- `by` (string, optional) — Narrow search: `label`, `value`, `identifier`, `type`, or `any` (default)
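If the target element has not rendered yet, `find` will report `found:false`, so in practice it is often wrapped in a short polling loop. A sketch, assuming the response shape shown above (the `wait_find_tap` helper is ours, not part of ATL):

```shell
# Poll `find` with action "exists" until the element appears, then tap it.
# The text is interpolated into JSON, so keep it free of quotes.
wait_find_tap() {
  local text=$1 tries=${2:-10} i
  for ((i = 0; i < tries; i++)); do
    if curl -s -X POST http://localhost:9223/command \
         -d "{\"method\":\"find\",\"params\":{\"text\":\"$text\",\"action\":\"exists\"}}" \
       | grep -q '"found":true'; then
      curl -s -X POST http://localhost:9223/command \
        -d "{\"method\":\"find\",\"params\":{\"text\":\"$text\",\"action\":\"tap\"}}"
      return 0
    fi
    sleep 0.5
  done
  return 1
}
```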
---
## 🔄 Native App Workflow Example
Here's a complete flow: open Settings, navigate to Wi-Fi, take a screenshot:
```bash
# 1. Open Settings app
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"openApp","params":{"bundleId":"com.apple.Preferences"}}'

# 2. Wait for app to launch
sleep 1

# 3. Take snapshot to see available elements
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"snapshot","params":{"interactiveOnly":true}}' | jq '.result.elements[:5]'

# 4. Find and tap Wi-Fi
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"find","params":{"text":"Wi-Fi","action":"tap"}}'

# 5. Wait for navigation
sleep 0.5

# 6. Take screenshot of Wi-Fi settings
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"screenshot"}' | jq -r '.result.data' | base64 -d > /tmp/wifi-settings.png

# 7. Navigate back (swipe right from left edge)
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"swipe","params":{"direction":"right"}}'

# 8. Close the app
curl -s -X POST http://localhost:9223/command \
  -d '{"method":"closeApp"}'
```
### Helper Script Version
```bash
source ~/.openclaw/skills/atl-browser/scripts/atl-helper.sh
atl_openapp "com.apple.Preferences"
sleep 1
atl_find "Wi-Fi" tap
sleep 0.5
atl_screenshot /tmp/wifi-settings.png
atl_swipe right
atl_closeapp
```
---
## 💡 Core Insight: Vision-Free Automation
ATL's killer feature is **spatial understanding without vision models**:
```
┌─────────────────────────────────────────────────────────────┐
│  markElements + captureForVision = COMPLETE PAGE KNOWLEDGE  │
└─────────────────────────────────────────────────────────────┘

1. markElements     → Numbers every interactive element [1] [2] [3]
2. captureForVision → PDF with text layer + element coordinates
3. tap x=234 y=567  → Pixel-perfect touch at exact position
```
**Why this matters:**
- **No vision API calls** — zero token cost for "seeing" the page
- **Faster** — no round-trip to GPT-4V/Claude Vision
- **Deterministic** — same page = same coordinates, every time
- **Reliable** — pixel-perfect coordinates vs. vision interpretation
### The Vision-Free Workflow
```bash
# 1. Mark elements (adds numbered labels + stores coordinates)
curl -s -X POST http://localhost:9222/command \
  -d '{"id":"1","method":"markElements","params":{}}'

# 2. Capture PDF with text layer (machine-readable, has coordinates)
curl -s -X POST http://localhost:9222/command \
  -d '{"id":"2","method":"captureForVision","params":{"savePath":"/tmp","name":"page"}}' \
  | jq -r '.result.path'
# → /tmp/page.pdf (text-selectable, contains element positions)

# 3. Get specific element's position by mark label
curl -s -X POST http://localhost:9222/command \
  -d '{"id":"3","method":"getMarkInfo","params":{"label":5}}' | jq '
# ... (truncated)
```