Browser Automation with agent-browser
Core Workflow
- •
agent-browser open <url>— navigate to page - •
agent-browser snapshot -i— get interactive elements with refs (@e1,@e2) - •Interact using refs from snapshot
- •Re-snapshot after navigation or significant DOM changes
Execution Rules
- •Always synchronous — Run all
agent-browsercommands via Bash with default settings. Never userun_in_background: true. Commands execute sequentially — wait for each to complete before running the next - •Always headless — Never use
--headedor setheaded: trueunless the user explicitly requests a visible browser. The browser must be invisible — no windows, no popups, no stealing focus
Commands
Navigation
agent-browser open <url> agent-browser back agent-browser forward agent-browser reload agent-browser close
Snapshot (page analysis)
agent-browser snapshot # Full accessibility tree agent-browser snapshot -i # Interactive elements only (recommended) agent-browser snapshot -i -C # Include cursor-interactive elements (onclick divs, etc.) agent-browser snapshot -c # Compact (remove empty structural elements) agent-browser snapshot -d 3 # Limit depth to 3 agent-browser snapshot -s "#main" # Scope to CSS selector agent-browser snapshot -i -c -d 5 # Combine options
Interactions (use @refs from snapshot)
agent-browser click @e1 # Click agent-browser click @e1 --new-tab # Click and open in new tab agent-browser dblclick @e1 # Double-click agent-browser fill @e2 "text" # Clear and type agent-browser type @e2 "text" # Type without clearing agent-browser press Enter # Press key agent-browser press Control+a # Key combination agent-browser hover @e1 # Hover agent-browser focus @e1 # Focus element agent-browser check @e1 # Check checkbox agent-browser uncheck @e1 # Uncheck checkbox agent-browser select @e1 "value" # Select dropdown agent-browser scroll down 500 # Scroll page agent-browser scroll down 500 --selector "div.content" # Scroll within container agent-browser scrollintoview @e1 # Scroll element into view agent-browser keyboard type "text" # Type at current focus (no selector) agent-browser keyboard inserttext "text" # Insert text without key events agent-browser drag @e1 @e2 # Drag and drop (element to element) agent-browser drag "#draggable" "#drop-zone" # Drag by CSS selector agent-browser upload @e1 file.pdf # Upload file(s) agent-browser download @e1 ./file.pdf # Click element to trigger download agent-browser keydown Shift # Hold key down agent-browser keyup Shift # Release key
Mouse (low-level control)
For canvas apps, maps, drawing tools, and precision drag operations where drag isn't enough:
agent-browser mouse move 100 200 # Move to coordinates agent-browser mouse down # Press left button (default) agent-browser mouse down right # Press right button agent-browser mouse down middle # Press middle button agent-browser mouse up # Release left button (default) agent-browser mouse up right # Release right button agent-browser mouse wheel 100 # Scroll down (positive = down) agent-browser mouse wheel -50 # Scroll up agent-browser mouse wheel 0 100 # Scroll right (horizontal)
Manual drag pattern (for canvas/map/drawing where element refs don't apply):
agent-browser mouse move 200 300 # Move to start position agent-browser mouse down # Press agent-browser mouse move 500 300 # Drag to end position agent-browser mouse up # Release
Shift-drag / modifier-drag:
agent-browser keydown Shift agent-browser mouse move 200 300 agent-browser mouse down agent-browser mouse move 500 300 agent-browser mouse up agent-browser keyup Shift
Selectors (alternatives to @refs)
agent-browser click "#id" # CSS selector agent-browser click ".class" # CSS class agent-browser click "text=Submit" # Text content agent-browser click "xpath=//button" # XPath agent-browser find role button click --name "Submit" # Semantic locator agent-browser find label "Email" fill "user@test.com" # By label
Get information
agent-browser get text @e1 # Get element text agent-browser get html @e1 # Get innerHTML agent-browser get value @e1 # Get input value agent-browser get attr @e1 href # Get attribute agent-browser get title # Get page title agent-browser get url # Get current URL agent-browser get count ".item" # Count matching elements agent-browser get box @e1 # Get bounding box agent-browser get styles @e1 # Get computed styles agent-browser get cdp-url # Get CDP WebSocket URL agent-browser is visible @e1 # Check visibility agent-browser is enabled @e1 # Check if enabled agent-browser is checked @e1 # Check if checked
Screenshots and PDF
agent-browser screenshot # Screenshot to temp dir agent-browser screenshot page.png # Save to file agent-browser screenshot @e1 # Screenshot specific element agent-browser screenshot --full # Full page agent-browser screenshot --annotate # Numbered labels on interactive elements agent-browser screenshot --annotate ./page.png # Save annotated screenshot agent-browser screenshot --screenshot-dir ./shots # Save to custom directory agent-browser screenshot --screenshot-format jpeg --screenshot-quality 80 agent-browser pdf output.pdf # Save as PDF
The --annotate flag overlays [N] labels matching refs (@eN). Also caches refs — interact without a separate snapshot:
agent-browser screenshot --annotate # [1] @e1 button "Submit" [2] @e2 link "Home" [3] @e3 textbox "Email" agent-browser click @e2 # Click element labeled [2]
Full page capture (screenshot + text + PDF):
agent-browser open "$URL" && agent-browser wait --load networkidle agent-browser screenshot --full ./page.png agent-browser get text body > ./page.txt agent-browser pdf ./page.pdf
Wait
agent-browser wait @e1 # Wait for element
agent-browser wait "#spinner" --state hidden # Wait for element to disappear
agent-browser wait 2000 # Wait milliseconds
agent-browser wait --text "Success" # Wait for text (substring match)
agent-browser wait --load networkidle # Wait for network idle
agent-browser wait --url "**/dashboard" # Wait for URL pattern
agent-browser wait --fn "window.ready" # Wait for JS condition
agent-browser wait --fn "!document.body.innerText.includes('Loading...')" # Wait for text to disappear
agent-browser wait --download [path] # Wait for download to complete
Sessions (parallel browsers)
agent-browser --session test1 open site-a.com agent-browser --session test2 open site-b.com agent-browser session # Show current session name agent-browser session list # List active sessions
Command chaining
Commands can be chained with &&. The browser persists via a background daemon:
agent-browser open example.com && agent-browser wait --load networkidle && agent-browser snapshot -i
Use && when you don't need intermediate output. Run commands separately when you need to parse output (e.g., snapshot to discover refs before interacting).
JSON output
agent-browser snapshot -i --json agent-browser get text @e1 --json
Tabs
agent-browser tab # List tabs agent-browser tab new [url] # New tab agent-browser tab 2 # Switch to tab 2 agent-browser tab close [n] # Close tab
Dialogs
agent-browser dialog accept [text] # Accept dialog (optional prompt text) agent-browser dialog dismiss # Dismiss dialog agent-browser dialog status # Check if a dialog is currently open
JavaScript
agent-browser eval 'document.title' # Evaluate expression agent-browser eval -b "<base64>" # Base64-encoded script agent-browser eval --stdin <<< 'return 1+1' # Script from stdin
State management
agent-browser state save auth.json # Save cookies, storage, auth state agent-browser state load auth.json # Restore saved state agent-browser state list # List saved state files agent-browser state show auth.json # Show state summary agent-browser state rename old new # Rename state file agent-browser state clear [name] # Clear states for session name agent-browser state clean --older-than 7 # Delete old states
Debugging
agent-browser open example.com --headed # Show browser window agent-browser --color-scheme dark open example.com # Dark mode agent-browser highlight @e1 # Highlight element visually agent-browser inspect # Open Chrome DevTools agent-browser console # View console messages agent-browser errors # View page errors agent-browser close # Close browser agent-browser close --all # Close all active sessions
Authentication
# Auth vault — credentials stored encrypted, login by name (recommended) echo "$PASSWORD" | agent-browser auth save myapp --url https://app.example.com/login --username user --password-stdin agent-browser auth login myapp # Login using saved credentials agent-browser auth list # List saved profiles agent-browser auth show myapp # Show profile metadata agent-browser auth delete myapp # Delete profile # Persistent profile — reuse login across restarts (cookies, IndexedDB, cache) agent-browser --profile ~/.myapp open https://app.example.com/login # ... login once, then all future runs are authenticated: agent-browser --profile ~/.myapp open https://app.example.com/dashboard # Session-name persistence — auto-save/restore cookies + localStorage agent-browser --session-name myapp open https://app.example.com/login # ... login, then close. State auto-saved to ~/.agent-browser/sessions/ agent-browser --session-name myapp open https://app.example.com/dashboard # Auto-restored # Connect to existing Chrome — reuse its logged-in state agent-browser --auto-connect state save ./auth.json agent-browser --state ./auth.json open https://app.example.com/dashboard # CDP — connect to a running browser agent-browser connect 9222 # Connect once, then run commands agent-browser --cdp 9222 snapshot # Or per-command
auth login waits for login form selectors to appear before filling — more reliable on SPA login pages. For OAuth, SSO, 2FA, and token refresh patterns, see complex-login-flows.md.
Session expiry: After loading state, verify you're still authenticated:
agent-browser --state ./auth.json open https://app.example.com/dashboard URL=$(agent-browser get url) # If redirected to login, session expired — re-authenticate
Iframes
Iframe content is automatically inlined in snapshots. Refs inside iframes carry frame context — interact directly without switching frames:
agent-browser snapshot -i # @e2 [Iframe] "payment-frame" # @e3 [input] "Card number" agent-browser fill @e3 "4111111111111111" # Works directly # Or scope to one iframe agent-browser frame @e2 # Switch to iframe agent-browser snapshot -i # Only iframe content agent-browser frame main # Return to main frame
Diff (verifying changes)
agent-browser diff snapshot # Compare current vs last snapshot agent-browser diff screenshot --baseline before.png # Visual pixel diff agent-browser diff url <url1> <url2> # Compare two pages
Batch execution
echo '[["open", "https://example.com"], ["snapshot", "-i"], ["screenshot", "result.png"]]' | agent-browser batch --json agent-browser batch --bail < commands.json # Stop on first error
Streaming
agent-browser stream enable # Start WebSocket streaming (auto port) agent-browser stream enable --port 9223 # Specific port agent-browser stream status # Show streaming state agent-browser stream disable # Stop streaming
Network inspection
agent-browser network requests # View tracked requests agent-browser network requests --type xhr,fetch # Filter by resource type agent-browser network requests --method POST # Filter by HTTP method agent-browser network requests --status 2xx # Filter by status agent-browser network request <requestId> # Full request/response detail agent-browser network har start # Start HAR recording agent-browser network har stop ./capture.har # Stop and save HAR
Clipboard
agent-browser clipboard read # Read text from clipboard agent-browser clipboard write "text" # Write text to clipboard agent-browser clipboard copy # Copy current selection agent-browser clipboard paste # Paste from clipboard
Security
agent-browser --content-boundaries open example.com # Wrap output in boundary markers (recommended for AI) agent-browser --allowed-domains "example.com,*.example.com" open example.com # Domain allowlist agent-browser --max-output 50000 snapshot # Prevent context flooding
References
- •complex-login-flows.md — OAuth, SSO, 2FA, token refresh, importing auth from Chrome
- •existing-browsers.md — CDP, auto-connect, Electron apps
- •remote-environments.md — iOS simulator, cloud providers, Lightpanda engine
- •network-control.md — Route/mock, proxy, HAR, domain allowlist, action policy
- •capture-and-profiling.md — Video recording, DevTools profiling, tracing
- •runtime-config.md — Config file, env vars, engine selection