osaurus-macos-use

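Control macOS through the accessibility APIs. Use when the user wants to interact with native Mac apps, automate UI tasks, browse the web in Safari, fill in forms, navigate menus, or perform any on-screen action.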

SKILL.md
---
name: osaurus-macos-use
description: Control macOS via accessibility APIs. Use when the user asks to interact with native Mac apps, automate UI tasks, browse the web in Safari, fill forms, navigate menus, or perform any on-screen action.
metadata:
  author: osaurus
  version: "0.2.0"
---

Osaurus macOS Use

Automate macOS through accessibility APIs. This plugin gives you direct control over any application's UI — click buttons, type text, fill forms, navigate menus, browse the web in Safari, and more.

Core Workflow

Every interaction follows the Open-Observe-Act pattern:

  1. Open the app: call open_application to launch or activate it; it returns a pid.
  2. Observe the UI — ALWAYS call get_ui_elements with the pid before doing anything else. This confirms the app is ready and gives you element IDs. Never skip this step. Never send keyboard or mouse actions before observing.
  3. Act — use element IDs to click_element, type_text, set_value, press_key, etc.
  4. Re-observe when the UI changes — after navigation, dialogs, tab switches, or form submissions. Do NOT re-observe after actions that don't change the visible UI (typing, toggling, pressing shortcuts).

This decoupled approach keeps token usage low. A typical 5-step interaction costs ~3K tokens vs ~150K if you re-observed after every action.
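
The pattern above can be sketched in a few lines of Python. This is only an illustration: `call_tool` is a hypothetical dispatcher (not part of the plugin), the `"elements"` key in the observation result is an assumed shape, and `fake_call_tool` is a stub that lets the sketch run without a Mac.

```python
from typing import Any, Callable

# Hypothetical dispatcher type: call_tool(name, **args) -> dict.
ToolCaller = Callable[..., dict[str, Any]]


def open_observe_act(call_tool: ToolCaller, app: str, label: str) -> dict[str, Any]:
    """Open an app, observe once, then click the element with the given label."""
    pid = call_tool("open_application", identifier=app)["pid"]  # 1. Open
    observed = call_tool("get_ui_elements", pid=pid)            # 2. Observe
    # Assumption: the observation result carries a list under "elements".
    target = next(e for e in observed["elements"] if e.get("label") == label)
    return call_tool("click_element", id=target["id"])          # 3. Act


# Stub standing in for the real plugin so the sketch runs on its own.
calls: list[str] = []


def fake_call_tool(name: str, **args: Any) -> dict[str, Any]:
    calls.append(name)
    if name == "open_application":
        return {"pid": 1234, "name": args["identifier"]}
    if name == "get_ui_elements":
        return {"elements": [{"id": 5, "role": "button", "label": "New Note"}]}
    return {"success": True}


result = open_observe_act(fake_call_tool, "Notes", "New Note")
```

The stub records the call order, which makes it easy to check that observation always happens between opening and acting.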

When to Re-Observe

Call get_ui_elements again when:

  • You clicked a button/link that opens a new view, dialog, page, or menu
  • You switched tabs or windows
  • You submitted a form and the UI refreshed
  • An element ID returns "Element not found"

Do NOT re-observe when:

  • You just typed text into a field (the field still has focus)
  • You just pressed a keyboard shortcut (e.g., Cmd+S to save)
  • You clicked a toggle, checkbox, or other control that doesn't change the surrounding UI
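
One way to encode the two lists above is a small decision helper. The action category names below are made up for illustration and are not plugin identifiers; the only string taken from the plugin is the "Element not found" error text.

```python
from typing import Optional

# Hypothetical categories for actions that change the visible UI.
UI_CHANGING = {
    "open_view", "open_dialog", "open_menu",
    "switch_tab", "switch_window", "submit_form",
}


def should_reobserve(action_kind: str, error: Optional[str] = None) -> bool:
    """Return True when get_ui_elements should be called again."""
    if error is not None and "Element not found" in error:
        return True  # the cached ID is stale, so a fresh observation is needed
    # Typing, shortcuts, and toggles are not in the set, so they return False.
    return action_kind in UI_CHANGING
```

For example, `should_reobserve("type_text")` is False, while `should_reobserve("submit_form")` is True.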

Tool Selection Guide

Clicking

  • click_element — Default choice. Uses accessibility actions (most reliable). Supports button: "right" for context menus and doubleClick: true for double-clicks.
  • click — Fallback for coordinate-based clicks. Only use when elements aren't accessible (canvas apps, image regions, screenshot-guided interaction).

Entering Text

  • set_value — Best for form fields. Directly sets the element's value. Instant, reliable, replaces existing content.
  • type_text — Simulates keystroke-by-keystroke typing. Use when set_value doesn't work (e.g., search fields that need live filtering, password fields, or fields that trigger on-type events). Pass id to auto-focus the element first.

Prefer set_value over type_text when filling forms. Fall back to type_text if set_value returns an error.
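
A minimal sketch of that fallback, assuming a hypothetical `call_tool(name, **args)` dispatcher and assuming failures come back as `{ success: false, error: "..." }` (the shape documented for click_element). The stub simulates a field that rejects set_value.

```python
from typing import Any, Callable


def fill_field(call_tool: Callable[..., dict[str, Any]], element_id: int, text: str) -> str:
    """Try set_value first, then type_text. Returns the name of the tool that worked."""
    result = call_tool("set_value", id=element_id, value=text)
    if result.get("success"):
        return "set_value"
    # Fallback: simulate keystrokes, passing id so the element is focused first.
    result = call_tool("type_text", text=text, id=element_id)
    if result.get("success"):
        return "type_text"
    raise RuntimeError(result.get("error", "could not enter text"))


# Stub (hypothetical behavior): set_value fails, type_text succeeds.
def fake_call_tool(name: str, **args: Any) -> dict[str, Any]:
    if name == "set_value":
        return {"success": False, "error": "Failed to set element value"}
    return {"success": True}


used = fill_field(fake_call_tool, 8, "Hello, world!")
```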

Screenshots

  • take_screenshot — Use when the accessibility tree is insufficient to understand the visual layout (e.g., verifying styling, reading images, canvas apps, or when elements don't have labels).
  • get_ui_elements — Preferred for most interactions. Lighter, faster, and returns structured data.

Use take_screenshot with pid to capture a specific app window. Default settings (JPEG, 0.7 quality, 0.5 scale) are optimized for token efficiency.

Keyboard

  • press_key — For keyboard shortcuts, navigation keys, and special keys. Always prefer keyboard shortcuts over UI clicking when available (faster, more reliable).

Scrolling

  • scroll — Pass x/y to scroll a specific area. Without coordinates, scrolls at the current mouse position. Use amount to control scroll distance (default: 3 pixels).

Token Efficiency Tips

  1. Use interactiveOnly: true (default) when calling get_ui_elements. Only set to false when you need to read static text labels.
  2. Keep maxElements low. Default is 100. For simple UIs (dialogs, settings panes), use 30-50. For complex UIs (web pages), use 100-150.
  3. Use roles filter to narrow results. For example, roles: ["button"] when looking for a specific button, or roles: ["textField", "textArea"] when looking for input fields.
  4. Avoid unnecessary screenshots. Screenshots consume vision tokens. Use get_ui_elements first — only screenshot if you need visual context.
  5. Batch actions between observations. After the initial observe, perform multiple actions (click, type, press key) before re-observing — but always do that initial observe after open_application.
  6. Use keyboard shortcuts instead of navigating menus. press_key("s", modifiers: ["command"]) is cheaper than finding and clicking File > Save.
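
Tips 1 to 3 can be folded into a helper that builds the get_ui_elements arguments. The helper name and its flags are illustrative assumptions; the limits 50 and 150 are simply the upper ends of the ranges suggested in tip 2.

```python
from typing import Any, Optional


def observe_args(pid: int, complex_ui: bool = False,
                 roles: Optional[list[str]] = None,
                 read_static_text: bool = False) -> dict[str, Any]:
    """Build get_ui_elements arguments that follow the tips above."""
    args: dict[str, Any] = {
        "pid": pid,
        # Only include static text when it actually has to be read.
        "interactiveOnly": not read_static_text,
        # Smaller cap for dialogs and settings panes, larger for web pages.
        "maxElements": 150 if complex_ui else 50,
    }
    if roles:
        args["roles"] = roles
    return args


dialog_query = observe_args(1234, roles=["button"])
page_query = observe_args(5678, complex_ui=True, read_static_text=True)
```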

Common Recipes

Open an App and Inspect It

Always observe after opening — never skip this step:

```
1. open_application(identifier: "Notes")
   → { pid: 1234, name: "Notes" }

2. get_ui_elements(pid: 1234)
   → Returns elements with IDs; app is confirmed ready
```

Click a Button

After observing (step 2 above), find the element and click it:

```
click_element(id: 5)
→ { success: true }
```

Fill a Text Field

Use roles to filter for input fields, then set the value:

```
1. get_ui_elements(pid: 1234, roles: ["textField", "searchField"])
   → Find text field with ID = 8

2. set_value(id: 8, value: "Hello, world!")
   → { success: true }
```

If set_value fails, fall back to type_text:

```
type_text(text: "Hello, world!", id: 8)
→ { success: true }
```

Navigate a Menu

Use keyboard shortcuts when possible. Otherwise:

```
1. click_element(id: <menu_bar_item_id>)
   → Opens menu

2. get_ui_elements(pid: 1234, roles: ["menuItem"])
   → Find the menu item

3. click_element(id: <menu_item_id>)
```

Right-Click for Context Menu

```
1. click_element(id: 5, button: "right")
   → Opens context menu

2. get_ui_elements(pid: 1234, roles: ["menuItem"])
   → Find context menu items

3. click_element(id: <menu_item_id>)
```

Handle a Dialog

After an action triggers a dialog:

```
1. get_ui_elements(pid: 1234)
   → Dialog elements appear (buttons like "OK", "Cancel", "Save")

2. click_element(id: <ok_button_id>)
```

Switch Between Apps

```
1. open_application(identifier: "Safari")
   → { pid: 5678 }

2. get_ui_elements(pid: 5678)
   → Safari's UI elements; now safe to interact
```

You can also use press_key("tab", modifiers: ["command"]) to switch, but always follow up with get_ui_elements before sending any input to the newly focused app.

Safari Web Browsing

Safari's web content is fully accessible through the accessibility tree. Links, buttons, headings, text fields, and other interactive elements all appear in get_ui_elements.

Navigate to a URL

```
1. open_application(identifier: "Safari")
   → { pid: 5678 }

2. get_ui_elements(pid: 5678)
   → Confirm Safari is loaded and ready

3. press_key("l", modifiers: ["command"])
   → Focuses the address bar

4. type_text(text: "https://example.com")

5. press_key("return")
   → Page loads

6. get_ui_elements(pid: 5678)
   → Web page elements (links, buttons, inputs)
```

Click a Link on a Web Page

Once Safari is open and observed:

```
1. click_element(id: 12)
   → Navigates to sign-in page (ID 12 was "Sign In" link from observation)

2. get_ui_elements(pid: 5678)
   → New page elements (re-observe because the page changed)
```

Fill a Web Form

After navigating to a page with a form:

```
1. get_ui_elements(pid: 5678, roles: ["textField"])
   → Find email field ID = 15, password field ID = 16

2. set_value(id: 15, value: "user@example.com")
3. set_value(id: 16, value: "password123")
4. click_element(id: <submit_button_id>)
```

Search the Web

Assumes Safari is already open and observed (you have the pid):

```
1. press_key("l", modifiers: ["command"])
2. type_text(text: "weather in San Francisco")
3. press_key("return")
4. get_ui_elements(pid: 5678)
   → Search results page elements
```

Tab Management

  • New tab: press_key("t", modifiers: ["command"])
  • Close tab: press_key("w", modifiers: ["command"])
  • Next tab: press_key("}", modifiers: ["command", "shift"])
  • Previous tab: press_key("{", modifiers: ["command", "shift"])
  • Reopen closed tab: press_key("t", modifiers: ["command", "shift"])

Reading Page Content

Use get_ui_elements with interactiveOnly: false to read static text on a page. If the page layout matters, use take_screenshot to visually inspect it.

Scrolling a Web Page

```
scroll(direction: "down", amount: 5, x: 700, y: 400)
```

Pass the center of the Safari content area as x/y to ensure scrolling happens in the right place.
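
If the coordinates are not known in advance, one option is to derive them from the window frame that get_active_window reports (x, y, w, h). The sketch below assumes x and y are the top-left corner, and it uses the window center as an approximation of the content-area center, since the toolbar shifts the true center down slightly. The frame values are illustrative.

```python
def window_center(window: dict) -> tuple[int, int]:
    """Center point of a window given x, y, w, h (assumed top-left origin)."""
    return (window["x"] + window["w"] // 2, window["y"] + window["h"] // 2)


# Illustrative values in the shape get_active_window is described as returning.
safari = {"pid": 5678, "x": 100, "y": 50, "w": 1200, "h": 700}
x, y = window_center(safari)

# Arguments for the scroll tool, matching the call shown above.
scroll_args = {"direction": "down", "amount": 5, "x": x, "y": y}
```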

macOS Keyboard Shortcuts

Use these with press_key to avoid navigating menus:

System

| Action | Key | Modifiers |
| --- | --- | --- |
| Switch app | tab | ["command"] |
| Spotlight search | space | ["command"] |
| Force quit | escape | ["command", "option"] |
| Lock screen | q | ["command", "control"] |
| Screenshot (full screen) | 3 | ["command", "shift"] |
| Screenshot (selection) | 4 | ["command", "shift"] |

File Operations

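| Action | Key | Modifiers |
| --- | --- | --- |
| Save | s | ["command"] |
| Save As | s | ["command", "shift"] |
| Open | o | ["command"] |
| New | n | ["command"] |
| Close window | w | ["command"] |
| Quit app | q | ["command"] |
| Print | p | ["command"] |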

Editing

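| Action | Key | Modifiers |
| --- | --- | --- |
| Copy | c | ["command"] |
| Cut | x | ["command"] |
| Paste | v | ["command"] |
| Undo | z | ["command"] |
| Redo | z | ["command", "shift"] |
| Select all | a | ["command"] |
| Find | f | ["command"] |
| Find next | g | ["command"] |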

Safari

| Action | Key | Modifiers |
| --- | --- | --- |
| Focus address bar | l | ["command"] |
| New tab | t | ["command"] |
| Close tab | w | ["command"] |
| Reload | r | ["command"] |
| Back | [ | ["command"] |
| Forward | ] | ["command"] |
| Downloads | l | ["command", "option"] |
| Bookmarks | b | ["command", "option"] |
| Reader mode | r | ["command", "shift"] |

Navigation

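| Action | Key | Modifiers |
| --- | --- | --- |
| Next field | tab | none |
| Previous field | tab | ["shift"] |
| Confirm/submit | return | none |
| Cancel/dismiss | escape | none |
| Page up | pageup | none |
| Page down | pagedown | none |
| Top of page | home | none |
| Bottom of page | end | none |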
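
A lookup table is one convenient way to carry these shortcuts in code. The sketch below copies a few rows from the tables above. The action names are made up, and `key` as the name of press_key's first parameter is an assumption, since the examples in this document pass it positionally.

```python
# A few rows from the shortcut tables, keyed by a made-up action name.
SHORTCUTS = {
    "save": ("s", ["command"]),
    "save_as": ("s", ["command", "shift"]),
    "focus_address_bar": ("l", ["command"]),
    "previous_field": ("tab", ["shift"]),
    "confirm": ("return", []),
}


def press_key_args(action: str) -> dict:
    """Translate an action name into press_key arguments."""
    key, modifiers = SHORTCUTS[action]
    args = {"key": key}  # "key" is an assumed parameter name
    if modifiers:
        args["modifiers"] = modifiers
    return args
```

An unknown action name raises `KeyError`, which is usually preferable to silently sending the wrong keystroke.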

Tool Reference

open_application

  • Accepts app name ("Safari"), bundle ID ("com.apple.Safari"), or file path.
  • If already running, activates the app. Otherwise launches it.
  • Returns pid, bundleId, and name.

get_ui_elements

  • Returns interactive elements with assigned IDs. Each element has: id, role, label, value, x, y, w, h, actions.
  • IDs are valid until the next get_ui_elements call (which resets the cache).
  • Use roles filter for targeted queries: ["button"], ["textField", "textArea"], ["link"], ["menuItem"], etc.
  • Common roles: button, link, textField, textArea, checkBox, radioButton, popUpButton, comboBox, slider, menuItem, tab, searchField.
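
Given the element fields listed above, picking a target and computing a point for a coordinate-based fallback click might look like the following. The sample elements are invented, and the sketch assumes x/y are the top-left corner of the element with w/h as its size.

```python
def find_element(elements, role, label_contains):
    """First element with the given role whose label contains the text (case-insensitive)."""
    needle = label_contains.lower()
    for element in elements:
        if element["role"] == role and needle in (element.get("label") or "").lower():
            return element
    return None


def element_center(element):
    """Center point, assuming x/y are the top-left corner and w/h the size."""
    return (element["x"] + element["w"] / 2, element["y"] + element["h"] / 2)


# Invented sample data using the documented field names.
elements = [
    {"id": 11, "role": "textField", "label": "Email", "value": "",
     "x": 300, "y": 200, "w": 240, "h": 30, "actions": []},
    {"id": 12, "role": "link", "label": "Sign In", "value": None,
     "x": 40, "y": 90, "w": 80, "h": 20, "actions": ["AXPress"]},
]
link = find_element(elements, "link", "sign in")
```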

click_element

  • Left-click by default. Pass button: "right" for right-click. Pass doubleClick: true for double-click.
  • Uses AXPress action first (most reliable), falls back to coordinate click.
  • Returns { success: true } or { success: false, error: "..." }.

click

  • Clicks at raw screen coordinates. Only use when elements aren't accessible.
  • Supports button (left/right/center) and doubleClick.

type_text

  • Types keystroke-by-keystroke into the focused element.
  • Pass id to auto-focus an element before typing.
  • Use for search fields, password fields, or fields that need on-type events.

set_value

  • Directly sets an element's value via accessibility API.
  • Preferred over type_text for form fields — instant and replaces existing content.
  • Returns error if the element isn't editable.

press_key

  • Key names: return, escape, tab, delete, space, up, down, left, right, f1-f12, home, end, pageup, pagedown, or a single character (for example a, 1, or a comma).
  • Modifier names: command, shift, option, control.

scroll

  • Directions: up, down, left, right.
  • amount controls scroll distance in pixels (default: 3). Use higher values (5-10) for faster scrolling.
  • Pass x/y to position the mouse before scrolling (important for scrolling specific areas).

drag

  • Drags from (startX, startY) to (endX, endY).
  • Useful for sliders, window resizing, drag-and-drop, and drawing.

take_screenshot

  • Defaults: JPEG format, 0.7 quality, 0.5 scale.
  • Pass pid to capture a specific app's window.
  • Pass savePath to save to disk (avoids base64 token costs).
  • Returns MCP ImageContent format for vision model consumption.

get_active_window

  • Returns: pid, app name, title, x, y, w, h.
  • Useful when you don't know which app is in front.

list_displays

  • Returns all connected displays with index, position, and dimensions.
  • Only needed for multi-monitor setups.

Troubleshooting

"Element not found"

The element cache was reset or the element is no longer on screen. Call get_ui_elements again to refresh.
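
A sketch of that recovery, assuming a hypothetical `call_tool(name, **args)` dispatcher and an `"elements"` list in the observation result. Because IDs are reassigned on every observation, the retry looks the element up by label rather than reusing the old ID.

```python
from typing import Any, Callable


def click_by_label(call_tool: Callable[..., dict[str, Any]], pid: int,
                   label: str, known_id: int) -> dict[str, Any]:
    """Click known_id; on "Element not found", re-observe once and retry by label."""
    result = call_tool("click_element", id=known_id)
    if result.get("success") or "Element not found" not in result.get("error", ""):
        return result
    observed = call_tool("get_ui_elements", pid=pid)  # refresh the ID cache
    for element in observed["elements"]:
        if element.get("label") == label:
            return call_tool("click_element", id=element["id"])
    return result  # still missing: report the original error


# Stub: ID 5 is stale; a fresh observation assigns the button ID 9.
def fake_call_tool(name: str, **args: Any) -> dict[str, Any]:
    if name == "get_ui_elements":
        return {"elements": [{"id": 9, "role": "button", "label": "OK"}]}
    if args["id"] == 9:
        return {"success": True}
    return {"success": False, "error": "Element not found"}


outcome = click_by_label(fake_call_tool, 1234, "OK", known_id=5)
```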

"Failed to set element value"

The element may not be editable via accessibility. Fall back to type_text with the element id.

No elements returned

  • Verify the pid is correct (use get_active_window to check).
  • Some apps have poor accessibility support. Try take_screenshot and use coordinate-based click instead.
  • For web content in Safari, ensure the page has fully loaded before querying elements.

Stale element positions

Elements may move after window resize or scroll. Call get_ui_elements again if coordinate-based fallback clicks miss.

Accessibility permission denied

The host application needs Accessibility permission in System Settings > Privacy & Security > Accessibility.

Limitations

  • Canvas-based apps (Figma, games) — No element tree. Use take_screenshot + click with coordinates.
  • Poorly accessible apps — Some apps don't expose their UI through accessibility APIs. Use screenshot-guided coordinate clicks as fallback.
  • Complex web apps: Very dynamic SPAs may have elements that appear and disappear rapidly. Re-observe frequently and use a lower maxElements value.
  • Element modification — Cannot reorder, resize, or restyle UI elements. This plugin observes and interacts with the existing UI.