cua-cloud

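Comprehensive guide to building Computer Use Agents with the CUA framework. Use this skill when automating desktop applications, building vision-based agents, controlling virtual machines (Linux/Windows/macOS), or integrating computer-use models from Anthropic, OpenAI, and other providers. Covers the Computer SDK (click, type, scroll, screenshot), the Agent SDK (model configuration, composition), supported models, provider configuration, and MCP integration.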

SKILL.md
--- frontmatter
name: cua-cloud
description: Comprehensive guide for building Computer Use Agents with the CUA framework. This skill should be used when automating desktop applications, building vision-based agents, controlling virtual machines (Linux/Windows/macOS), or integrating computer-use models from Anthropic, OpenAI, or other providers. Covers Computer SDK (click, type, scroll, screenshot), Agent SDK (model configuration, composition), supported models, provider setup, and MCP integration.

CUA Framework

Overview

CUA ("koo-ah") is an open-source framework for building Computer Use Agents: AI systems that see, understand, and interact with desktop applications through vision and action. It automates Windows, Linux, and macOS desktops.

Key capabilities:

  • Vision-based UI automation via screenshot analysis
  • Multi-platform desktop control (click, type, scroll, drag)
  • 100+ LLM providers via LiteLLM integration
  • Composed agents (grounding + planning models)
  • Local and cloud execution options

Installation

bash
# Computer SDK - desktop control
pip install cua-computer

# Agent SDK - autonomous agents
pip install "cua-agent[all]"  # Quoted so shells such as zsh do not expand the brackets

# MCP Server (optional)
pip install cua-mcp-server

CLI Installation:

bash
# macOS/Linux
curl -LsSf https://cua.ai/cli/install.sh | sh

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://cua.ai/cli/install.ps1 | iex"

Computer SDK

Computer Class

python
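import asyncio
import os

from computer import Computer

# Placeholder key; load a real key from your environment or a secrets store
os.environ["CUA_API_KEY"] = "sk_cua-api01_..."


async def main():
    computer = Computer(
        os_type="linux",        # "linux" | "macos" | "windows"
        provider_type="cloud",  # "cloud" | "docker" | "lume" | "windows_sandbox"
        name="sandbox-name"
    )
    try:
        await computer.run()
        # Use computer.interface methods here
    finally:
        await computer.close()  # Always release the sandbox


# `await` is only valid inside a coroutine, so run everything through asyncio
asyncio.run(main())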

Interface Methods

Screenshot:

python
screenshot = await computer.interface.screenshot()

Mouse Actions:

python
await computer.interface.left_click(x, y)      # Left click at coordinates
await computer.interface.right_click(x, y)     # Right click
await computer.interface.double_click(x, y)    # Double click
await computer.interface.move_cursor(x, y)     # Move cursor without clicking
await computer.interface.drag(x1, y1, x2, y2)  # Click and drag

Keyboard Actions:

python
await computer.interface.type_text("Hello!")   # Type text
await computer.interface.key_press("enter")    # Press single key
await computer.interface.hotkey("ctrl", "c")   # Key combination

Scrolling:

python
await computer.interface.scroll(direction, amount)  # Scroll up/down/left/right

File Operations:

python
content = await computer.interface.read_file("/path/to/file")
await computer.interface.write_file("/path/to/file", "content")

Clipboard:

python
text = await computer.interface.get_clipboard()
await computer.interface.set_clipboard("text to copy")

Supported Actions (Message Format)

OpenAI-style:

  • ClickAction - button: left/right/wheel/back/forward, x, y coordinates
  • DoubleClickAction - same parameters as click
  • DragAction - start and end coordinates
  • KeyPressAction - key name
  • MoveAction - x, y coordinates
  • ScreenshotAction - no parameters
  • ScrollAction - direction and amount
  • TypeAction - text string
  • WaitAction - duration

Anthropic-style:

  • LeftMouseDownAction - x, y coordinates
  • LeftMouseUpAction - x, y coordinates

Agent SDK

ComputerAgent Class

python
from agent import ComputerAgent

agent = ComputerAgent(
    model="anthropic/claude-sonnet-4-5-20250929",
    tools=[computer],          # A Computer instance that has already been started
    max_trajectory_budget=5.0  # Cost limit in USD
)

messages = [{"role": "user", "content": "Open Firefox and go to google.com"}]

# Run this loop inside an async function (or a notebook with top-level await)
async for result in agent.run(messages):
    for item in result["output"]:
        if item["type"] == "message":
            print(item["content"][0]["text"])

Response Structure

python
{
    "output": [AgentMessage, ...],  # List of messages
    "usage": {
        "prompt_tokens": int,
        "completion_tokens": int,
        "total_tokens": int,
        "response_cost": float
    }
}
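The `usage` block can be aggregated across a run to see what a task cost. This is a sketch under one assumption: each yielded result reports usage for that step only, not a running total. The helper name `total_usage` is hypothetical.

```python
from typing import Any, Iterable


def total_usage(results: Iterable[dict[str, Any]]) -> dict[str, float]:
    """Sum the usage fields across results yielded by an agent run."""
    totals = {
        "prompt_tokens": 0,
        "completion_tokens": 0,
        "total_tokens": 0,
        "response_cost": 0.0,
    }
    for result in results:
        usage = result.get("usage", {})
        for key in totals:
            # Missing fields count as zero rather than raising.
            totals[key] += usage.get(key, 0)
    return totals
```

If the framework turns out to report cumulative usage, the last result's `usage` would be the total and this summation would overcount.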

Message Types:

  • UserMessage - Input from user/system
  • AssistantMessage - Text output from agent
  • ReasoningMessage - Agent thinking/summary
  • ComputerCallMessage - Intent to perform action
  • ComputerCallOutputMessage - Screenshot result
  • FunctionCallMessage - Python tool invocation
  • FunctionCallOutputMessage - Function result
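A result's `output` list mixes these message types, so a consumer usually dispatches on each item's `type` field. The sketch below assumes the `"message"` type string and the `content[0]["text"]` layout used in the run loop earlier. Other type strings, such as `"computer_call"`, are assumptions, and `summarize_output` is a hypothetical helper.

```python
from collections import Counter
from typing import Any


def summarize_output(output: list[dict[str, Any]]) -> tuple[list[str], Counter]:
    """Return assistant text and a per-type item count for one result's output."""
    texts: list[str] = []
    counts: Counter = Counter()
    for item in output:
        item_type = item.get("type", "unknown")
        counts[item_type] += 1
        if item_type == "message":
            # Collect every text part, not only content[0].
            for part in item.get("content", []):
                if isinstance(part, dict) and "text" in part:
                    texts.append(part["text"])
    return texts, counts
```

Counting items per type gives a quick view of how many actions the agent took compared with how much it said.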

Supported Models

CUA VLM Router (Recommended)

python
model="cua/anthropic/claude-sonnet-4.5"  # Recommended
model="cua/anthropic/claude-haiku-4.5"   # Faster, cheaper

The router provides a single API key, cost tracking, and managed infrastructure.

Anthropic (BYOK)

python
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."

model="anthropic/claude-sonnet-4-5-20250929"
model="anthropic/claude-haiku-4-5-20251001"
model="anthropic/claude-opus-4-20250514"
model="anthropic/claude-3-7-sonnet-20250219"

OpenAI (BYOK)

python
os.environ["OPENAI_API_KEY"] = "sk-..."

model="openai/computer-use-preview"

Google Gemini

python
model="gemini-2.5-computer-use-preview-10-2025"

Local Models

python
model="huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B"
model="ollama_chat/0000/ui-tars-1.5-7b"

Composed Agents

Combine grounding models with planning models:

python
model="huggingface-local/GTA1-7B+openai/gpt-4o"
model="moondream3+openai/gpt-4o"
model="omniparser+anthropic/claude-sonnet-4-5-20250929"
model="omniparser+ollama_chat/mistral-small3.2"

Grounding Models: UI-TARS, GTA, Holo, Moondream, OmniParser, OpenCUA
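The composed model strings above join a grounding model and a planning model with `+`. The sketch below only illustrates that naming convention. It is not CUA's own parser, and `split_composed_model` is a hypothetical helper.

```python
from typing import Optional


def split_composed_model(model: str) -> tuple[Optional[str], str]:
    """Split "grounding+planning" into parts; plain models have no grounding part."""
    grounding, sep, planning = model.partition("+")
    if not sep:
        # No "+" present: a single model handles both planning and grounding.
        return None, model
    if not grounding or not planning:
        raise ValueError(f"malformed composed model string: {model!r}")
    return grounding, planning
```

For example, `split_composed_model("omniparser+ollama_chat/mistral-small3.2")` separates the OmniParser grounding model from the Ollama planning model.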

Human-in-the-Loop

python
model="human/human"  # Pause for user approval

Provider Types

Cloud (Recommended)

python
computer = Computer(
    os_type="linux",  # linux, windows, macos
    provider_type="cloud",
    name="sandbox-name",
    api_key="sk_cua-api01_..."
)

Get an API key from cloud.trycua.com.

Docker (Local)

python
computer = Computer(
    os_type="linux",
    provider_type="docker"
)

Images: trycua/cua-xfce:latest, trycua/cua-ubuntu:latest

Lume (macOS Local)

python
computer = Computer(
    os_type="macos",
    provider_type="lume"
)

Requires the Lume CLI to be installed.

Windows Sandbox

python
computer = Computer(
    os_type="windows",
    provider_type="windows_sandbox"
)

Requires pywinsandbox and the Windows Sandbox feature to be enabled.
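When a script needs to run on several host machines, the local provider can be chosen from the host operating system. This sketch uses only the `provider_type` values listed above. The mapping itself is an assumption about sensible defaults, and `pick_local_provider` is a hypothetical helper.

```python
import platform
from typing import Optional

# Hypothetical mapping from host OS to the local provider types listed above.
LOCAL_PROVIDERS = {
    "Darwin": "lume",
    "Windows": "windows_sandbox",
    "Linux": "docker",
}


def pick_local_provider(system: Optional[str] = None) -> str:
    """Return a local provider_type for the host OS, defaulting to Docker."""
    system = system or platform.system()
    return LOCAL_PROVIDERS.get(system, "docker")
```

The returned string could then be passed as `provider_type` when constructing a `Computer`. Docker is used as the fallback here only because it is the most widely available of the three.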

MCP Integration

This project uses the CUA MCP Server for Claude Code integration:

json
{
  "mcpServers": {
    "cua": {
      "type": "http",
      "url": "https://cua-mcp-server.vercel.app/mcp"
    }
  }
}
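If a project already has an MCP configuration, the `cua` entry has to be merged in rather than overwriting the other servers. This sketch does that on an in-memory dictionary. The URL is the one from the config above, and `add_cua_server` is a hypothetical helper; where the file lives and how it is loaded are left to the caller.

```python
import copy
from typing import Any

CUA_MCP_URL = "https://cua-mcp-server.vercel.app/mcp"  # URL from the config above


def add_cua_server(config: dict[str, Any], url: str = CUA_MCP_URL) -> dict[str, Any]:
    """Return a copy of an MCP config with the "cua" HTTP server entry added."""
    merged = copy.deepcopy(config)  # Leave the caller's config untouched
    merged.setdefault("mcpServers", {})["cua"] = {"type": "http", "url": url}
    return merged
```

Calling `json.dumps(add_cua_server({}), indent=2)` gives JSON equivalent to the configuration shown above.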

MCP Tools Available

Sandbox Management:

  • mcp__cua__list_sandboxes - List all sandboxes
  • mcp__cua__create_sandbox - Create VM (os, size, region)
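  • mcp__cua__start_sandbox, mcp__cua__stop_sandbox, mcp__cua__restart_sandbox, mcp__cua__delete_sandbox - Start, stop, restart, or delete a sandbox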

Task Execution:

  • mcp__cua__run_task - Autonomous task execution
  • mcp__cua__describe_screen - Vision analysis without action
  • mcp__cua__get_task_history - Retrieve task results

Best Practices

Task Design

python
# Good - specific and sequential
"Open Chrome, navigate to github.com, click the Sign In button"

# Avoid - vague
"Log into GitHub"

Error Recovery

python
async for result in agent.run(messages):
    if result.get("error"):
        # Take screenshot to understand state
        screenshot = await computer.interface.screenshot()
        # Retry with more specific instructions
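The retry step above can be made explicit with a small wrapper. This is a sketch, assuming results signal failure through an `"error"` key as in the loop above. `run_with_retries` and `fake_run` are hypothetical, with `fake_run` standing in for `agent.run(messages)` so the example runs without a sandbox.

```python
import asyncio
from typing import Any, AsyncIterator, Callable


async def run_with_retries(
    run_once: Callable[[int], AsyncIterator[dict[str, Any]]],
    max_attempts: int = 3,
) -> list[dict[str, Any]]:
    """Re-run a task until one attempt finishes without an "error" result."""
    for attempt in range(1, max_attempts + 1):
        results = []
        failed = False
        async for result in run_once(attempt):
            results.append(result)
            if result.get("error"):
                failed = True
                break  # Abandon this attempt at the first error
        if not failed:
            return results
    raise RuntimeError(f"task failed after {max_attempts} attempts")


async def fake_run(attempt: int) -> AsyncIterator[dict[str, Any]]:
    """Stand-in for agent.run(messages): fails on the first attempt only."""
    if attempt == 1:
        yield {"output": [], "error": "element not found"}
    else:
        yield {"output": [{"type": "message"}]}


if __name__ == "__main__":
    print(asyncio.run(run_with_retries(fake_run)))
```

Because `run_once` receives the attempt number, a real callback could rewrite the instructions to be more specific on each retry.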

Resource Management

python
try:
    await computer.run()
    # ... perform tasks
finally:
    await computer.close()  # Always cleanup
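The same try/finally pattern can be wrapped in an async context manager so cleanup is never forgotten. This is a sketch that relies only on the `run()` and `close()` methods shown above. `running` is a hypothetical helper, and `FakeComputer` is a stand-in that records calls so the example runs without a sandbox.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Any, AsyncIterator


@asynccontextmanager
async def running(computer: Any) -> AsyncIterator[Any]:
    """Start a computer on entry and always close it on exit."""
    try:
        await computer.run()
        yield computer
    finally:
        await computer.close()


class FakeComputer:
    """Hypothetical stand-in for Computer that records lifecycle calls."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    async def run(self) -> None:
        self.calls.append("run")

    async def close(self) -> None:
        self.calls.append("close")


async def demo() -> list[str]:
    fake = FakeComputer()
    try:
        async with running(fake):
            raise RuntimeError("task failed")  # Simulate a failing task
    except RuntimeError:
        pass
    return fake.calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Even though the task raises, `close()` still runs, which is the behavior the try/finally block above is meant to guarantee.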

Cost Control

python
agent = ComputerAgent(
    model="cua/anthropic/claude-sonnet-4.5",
    tools=[computer],
    max_trajectory_budget=5.0  # Stop once the run has cost $5
)
