Gemini Computer Use: browser automation with AI vision

SKILL.md
---
name: computeruse
version: 1.0.0
description: Gemini Computer Use - Browser automation with AI vision
---

Computer Use Skill (Gemini Browser Automation)

Enables Rhea to see and control a browser using the Gemini 2.5 Computer Use model. The model analyzes screenshots of the page and generates the mouse and keyboard actions needed to reach the goal.

Features

  • Visual Understanding: AI sees the screen via screenshots
  • Browser Control: Click, type, scroll, navigate
  • Multi-step Tasks: Chains actions to complete complex goals
  • Safety Checks: Built-in confirmation for risky actions
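
The features above describe a screenshot, decide, act loop. A minimal sketch of that loop follows, assuming nothing about the skill's internals: `Action`, `Step`, `run_loop`, and the three callables are hypothetical names used only for illustration, and the returned `steps` key is likewise an assumption (only `final_answer` appears in this document).

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    """One UI action proposed by the model (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


@dataclass
class Step:
    """Model output for one screenshot: either an action or a final answer."""
    action: Optional[Action] = None
    final_answer: Optional[str] = None


def run_loop(
    goal: str,
    take_screenshot: Callable[[], bytes],
    decide: Callable[[str, bytes, list], Step],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> dict:
    """Repeat screenshot -> decision -> action until done or out of steps."""
    history: list = []
    for _ in range(max_steps):
        screenshot = take_screenshot()            # visual understanding
        step = decide(goal, screenshot, history)  # model picks the next move
        if step.final_answer is not None:
            return {"final_answer": step.final_answer, "steps": len(history)}
        if step.action is None:
            break                                 # nothing to do, stop early
        execute(step.action)                      # browser control
        history.append(step.action)
    return {"final_answer": None, "steps": len(history)}


def fake_decide(goal: str, screenshot: bytes, history: list) -> Step:
    """Stand-in for the model: click twice, then report completion."""
    if len(history) < 2:
        return Step(action=Action("click_at", {"x": 10, "y": 20}))
    return Step(final_answer="done")


executed: list = []
result = run_loop("demo goal", lambda: b"", fake_decide, executed.append)
print(result)  # {'final_answer': 'done', 'steps': 2}
```

Bounding the loop with `max_steps` keeps a confused model from clicking indefinitely, which is presumably why `run_task` exposes the same parameter.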

Capabilities

| Action | Description |
| --- | --- |
| `run_task` | Execute a multi-step browser task |
| `click_at` | Click at coordinates |
| `type_text_at` | Type text at coordinates |
| `navigate` | Go to a URL |
| `scroll` | Scroll the page |
| `take_screenshot` | Capture current state |

Requirements

```bash
pip install google-genai playwright
playwright install chromium
```

Usage Examples

Run a Web Research Task

```python
from rhea_noir.skills.computeruse.actions import skill as cu

# Start at Google and let the model work toward the goal in at most 10 steps.
result = cu.run_task(
    goal="Search for 'best AI frameworks 2026' on Google and list the top 3 results",
    start_url="https://www.google.com",
    max_steps=10,
)
print(result["final_answer"])
```

Automate Form Filling

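```python
from rhea_noir.skills.computeruse.actions import skill as cu

# The goal names each field value so the model knows what to type where.
result = cu.run_task(
    goal="Fill out the contact form with name 'Dave', email 'dave@example.com', and submit",
    start_url="https://example.com/contact",
)
```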

Web Scraping with Context

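```python
from rhea_noir.skills.computeruse.actions import skill as cu

# The model navigates and reads the page visually; no selectors are needed.
result = cu.run_task(
    goal="Go to Amazon and find the price of Sony WH-1000XM5 headphones",
    start_url="https://www.amazon.com",
)
```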

Safety Features

The model includes built-in safety checks:

  • require_confirmation: User must approve risky actions
  • Excluded actions: Block specific UI actions if needed

```python
result = cu.run_task(
    goal="...",
    excluded_actions=["drag_and_drop"],
    require_human_confirmation=True
)
```
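
The two checks above can be pictured as a small gate that runs before each action. This is a sketch under stated assumptions, not the skill's actual implementation: `may_run` and its `confirm` callback are hypothetical, and the only value taken from this document is the `require_confirmation` decision name.

```python
from typing import Callable, FrozenSet, Optional


def may_run(
    action_name: str,
    safety_decision: Optional[str],
    excluded_actions: FrozenSet[str] = frozenset(),
    confirm: Callable[[str], bool] = lambda prompt: False,
) -> bool:
    """Return True only if the action is allowed to execute."""
    if action_name in excluded_actions:
        return False  # blocked outright, the user is never asked
    if safety_decision == "require_confirmation":
        # The default callback denies, so unattended runs fail closed.
        return confirm(f"Allow '{action_name}'?")
    return True


print(may_run("click_at", None))
print(may_run("drag_and_drop", None, frozenset({"drag_and_drop"})))
print(may_run("click_at", "require_confirmation"))
```

For interactive use, a callback such as `lambda p: input(p + " [y/N] ").strip().lower() == "y"` could be passed as `confirm`.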

Supported UI Actions

| Action | Description |
| --- | --- |
| `open_web_browser` | Open browser |
| `navigate` | Go to URL |
| `click_at` | Click at (x, y) |
| `type_text_at` | Type text at (x, y) |
| `scroll_document` | Scroll up/down/left/right |
| `scroll_at` | Scroll at specific location |
| `hover_at` | Hover mouse at (x, y) |
| `key_combination` | Press keys (e.g., "Control+C") |
| `go_back` | Browser back |
| `go_forward` | Browser forward |
| `wait_5_seconds` | Wait for page load |
| `drag_and_drop` | Drag element |
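
As a rough illustration of how some of these actions could map onto Playwright, the sketch below dispatches an action name to calls on a Playwright-style `page` object. The `dispatch` and `denormalize` helpers, the argument keys (`url`, `x`, `y`, `text`, `keys`), and the default viewport size are assumptions, as is the premise that coordinates arrive on a normalized 0 to 999 grid that must be scaled to pixels; if the skill already supplies pixel coordinates, the scaling step would be dropped.

```python
def denormalize(value: int, size: int) -> int:
    """Scale a 0-999 grid coordinate to pixels (assumed convention)."""
    return int(value / 1000 * size)


def dispatch(page, name: str, args: dict, width: int = 1440, height: int = 900) -> None:
    """Translate one action into calls on a Playwright-style page object."""
    if name in ("click_at", "type_text_at", "hover_at"):
        x = denormalize(args["x"], width)
        y = denormalize(args["y"], height)

    if name == "navigate":
        page.goto(args["url"])
    elif name == "click_at":
        page.mouse.click(x, y)
    elif name == "type_text_at":
        page.mouse.click(x, y)           # focus the field first
        page.keyboard.type(args["text"])
    elif name == "hover_at":
        page.mouse.move(x, y)
    elif name == "key_combination":
        page.keyboard.press(args["keys"])  # e.g. "Control+C"
    elif name == "go_back":
        page.go_back()
    elif name == "go_forward":
        page.go_forward()
    elif name == "wait_5_seconds":
        page.wait_for_timeout(5000)      # milliseconds
    else:
        raise ValueError(f"unsupported action: {name}")
```

With a real browser, `page` would come from Playwright's sync API, for example `sync_playwright().start().chromium.launch().new_page()`. Scrolling and drag-and-drop are left out to keep the sketch short.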

Model

Uses gemini-2.5-computer-use-preview-10-2025, a model specialized for browser control.

> [!CAUTION]
> Computer Use is a Preview feature. Supervise closely for important tasks.