OpenBowie — Claude Code Skill
OpenBowie is an autonomous agent that runs tasks inside Docker containers. You can spin up agents, send them messages, and read their output programmatically.
Prerequisites
- •openbowie binary must be built (make openbowie in the openbowie repo)
- •Docker must be running
- •Agent image must be built (make agent-image)
- •At least one LLM config must exist in ~/.openbowie/configs/ (run openbowie onboard to set up)
Commands
Create a new task (headless)
openbowie new --headless --config <config_name> [--mcp <mcp_name>] [--soul <soul_name>] "task description"
Starts the agent in the background and prints the task ID to stdout. The agent immediately begins working on the task autonomously.
- •--config: Name of the LLM config (e.g., anthropic, openai)
- •--mcp: Optional MCP server config (e.g., duckduckgo, factor-mcp)
- •--soul: Optional soul/persona (default: default). Souls live in ~/.openbowie/souls/<name>.md
- •--headless: Required for programmatic use; skips the TUI
Example:
TASK_ID=$(openbowie new --headless --config anthropic --mcp duckduckgo "research the latest AI agent frameworks and summarize findings")
Send a message to a running agent
openbowie send <task_id> "your message here"
Sends a message to a running agent, waits for the response, and prints it to stdout. Tool calls are printed to stderr. Returns when the agent goes idle.
Example:
openbowie send "$TASK_ID" "now compare the top 3 frameworks by features"
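Because the response goes to stdout and tool calls go to stderr, the two streams can be captured separately. The following is a minimal sketch; the helper name send_and_capture and the output file names are illustrative and not part of OpenBowie.

```shell
# Hypothetical helper: send a message and keep the reply and the tool-call
# trace in separate files (stdout = response, stderr = tool calls).
send_and_capture() {
  local task_id="$1" message="$2" out_dir="$3"
  mkdir -p "$out_dir"
  openbowie send "$task_id" "$message" \
    > "$out_dir/response.txt" \
    2> "$out_dir/tool_calls.txt"
}
```

A call such as send_and_capture "$TASK_ID" "summarize your findings" ./out would leave the agent's reply in ./out/response.txt and the tool-call trace in ./out/tool_calls.txt.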
Read task files
openbowie read <task_id> [status|roadmap|memory|logs]
Reads a specific task file. If no file is specified, prints all files.
- •status: Current task status (short summary)
- •roadmap: Task plan with steps (checked off as completed)
- •memory: Conversation transcript and agent notes
- •logs: Activity log with timestamps and raw tool results
Examples:
# Read everything
openbowie read "$TASK_ID"

# Check current status
openbowie read "$TASK_ID" status

# See the roadmap
openbowie read "$TASK_ID" roadmap

# Read conversation transcript
openbowie read "$TASK_ID" memory
Other useful commands
# List all tasks
openbowie list

# Stop a running task
openbowie stop <task_id>

# Remove a task and its data
openbowie rm <task_id>

# Remove all containers and agent image
openbowie clean
Typical workflow
- •Create a task with --headless and capture the task ID
- •Wait a few seconds for the agent to start and begin working
- •Poll openbowie read <id> status to check progress
- •Send follow-up messages with openbowie send <id> "message" if needed
- •Read final results with openbowie read <id> roadmap or openbowie read <id> memory
- •Stop and clean up with openbowie stop <id> and openbowie rm <id>
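The steps above can be sketched as a single shell function. This is an illustration under stated assumptions, not a definitive script: the wording of a finished status is not specified here, so DONE_PATTERN is a placeholder to adjust, and run_task, MAX_POLLS and POLL_INTERVAL are hypothetical names.

```shell
# Sketch of the create -> poll -> read -> clean-up workflow.
# Assumption: a finished task's status text matches DONE_PATTERN. The real
# status wording is not documented here, so treat "done" as a placeholder.
run_task() {
  local config="$1" prompt="$2"
  local done_pattern="${DONE_PATTERN:-done}"
  local max_polls="${MAX_POLLS:-60}" interval="${POLL_INTERVAL:-5}"
  local task_id i=0

  # Create the task and capture its ID from stdout.
  task_id=$(openbowie new --headless --config "$config" "$prompt") || return 1

  # Poll the status file; give up after max_polls attempts.
  while [ "$i" -lt "$max_polls" ]; do
    if openbowie read "$task_id" status | grep -qi "$done_pattern"; then
      break
    fi
    sleep "$interval"
    i=$((i + 1))
  done

  # Print the transcript, then stop and remove the task.
  openbowie read "$task_id" memory
  openbowie stop "$task_id" >/dev/null
  openbowie rm "$task_id" >/dev/null
}
```

Note that this sketch reads the transcript and cleans up even when polling runs out, so a slow task may be removed before it finishes; raise MAX_POLLS or drop the clean-up step if that matters.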
Notes
- •The agent starts working immediately after creation — no need to send a first message
- •The agent is autonomous: it creates a roadmap, executes steps, and recovers from errors on its own
- •openbowie send blocks until the agent finishes responding (timeout: up to 5 minutes)
- •Task files are persisted on disk at ~/.openbowie/tasks/task_<id>/
- •If the agent's container stops, openbowie send will return an error; use openbowie restart <id> to restart it
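The restart advice in the last note could be wrapped in a small retry helper. The sketch below assumes that openbowie send exits with a non-zero status when it returns an error, which is not confirmed above; send_with_restart is a hypothetical name.

```shell
# Sketch: retry a send once after restarting the agent's container.
# Assumption: `openbowie send` exits non-zero when it reports an error.
send_with_restart() {
  local task_id="$1" message="$2"
  if ! openbowie send "$task_id" "$message"; then
    # Keep restart chatter off stdout so only the agent's reply is printed.
    openbowie restart "$task_id" >&2 || return 1
    openbowie send "$task_id" "$message"
  fi
}
```

The helper retries any failed send, not only those caused by a stopped container, so a timeout would also trigger a restart.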
MCP Server Testing with OpenBowie
Use OpenBowie as a test harness to verify that different LLM models can correctly use MCP server tools. This is essential for validating tool descriptions, parameter schemas, and multi-step workflows across models.
Why Test MCP Servers with Multiple Models?
- •Different models interpret tool descriptions differently
- •Parameter naming and schema clarity affect model success rates
- •Multi-step workflows expose chaining issues
- •Testing reveals where tool descriptions need improvement
Test Structure
Create a test directory inside your MCP server project:
<your-mcp>/tests/openbowie/
├── checklist.md # Master test list with pass/fail criteria
├── prompts/ # Test case prompt files (one per test)
│ ├── 01-basic-query.txt
│ ├── 02-parameter-test.txt
│ ├── 03-multi-step.txt
│ └── ...
└── results/
    └── <model-name>/ # One folder per model tested
        ├── 01-basic-query/
        │   ├── memory.txt # Agent's reasoning and output
        │   ├── logs.txt # Full tool call logs
        │   └── verdict.md # PASS/PARTIAL/FAIL + notes
        └── summary.md # Aggregated results
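The layout above can be created up front with a few mkdir calls. This is a minimal sketch; scaffold_openbowie_tests is a hypothetical helper, and it creates only the directories and an empty checklist.md, leaving prompts and verdicts to be written by hand.

```shell
# Sketch: create the test directory skeleton for one MCP project and model.
scaffold_openbowie_tests() {
  local root="$1" model="$2"
  mkdir -p "$root/tests/openbowie/prompts" \
           "$root/tests/openbowie/results/$model"
  # Create an empty checklist only if none exists yet.
  [ -e "$root/tests/openbowie/checklist.md" ] || : > "$root/tests/openbowie/checklist.md"
}
```

For example, scaffold_openbowie_tests . minimax run from the MCP server directory would prepare tests/openbowie/ for a first model.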
Writing Test Prompts
Prompts should be written in natural language, the way a real user would type them. Don't tell the model which tools to call or what parameters to use. The test is whether the model can figure that out from the tool descriptions alone.
Principles:
- •Write like a human — no tool names, no parameter hints
- •Start simple — "What's the config?" before "Set up full lending from scratch"
- •Include context when needed — vault addresses, chain names, token names
- •Gradually increase complexity — single intent → multi-step workflows
Example prompt files:
What chain am I on right now? Is there a wallet configured?
Deploy a new vault on Arbitrum that uses USDC. Call it "My Vault" with symbol "MV".
I have a vault at 0x1234...abcd. Set it up for Aave lending with USDC and then supply some.
Running Tests Interactively
Launch tests from the MCP server directory:
# Launch a single test
TASK_ID=$(openbowie new --headless --config <model> --mcp <your-mcp> "$(cat tests/openbowie/prompts/01-basic-query.txt)")

# Check status
openbowie list

# Read agent output when done
openbowie read "$TASK_ID" memory

# Read full tool call logs
openbowie logs "$TASK_ID"

# Save results
mkdir -p tests/openbowie/results/<model>/01-basic-query
openbowie read "$TASK_ID" memory > tests/openbowie/results/<model>/01-basic-query/memory.txt
openbowie logs "$TASK_ID" > tests/openbowie/results/<model>/01-basic-query/logs.txt

# Clean up
openbowie rm "$TASK_ID"
Running Tests in Parallel
T1=$(openbowie new --headless --config minimax --mcp <your-mcp> "$(cat tests/openbowie/prompts/01-basic-query.txt)")
T2=$(openbowie new --headless --config minimax --mcp <your-mcp> "$(cat tests/openbowie/prompts/02-parameter-test.txt)")
T3=$(openbowie new --headless --config minimax --mcp <your-mcp> "$(cat tests/openbowie/prompts/03-multi-step.txt)")

# Monitor all
openbowie list
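Since each headless launch returns as soon as it has printed the task ID, a loop over the prompts directory starts the whole suite in parallel. The sketch below is one possible arrangement; launch_suite, collect_suite and the "name task-id" file format are hypothetical, and it saves logs through openbowie read <task_id> logs as described under Read task files.

```shell
# Sketch: launch one headless task per prompt file and record
# "test-name task-id" pairs, one per line, in ids_file.
launch_suite() {
  local config="$1" mcp="$2" prompts_dir="$3" ids_file="$4"
  local prompt name task_id
  : > "$ids_file"
  for prompt in "$prompts_dir"/*.txt; do
    [ -e "$prompt" ] || continue
    name=$(basename "$prompt" .txt)
    task_id=$(openbowie new --headless --config "$config" --mcp "$mcp" "$(cat "$prompt")") || continue
    printf '%s %s\n' "$name" "$task_id" >> "$ids_file"
  done
}

# Sketch: once the tasks have finished, save memory and logs per test
# and remove each task. Does not wait for completion on its own.
collect_suite() {
  local ids_file="$1" results_dir="$2"
  local name task_id
  while read -r name task_id; do
    mkdir -p "$results_dir/$name"
    # </dev/null keeps the commands from consuming the loop's input.
    openbowie read "$task_id" memory > "$results_dir/$name/memory.txt" </dev/null
    openbowie read "$task_id" logs > "$results_dir/$name/logs.txt" </dev/null
    openbowie rm "$task_id" >/dev/null </dev/null
  done < "$ids_file"
}
```

A typical run would call launch_suite minimax <your-mcp> tests/openbowie/prompts ids.txt, watch openbowie list until the tasks settle, and then call collect_suite ids.txt tests/openbowie/results/minimax.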
Pass/Fail Criteria
- •PASS: Correct tool(s) called with correct parameters, valid response, structured output matches expected values
- •PARTIAL: Right tools but wrong parameters, or incomplete/garbled results
- •FAIL: Wrong tools, hallucinated results, crashed, or timed out
Iterating on Failures
When a model fails a test:
- •Read logs.txt to see exactly what tool calls were made
- •Determine root cause: model issue, tool description issue, or prompt issue
- •If tool descriptions need improvement, update your MCP server code, rebuild, and re-test
- •If the prompt was confusing, rewrite it to be more explicit
- •Re-run only the failing test
Comparing Models
Run the same test suite against multiple LLM configs:
openbowie new --headless --config minimax --mcp <your-mcp> "$(cat prompt.txt)"
openbowie new --headless --config openai --mcp <your-mcp> "$(cat prompt.txt)"
openbowie new --headless --config anthropic --mcp <your-mcp> "$(cat prompt.txt)"
Build a comparison matrix in summary.md:
| Test | MiniMax | GPT-4o | Claude | Qwen |
|---|---|---|---|---|
| 01 | PASS | PASS | PASS | PASS |
| 02 | PASS | PARTIAL | PASS | FAIL |
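A matrix like this could be generated from the per-test verdict files rather than typed by hand. The sketch below assumes each verdict.md contains the word PASS, PARTIAL or FAIL on its first matching line, and that the first model listed has a folder for every test; build_matrix is a hypothetical helper, and column headers are the folder names rather than display names.

```shell
# Sketch: print a markdown comparison matrix from verdict.md files.
# Usage: build_matrix <results_dir> <model> [<model> ...]
# Assumption: each verdict.md mentions PASS, PARTIAL or FAIL; the first
# such word found is taken as the verdict.
build_matrix() {
  local results_dir="$1"; shift
  local model test_dir name verdict header sep row
  header="| Test |"; sep="|---|"
  for model in "$@"; do
    header="$header $model |"
    sep="$sep---|"
  done
  printf '%s\n%s\n' "$header" "$sep"
  # Enumerate test names from the first model's result folders.
  for test_dir in "$results_dir/$1"/*/; do
    [ -d "$test_dir" ] || continue
    name=$(basename "$test_dir")
    row="| $name |"
    for model in "$@"; do
      verdict=$(grep -oE 'PASS|PARTIAL|FAIL' "$results_dir/$model/$name/verdict.md" 2>/dev/null | head -n 1)
      row="$row ${verdict:-n/a} |"
    done
    printf '%s\n' "$row"
  done
}
```

Tests with no verdict file for a given model show up as n/a, which makes gaps in coverage visible when the output is pasted into summary.md.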
Test Categories to Cover
- •Single-tool read-only — Can the model call one tool correctly?
- •Single-tool with parameters — Can it pass required/optional params?
- •Multi-step sequential — Can it chain 3-4 tools in order?
- •Cross-context — Can it use output from tool A as input to tool B?
- •Write operations — Can it call mutation tools with correct params? (may revert — that's fine, we're testing the call format)
- •Error handling — Does it handle tool errors gracefully?