Agentic Testing
Verify code changes by observing the running app. Build, start via named pipe, interact via stdin JSON, capture screenshots, read logs.
When to Use
- After implementing any UI, protocol, or behavior change
- For routine UI and behavior work, this is the default smoke test
- When Oracle's autonomous verification says "Run the agentic-testing skill"
- Before marking a task as complete
- Especially after changes to: prompts, views, keyboard handlers, ACP chat, actions dialog
Safety Rules (MANDATORY)
- NEVER delete files, directories, or data
- NEVER modify databases, user data, or production state
- NEVER run destructive commands (rm -rf, DROP, git push --force, git reset --hard)
- NEVER send requests to external services, APIs, or webhooks
- NEVER modify files outside the project directory
- NEVER commit, push, or modify git history
- ALL verification is read-only: build, launch, screenshot, grep, read logs
- Temp pipes/logs may live under /tmp via the session wrapper. Runtime captureWindow screenshots must go in the project's .test-screenshots/ or test-screenshots/ directory, or in ~/.scriptkit/screenshots
- The app runs locally only; never connect to production
- Every verification run MUST stop every Script Kit process/session it started before reporting results
Surface Identity Rules (MANDATORY)
- Always verify the real user-facing surface through its real runtime entry path first.
- For Script Kit UI, prefer stdin JSON commands, built-in routing, and real app windows over ad hoc component harnesses.
- Never treat an isolated GPUI entity, temporary debug window, story, off-screen render, or synthetic wrapper as proof of a real product surface unless the user explicitly asks for component-level verification.
- Before trusting a screenshot, confirm the captured surface matches the intended product surface:
  - same entry path
  - same window/shell
  - same wrapper/root chrome
  - same footer, sizing, and layout structure
- If the screenshot does not clearly match the real surface, stop and re-route verification to the real surface instead of iterating on the fake one.
- For ACP specifically, AcpChatView in isolation is not sufficient proof. Default to the real ACP entry path (triggerBuiltin tab-ai, detached chat window routing, or another production runtime path) before using any synthetic ACP harness.
The Pattern
Every verification follows the same core loop:
1. Build
cargo build 2>&1 | tail -5
The output must include Finished. If the build fails, fix the build error before continuing.
2. Start a Session (Preferred)
# Start or resume a named session (works from any shell)
# session.sh waits for the APP_READY log marker instead of sleeping
SESSION_JSON="$(bash scripts/agentic/session.sh start default 2>/dev/null)"
APP_PID="$(printf '%s' "$SESSION_JSON" | jq -r '.pid')"
PIPE="$(printf '%s' "$SESSION_JSON" | jq -r '.pipe')"
LOG="$(printf '%s' "$SESSION_JSON" | jq -r '.log')"
READY="$(printf '%s' "$SESSION_JSON" | jq -r '.ready // false')"
READY_WAIT_MS="$(printf '%s' "$SESSION_JSON" | jq -r '.readyWaitMs // 0')"

# Fallback only if readiness marker was not observed.
if [ "$READY" != "true" ]; then
  sleep 0.5
fi
The session wrapper manages the named pipe, forwarder process, and PID tracking.
Sessions are reusable across shells — no exec 3> / fd 3 trick required.
session.sh start means the app is stdin-ready, not necessarily capture-ready.
Before screenshot proof, send {"type":"show"} and allow a short settle.
Session commands:
bash scripts/agentic/session.sh start [NAME]                # Create or resume (default: "default")
bash scripts/agentic/session.sh send NAME CMD               # Send JSON command
bash scripts/agentic/session.sh rpc NAME CMD --expect TYPE  # Send request and wait for a typed response
bash scripts/agentic/session.sh status [NAME]               # Check session state (JSON)
bash scripts/agentic/session.sh stop [NAME]                 # Stop and clean up
bun scripts/agentic/session-state.ts --session NAME         # Detailed state report
bun scripts/agentic/session-state.ts --list                 # List all sessions
All commands emit stable JSON envelopes on stdout (schemaVersion, status, payload).
Diagnostics go to stderr.
start is idempotent — re-running it resumes an existing healthy session.
Alternative (legacy, single-shell only):
PIPE=$(mktemp -u)
mkfifo "$PIPE"
export SCRIPT_KIT_AI_LOG=1
./target/debug/script-kit-gpui < "$PIPE" > /tmp/sk-test.log 2>&1 &
APP_PID=$!
exec 3>"$PIPE"
sleep 3
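The legacy path above waits with a fixed sleep 3. The timing rules below discourage that kind of wait. A readiness check can poll the log for a marker instead. Below is a minimal sketch of that idea. It assumes the app writes the literal string APP_READY to its log when it can accept stdin commands; the session.sh comment implies this, but it is unverified for the legacy log. The 25 ms interval mirrors the polling interval documented for waitFor. The function name wait_for_log_marker is hypothetical and is not part of scripts/agentic.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll LOG_FILE for MARKER instead of sleeping a fixed time.
# Assumes the app writes the marker (e.g. APP_READY) to its log when ready.
# Usage: wait_for_log_marker LOG_FILE MARKER TIMEOUT_MS
wait_for_log_marker() {
  local log="$1" marker="$2" timeout_ms="$3"
  local interval_ms=25 waited_ms=0
  while [ "$waited_ms" -lt "$timeout_ms" ]; do
    # -F: match the marker as a fixed string; -q: stop at the first match.
    if [ -f "$log" ] && grep -qF -- "$marker" "$log"; then
      return 0
    fi
    sleep 0.025
    # Approximate: counts sleep time only, not the time spent in grep.
    waited_ms=$((waited_ms + interval_ms))
  done
  return 1
}
```

In the legacy flow, wait_for_log_marker /tmp/sk-test.log APP_READY 8000 would take the place of sleep 3. A non-zero return means the marker never appeared within the timeout.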
3. Show the Window
# Session-based (any shell)
bash scripts/agentic/session.sh send default '{"type":"show"}'
sleep 1.5
The app starts hidden. Always send show first.
4. Interact
Send commands via the session. Common commands:
S="bash scripts/agentic/session.sh send default"
# Set filter text
$S '{"type":"setFilter","text":"search term"}'
# Discover visible elements (returns semantic IDs)
bash scripts/agentic/session.sh rpc default '{"type":"getElements","requestId":"e1"}' --expect elementsResult
# Select element by semantic ID (from getElements response)
bash scripts/agentic/session.sh rpc default '{"type":"batch","requestId":"b1","commands":[{"type":"selectBySemanticId","semanticId":"choice:0:apple","submit":true}]}' --expect batchResult
# Trigger a built-in view
$S '{"type":"triggerBuiltin","name":"clipboard"}'
$S '{"type":"triggerBuiltin","name":"tab-ai"}'
$S '{"type":"triggerBuiltin","name":"emoji"}'
$S '{"type":"triggerBuiltin","name":"apps"}'
$S '{"type":"triggerBuiltin","name":"file-search"}'
# Simulate keys (dispatches to current view)
$S '{"type":"simulateKey","key":"enter","modifiers":[]}'
$S '{"type":"simulateKey","key":"escape","modifiers":[]}'
$S '{"type":"simulateKey","key":"k","modifiers":["cmd"]}'
$S '{"type":"simulateKey","key":"w","modifiers":["cmd"]}'
# Type individual characters (for views with text input)
$S '{"type":"simulateKey","key":"h","modifiers":[]}'
# Query ACP state (returns input, cursor, picker, accepted item, thread status)
bash scripts/agentic/session.sh rpc default '{"type":"getAcpState","requestId":"acp1"}' --expect acpStateResult
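Typing a word with simulateKey takes one command per character. The sketch below generates those commands so they can be piped to session.sh send. It accepts only lowercase letters and digits, because the key names for spaces, punctuation, and shifted characters are not documented here. The function name type_text_cmds is hypothetical. For ACP picker testing, native input through macos-input.ts --ensure-focus is still the documented path.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one simulateKey JSON command per character of TEXT.
# Only lowercase letters and digits are supported (other key names are unknown here).
# Usage: type_text_cmds TEXT
type_text_cmds() {
  local text="$1" i
  # Validate the whole string first so nothing is emitted for bad input.
  case "$text" in
    *[![:lower:][:digit:]]*)
      echo "type_text_cmds: only lowercase letters and digits are supported" >&2
      return 1 ;;
  esac
  for (( i = 0; i < ${#text}; i++ )); do
    printf '{"type":"simulateKey","key":"%s","modifiers":[]}\n' "${text:i:1}"
  done
}

# Example (requires a running session); 0.1s matches the text-input timing guideline:
#   type_text_cmds hello | while IFS= read -r cmd; do
#     bash scripts/agentic/session.sh send default "$cmd"
#     sleep 0.1
#   done
```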
5. Capture Screenshots
mkdir -p .test-screenshots
bash scripts/agentic/session.sh send default '{"type":"captureWindow","title":"","path":"'"$(pwd)"'/.test-screenshots/step-01.png"}'
sleep 1
- title is a substring match; "" matches any window.
- For embedded ACP in the main Script Kit window, use title: "" or the resolver-driven verify-shot.ts/window.ts flow. Do not assume the title contains ACP Chat.
- Path must be absolute: use the $(pwd)/ prefix.
- Runtime captureWindow does not allow arbitrary /tmp/*.png output paths.
- Always sleep 1 after capture so the file write completes.
- The screenshot must come from the real runtime surface you are verifying, not a synthetic component window.
- Read the PNG to visually verify. Never assume correctness without checking.
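The path rules above are easy to get wrong by hand: the path must be absolute and must stay inside the project's screenshot directory. The sketch below builds the captureWindow command from a short label. The JSON shape is copied from the example in this step. The helper name capture_cmd and the LABEL.png naming scheme are hypothetical. The helper does no JSON escaping, so it rejects titles containing quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a captureWindow command whose path is absolute
# and lives under the project's .test-screenshots/ directory.
# Usage: capture_cmd LABEL [TITLE]   (TITLE defaults to "", which matches any window)
capture_cmd() {
  local label="$1" title="${2-}"
  # Keep the label filename-safe so the path cannot escape the directory.
  case "$label" in
    ''|*[![:alnum:]._-]*)
      echo "capture_cmd: invalid label '$label'" >&2
      return 1 ;;
  esac
  # No JSON escaping is done here, so refuse titles that would need it.
  case "$title" in
    *'"'*|*'\'*)
      echo "capture_cmd: title must not contain quotes or backslashes" >&2
      return 1 ;;
  esac
  mkdir -p .test-screenshots
  printf '{"type":"captureWindow","title":"%s","path":"%s/.test-screenshots/%s.png"}\n' \
    "$title" "$PWD" "$label"
}

# Example (requires a running session):
#   bash scripts/agentic/session.sh send default "$(capture_cmd step-01)"
#   sleep 1   # let the PNG finish writing
```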
6. Read Logs
grep -i "keyword" /tmp/sk-test.log | head -20
Log format: TIMESTAMP|LEVEL|CATEGORY|cid=CORRELATION_ID message
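Because each log line is pipe-delimited, awk can narrow the log to one level and, optionally, one correlation ID. This is more precise than a plain keyword grep. The sketch below assumes that '|' never appears inside the timestamp, level, or category fields, and that the cid=ID token comes first in the fourth field, as in the format line above. The helper name filter_log is hypothetical. The sample log lines in the test are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: filter lines of the form
#   TIMESTAMP|LEVEL|CATEGORY|cid=CORRELATION_ID message
# by level (case-insensitive) and, optionally, by correlation ID.
# Usage: filter_log LOG_FILE LEVEL [CID]
filter_log() {
  local log="$1" level="$2" cid="${3-}"
  awk -F'|' -v level="$level" -v cid="$cid" '
    # Field 2 is the level; skip lines at other levels.
    toupper($2) != toupper(level) { next }
    # Field 4 starts with "cid=ID", followed by a space and the message.
    cid != "" {
      split($4, parts, " ")
      if (parts[1] != "cid=" cid) next
    }
    { print }
  ' "$log"
}

# Example: filter_log /tmp/sk-test.log error
```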
7. Cleanup
# Session-based (preferred)
bash scripts/agentic/session.sh stop default
# Verify the session is actually gone before reporting success
bash scripts/agentic/session.sh status default

# Legacy fd 3 cleanup (single-shell only)
# exec 3>&-
# rm -f "$PIPE"
# kill $APP_PID 2>/dev/null || true
# wait $APP_PID 2>/dev/null || true
Cleanup is mandatory, even after failures or interrupted runs.
- Do not report PASS or FAIL until the session you started has been stopped.
- If you launched Script Kit via session.sh, run session.sh stop NAME and verify the session is no longer alive.
- If you launched Script Kit directly, kill that specific PID and wait for it.
- Do not leave orphan script-kit-gpui processes behind from agentic testing.
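One way to make cleanup survive a failing step is a shell EXIT trap. The sketch below wraps a step in a subshell with such a trap. The cleanup command runs whether the step passes or fails, and bash preserves the step's exit status. The helper name run_with_cleanup is hypothetical. A trap cannot run if the shell is killed with SIGKILL, so still confirm with session.sh status afterwards.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run STEP, then always run CLEANUP COMMAND, even if STEP fails.
# The function returns STEP's exit status.
# Usage: run_with_cleanup "CLEANUP COMMAND" STEP [ARGS...]
run_with_cleanup() {
  local cleanup_cmd="$1"
  shift
  (
    # The EXIT trap fires when this subshell exits, whether the step
    # succeeded or failed; bash keeps the step's exit status.
    trap "$cleanup_cmd" EXIT
    "$@"
  )
}

# Example (requires the repo scripts):
#   run_with_cleanup 'bash scripts/agentic/session.sh stop default' \
#     bun scripts/agentic/index.ts acp-accept --session default --key enter
#   bash scripts/agentic/session.sh status default   # confirm the session is gone
```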
8. Report
- PASS: build succeeded + expected screenshots match + expected log output + cleanup confirmed
- FAIL: describe what went wrong with evidence (screenshot, log line), then still clean up the launched process/session
Timing Guidelines
| Action | Wait strategy |
|---|---|
| App startup | session.sh start readiness wait; fallback 0.5s only if ready=false |
| show window | 0.3s macOS focus-settling delay |
| setFilter | 1s sleep or waitFor stateMatch |
| triggerBuiltin (opens new view) | waitFor the appropriate condition |
| simulateKey (view transition) | 1.5s sleep |
| simulateKey (text input) | 0.1s sleep |
| captureWindow | 1s sleep (file write) |
| ACP context bootstrap | waitFor(acpReady, timeout=8000) |
| ACP picker open | waitFor(acpPickerOpen, timeout=3000) |
| ACP picker accept | waitFor(acpItemAccepted, timeout=3000) |
| ACP response streaming | 10-20s or waitFor(acpStatus) |
Rule: Use waitFor for all ACP state transitions. Only use fixed sleeps
for macOS focus-settling (0.3s) and file I/O (1s screenshot write).
Rule: Do not add a fixed sleep 3 after session.sh start. The session
wrapper is responsible for readiness. Only use the 0.5s fallback when ready=false.
Session Management
Use scripts/agentic/session.sh instead of hand-rolling mkfifo + exec 3> in ad hoc shells.
Why: The exec 3>"$PIPE" pattern ties the pipe to a single shell process. When a coding agent
spawns a new shell (e.g., follow-up verification step), fd 3 does not exist and the session is lost.
The session wrapper uses a background forwarder process so any shell can send commands via
session.sh send.
Rules:
- Always use session.sh start instead of manual mkfifo + exec 3> for new verification flows
- Use session.sh send for fire-and-forget stdin commands like show, triggerBuiltin, setFilter, and captureWindow
- Use session.sh rpc for protocol requests that expect a typed response like getAcpState, getElements, waitFor, batch, and inspectAutomationWindow
- Check session health with session.sh status or session-state.ts before sending commands
- Stop sessions with session.sh stop when done; do not leave orphan processes
- Treat cleanup as part of the test itself: a run is incomplete until the session is stopped and verified dead
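The send/rpc split in these rules amounts to a lookup on the command type. The sketch below encodes it using only the types named in these rules, plus simulateKey, which the Interact examples send through session.sh send. It rejects any other type rather than guessing. The helper name transport_for is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "send" or "rpc" for a command type, per the rules above.
# Unlisted types are rejected instead of guessed.
# Usage: transport_for COMMAND_TYPE
transport_for() {
  case "$1" in
    # Fire-and-forget stdin commands.
    show|triggerBuiltin|setFilter|captureWindow|simulateKey) echo send ;;
    # Requests that expect a typed response.
    getAcpState|getElements|waitFor|batch|inspectAutomationWindow) echo rpc ;;
    *)
      echo "transport_for: unclassified command type '$1'" >&2
      return 1 ;;
  esac
}
```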
Screenshot Assertion (verify-shot.ts)
Use verify-shot.ts for automated screenshot + state verification. It enforces
the correct ACP verification order: state receipt first, screenshot second.
# Basic: capture screenshot with ACP state assertions
bun scripts/agentic/verify-shot.ts --session default \
  --label step-name \
  --acp-status idle \
  --acp-picker-closed \
  --acp-context-ready

# Assert picker is open after typing @
bun scripts/agentic/verify-shot.ts --session default \
  --label picker-open \
  --acp-picker-open

# Assert item was accepted after Enter/Tab
bun scripts/agentic/verify-shot.ts --session default \
  --label item-accepted \
  --acp-picker-closed \
  --acp-item-accepted

# State-only (skip screenshot)
bun scripts/agentic/verify-shot.ts --session default \
  --label quick-check \
  --skip-screenshot \
  --acp-input-contains "@context"

# Screenshot-only (skip state query)
bun scripts/agentic/verify-shot.ts --session default \
  --label visual-check \
  --skip-state
Available assertions:
| Flag | Checks |
|---|---|
| --acp-status STATUS | ACP status equals value (idle, streaming, etc.) |
| --acp-picker-open | Picker overlay is visible |
| --acp-picker-closed | Picker overlay is closed |
| --acp-input-contains STR | Input text contains substring |
| --acp-input-match STR | Input text matches exactly |
| --acp-cursor-at N | Cursor at character index N |
| --acp-item-accepted | A picker item was accepted (lastAcceptedItem non-null) |
| --acp-accepted-label STR | lastAcceptedItem.label equals STR |
| --acp-accepted-trigger STR | lastAcceptedItem.trigger equals STR (@ or /) |
| --acp-accepted-via KEY | Probe confirms acceptance via enter or tab |
| --acp-cursor-after-accepted N | Probe confirms cursor landed at index N after acceptance |
| --acp-context-ready | Context bootstrap complete |
| --acp-no-selection | No text selection active (hasSelection is false) |
| --acp-has-selection | Text selection is active (hasSelection is true) |
| --acp-no-permission | No pending permission (hasPendingPermission is false) |
| --acp-has-permission | Pending permission present (hasPendingPermission is true) |
| --acp-visible-start N | inputLayout.visibleStart equals N (first visible char index) |
| --acp-visible-end N | inputLayout.visibleEnd equals N (last visible char index) |
| --acp-cursor-in-window N | inputLayout.cursorInWindow equals N (cursor position in viewport) |
Proof bundle fields: The receipt includes stable top-level fields for machine consumption:
state (ACP snapshot), probe (test probe snapshot), screenshot (path + capture metadata),
captureTarget (requested vs actual window ID for identity proof),
visionCrops (structured image check entries). These are the canonical fields for automated parsing.
Capture identity threading: Detached ACP screenshots use the inspected
native osWindowId, not the automation window ID. When --target-json is
present, verify-shot.ts auto-lifts inspection.osWindowId into the screenshot
step. An explicit --capture-window-id is only an override and must match the
inspected osWindowId. The receipt exposes captureTarget.requestedWindowId,
captureTarget.actualWindowId, captureRouting, requestedAutomationWindowId,
and inspectionOsWindowId.
Exit codes: 0 = pass, 1 = assertion failure, 2 = infrastructure error.
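The three exit codes separate a wrong ACP outcome from a harness that could not collect proof. Keeping that distinction in the final report makes it clear which one failed. The sketch below maps the codes listed above to report labels. The helper name verify_shot_result and the label strings are hypothetical. Per the cleanup rules, report the label only after the session has been stopped.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a verify-shot.ts exit code to a report label.
# 0 = pass, 1 = assertion failure, 2 = infrastructure error.
verify_shot_result() {
  case "$1" in
    0) echo "PASS" ;;
    1) echo "FAIL: assertion" ;;
    2) echo "FAIL: infrastructure" ;;
    *) echo "FAIL: unexpected exit code $1" ;;
  esac
}

# Example (requires a running session; safe under set -e):
#   status=0
#   bun scripts/agentic/verify-shot.ts --session default --label step-name \
#     --acp-picker-closed || status=$?
#   verify_shot_result "$status"
```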
Canonical input-stability proof
Use visible-text-window assertions to verify single-line input rendering and cursor tracking without a screenshot:
bun scripts/agentic/verify-shot.ts --session default \
  --label input-stability \
  --skip-screenshot \
  --acp-visible-start 12 \
  --acp-visible-end 52 \
  --acp-cursor-in-window 39
This proves the cursor is within the visible window and the viewport bounds are stable, which catches scroll jumps, layout shifts, and cursor-out-of-view regressions.
Strict capture: When ACP assertions are present, verify-shot.ts requires
window.ts quartz capture with frontmost confirmation and the exact inspected
native window ID. If focus drifts, the inspected osWindowId is missing, or the
captured windowId differs from the requested ID, the run fails instead of
silently falling back to a full-screen screenshot.
Rule: The recipe must fail when ACP state contradicts expected picker/caret outcome, even if the screenshot capture itself succeeds. State receipt is the primary proof; screenshot is secondary visual confirmation.
Recipe Orchestrator (index.ts) — Preferred ACP Verification
Always prefer the canonical CLI over ad hoc shell sequences. The orchestrator encodes the correct verification order, focus enforcement, probe resets, and checkpoint strategy so agents do not need to reconstruct these from scratch.
Canonical ACP proof commands
# Full ACP picker accept: choose the key with --key enter|tab
bun scripts/agentic/index.ts acp-accept --session default --key enter
bun scripts/agentic/index.ts acp-accept --session default --key tab --vision

# Target a specific ACP window (detached/popup): resolve exact identity first
RESOLVED="$(bun scripts/agentic/automation-window.ts resolve --session default --kind acpDetached --index 0)"
TARGET="$(printf '%s' "$RESOLVED" | jq -c '.targetJson')"
SURFACE_ID="$(printf '%s' "$RESOLVED" | jq -r '.surfaceId')"
bun scripts/agentic/index.ts acp-accept --session default --key enter \
  --target-json "$TARGET" --surface "$SURFACE_ID" --vision
Target threading (non-negotiable for multi-window ACP)
When verifying a detached or popup ACP window, resolve one target once and reuse it for every RPC and native input step in the entire run.
Canonical rule:
- Discover the surface (e.g., bun scripts/agentic/window.ts list).
- Pick one --target-json object (e.g., {"type":"kind","kind":"acpDetached","index":0}).
- Pass that same target to every ACP RPC: getAcpState, getAcpTestProbe, resetAcpTestProbe, waitFor, and batch.
- Pass the matching --surface value to native input so focus and proof stay on the same window.
- Never mix focused-window ACP RPCs with surface-targeted native input in the same verification run. This causes cross-window false proof where you drive one ACP surface and verify another.
The --target-json flag threads through index.ts → verify-shot.ts → every
RPC command, and the --surface flag threads through index.ts → macos-input.ts
→ window.ts for focus enforcement.
When --target-json is omitted, RPCs default to the main ACP view (existing behavior).
What acp-accept guarantees:
- Resets ACP test probe before native interaction (no stale accepted items)
- Uses macos-input.ts --ensure-focus for native typing and acceptance
- Uses state-only checks for ACP-ready and picker-open (no intermediate screenshots)
- Waits for acpAcceptedViaKey (key-specific proof, not generic acpItemAccepted)
- Keeps exactly one final screenshot as visual proof
- Emits vision crops only when --vision is requested
- When --vision is used, surfaces the full proof bundle (with state, probe, screenshot, visionCrops) as proofBundle in the recipe receipt
Other recipes
# Check all prerequisites
bun scripts/agentic/index.ts preflight --session default

# Open ACP and verify ready state (state-only, no screenshot)
bun scripts/agentic/index.ts acp-open --session default

# Compatibility aliases (same as --key enter / --key tab)
bun scripts/agentic/index.ts acp-enter-accept --session default
bun scripts/agentic/index.ts acp-tab-accept --session default
State-only vs screenshot checkpoints
| Checkpoint | Screenshot? | Probe? | Why |
|---|---|---|---|
| ACP ready | No | No | waitFor(acpReady) is sufficient proof; screenshot is waste |
| Picker open | No | No | waitFor(acpPickerOpen) is sufficient proof |
| Final accepted | Yes | Yes | The only checkpoint that needs visual + probe evidence |
Rule: Intermediate checkpoints use state-only verification (--skip-screenshot --skip-probe).
Only the final acceptance step captures a screenshot and queries the probe.
Receipt shape
Each recipe returns a machine-readable JSON receipt:
{
"schemaVersion": 1,
"recipe": "acp-enter-accept",
"status": "pass",
"steps": [
{ "name": "acp-open", "status": "pass" },
{ "name": "reset-probe", "status": "pass" },
{ "name": "type-at-trigger", "status": "pass" },
{ "name": "wait-accepted-via-key", "status": "pass" },
{ "name": "verify-accepted", "status": "pass" }
]
}
When --vision is used, a proofBundle field is added containing the verify-shot receipt
with state, probe, screenshot, and visionCrops for direct machine consumption.
The wrapper does not replace the lower-level commands — use session.sh,
macos-input.ts, window.ts, and verify-shot.ts directly when you need
finer control.
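When a recipe fails, the step names in the receipt show where it stopped. The sketch below lists every step whose status is not "pass". It works on both pretty-printed and compact JSON. It assumes each step object has "name" immediately followed by "status", as in the example receipt above. It also assumes that failing steps carry some status other than "pass"; the "fail" value in the test is illustrative. Since the rest of this workflow already depends on jq, jq is the sturdier choice for anything beyond a quick check. The helper name failed_steps is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a recipe receipt on stdin and print the name of
# every step whose status is not "pass".
# Assumes each step object lists "name" immediately before "status".
failed_steps() {
  # grep -o prints each "name": ..., "status": ... pair on its own line,
  # even when the whole receipt is on one line.
  grep -oE '"name": *"[^"]*", *"status": *"[^"]*"' |
    # Splitting on double quotes puts the name in field 4 and the status in field 8.
    awk -F'"' '$8 != "pass" { print $4 }'
}

# Example (requires the repo scripts):
#   bun scripts/agentic/index.ts acp-accept --session default --key enter | failed_steps
```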
ACP Golden Path (Critical)
The mandatory verification flow for any ACP interaction testing.
Prefer the canonical CLI (bun scripts/agentic/index.ts acp-accept) over
reconstructing the manual steps below.
Canonical (one command, fully non-interactive)
bash scripts/agentic/session.sh start default
bun scripts/agentic/index.ts acp-accept --session default --key enter --vision
# The recipe returns a machine-readable JSON receipt with proofBundle.
# Parse proofBundle.state, proofBundle.probe, proofBundle.screenshot, proofBundle.visionCrops
# to verify ACP behavior programmatically, then read the written PNG for final visual confirmation.
bash scripts/agentic/session.sh stop default
Exact detached ACP proof (preferred)
The scenario recipe resolves one exact detached ACP target once, reuses
the exact targetJson for every subsequent step, and emits a structured
proof bundle recording windowId, surfaceId, and ordered step receipts.
bash scripts/agentic/session.sh start default
bun scripts/agentic/index.ts scenario \
  --session default \
  --scenario detached-acp-exact-id \
  --index 0
bash scripts/agentic/session.sh stop default
The proof bundle is the regression substrate — every step records the exact
target used, the full request/response pair, and a timestamp. Exit code 0
means all steps succeeded; exit code 1 means some steps produced warnings.
Canonical with target threading (detached/popup ACP)
For finer-grained control (e.g., picker acceptance flows), resolve one exact
ACP target once and reuse both the target and surfaceId for the full run.
Do not use loose family-level --surface acp — use the exact surfaceId
from the resolver.
bash scripts/agentic/session.sh start default

# Resolve exact target and surface identity once
RESOLVED="$(bun scripts/agentic/automation-window.ts resolve --session default --kind acpDetached --index 0)"
TARGET="$(printf '%s' "$RESOLVED" | jq -c '.targetJson')"
SURFACE_ID="$(printf '%s' "$RESOLVED" | jq -r '.surfaceId')"

bun scripts/agentic/index.ts acp-accept --session default --key enter \
  --target-json "$TARGET" --surface "$SURFACE_ID" --vision

# Look up the native window ID of the resolved automation window
INSPECTED="$(bun scripts/agentic/automation-window.ts inspect --session default --id "$(printf '%s' "$RESOLVED" | jq -r '.automationWindowId')")"
WINDOW_ID="$(printf '%s' "$INSPECTED" | jq -r '.osWindowId')"

bun scripts/agentic/verify-shot.ts --session default --label detached-proof \
  --target-json "$TARGET" --capture-window-id "$WINDOW_ID"
# Confirm proofBundle.state.resolvedTarget.windowKind == "acpDetached"
# Confirm captureTarget.requestedWindowId == captureTarget.actualWindowId

bash scripts/agentic/session.sh stop default
The --vision flag makes the recipe return a proofBundle containing all
machine-readable proof. The golden path is complete when the exit code is 0
and the proofBundle.state and proofBundle.probe fields confirm the expected
ACP state. Screenshot files are still written for archival but are not the
primary verification mechanism.
Manual (when you need finer control)
1. session start → session alive
2. show → window visible
3. triggerBuiltin tab-ai → ACP opens
4. waitFor(acpReady, timeout=8000) → context bootstrapped (deterministic)
5. focus window → frontmost confirmed
6. native type @ (macos-input.ts --ensure-focus) → open picker
7. waitFor(acpPickerOpen, timeout=3000) → picker open (deterministic)
8. native Enter or Tab (macos-input.ts --ensure-focus) → accept picker item
9. waitFor(acpAcceptedViaKey, timeout=3000) → key-specific acceptance (deterministic)
10. verify-shot.ts with --acp-accepted-via → state + probe + screenshot proof
Key tools in the golden path:
| Tool | Role |
|---|---|
| session.sh | Cross-shell session management, RPC, lifecycle |
| macos-input.ts | Native macOS keyboard/mouse with --ensure-focus |
| window.ts | Window discovery, focus, activation, quartz capture |
| verify-shot.ts | State + probe + screenshot bundle with strict capture |
| automation-window.ts | Exact ACP target-to-surface resolver |
| scenario.ts | Replayable proof-bundle scenarios for cross-window regression |
| index.ts | Orchestrator that composes all of the above correctly |
waitFor replaces fixed sleeps. Each waitFor polls at 25ms intervals
and returns a waitForResult receipt with success, elapsed, and an
optional trace (included automatically on failure when trace: "onFailure").
State receipt before screenshot is non-negotiable. If the state says the picker is still open but the screenshot looks fine, the test must FAIL.
Any remaining sleeps in the recipes are brief macOS focus-settling delays (~300ms) with explicit comments. They are not proof of ACP state.
Verification Recipes
See references/recipes.md for named verification patterns.
Key Gotchas
- simulateKey does NOT go through GPUI's intercept_keystrokes(). Use triggerBuiltin for ACP Chat entry, not simulateKey with Tab.
- AcpChatView accepts single-char simulateKey for typing, enter for submit, and w+cmd for close.
- The app window auto-hides when focus is lost. If captures fail with "Window not found", the window was dismissed.
- captureWindow filters out windows under 100x100 (tray icons).
- To test the setup card, unset the API keys first, e.g. unset ANTHROPIC_API_KEY.
- For ACP picker testing, use native macOS input (macos-input.ts --ensure-focus) instead of simulateKey: synthetic keys bypass GPUI's native key interception and do not faithfully exercise picker selection behavior.
- Use getAcpState to verify picker acceptance, cursor landing, and input content; do not rely solely on screenshots for ACP state verification.
- Use waitFor commands via session.sh rpc for deterministic ACP state transitions; do not use fixed sleeps as proof of ACP state.