Genie Space Optimization
Interactive, agent-driven optimization loop that improves Genie Space quality through systematic benchmark testing, evaluation, and iterative control lever adjustments.
When to Use This Skill
- •Optimizing Genie Space accuracy or repeatability scores
- •Writing benchmark questions for a new or existing Genie Space
- •Debugging incorrect SQL generation from Genie
- •Improving asset routing (TVF vs Metric View selection)
- •Running automated optimization sessions against live Genie Spaces
- •Updating Genie Space configurations via API after testing
- •Managing Genie export JSON files with dual persistence
Hand Off to Related Skills
| User Says / Task Involves | Load Instead |
|---|---|
| "create Genie Space from scratch" | genie-space-patterns |
| "deploy Genie Space via API" | genie-space-export-import-api |
| "create metric view" | metric-views-patterns |
| "create TVF" | databricks-table-valued-functions |
This skill covers the optimization loop: test → evaluate → adjust → re-test. Other skills cover creation and deployment of individual assets.
Quality Dimensions
| Dimension | Target | Definition |
|---|---|---|
| Accuracy | 95%+ | Does Genie return correct SQL/answers for benchmark questions? |
| Repeatability | 90%+ | Does the same question produce the same SQL across multiple runs? |
Core Optimization Loop
┌──────────────────────────────────────────────────────┐ │ 1. WRITE BENCHMARKS │ │ Generate domain-relevant benchmark questions │ │ with expected SQL answers │ ├──────────────────────────────────────────────────────┤ │ 2. QUERY GENIE │ │ Run benchmarks via Conversation API │ │ (respect 12s rate limit between queries) │ ├──────────────────────────────────────────────────────┤ │ 3. EVALUATE │ │ Compare generated SQL to expected SQL │ │ Measure accuracy + repeatability │ ├──────────────────────────────────────────────────────┤ │ 4. DIAGNOSE │ │ Identify failure patterns: │ │ - Wrong asset selected (TVF vs MV vs Table) │ │ - Wrong columns/aggregations │ │ - Missing routing context │ ├──────────────────────────────────────────────────────┤ │ 5. APPLY CONTROL LEVERS │ │ Fix issues using 6 levers (priority order) │ │ Apply via BOTH API + repository (dual persist) │ ├──────────────────────────────────────────────────────┤ │ 6. RE-TEST │ │ Wait 30s → re-run failing questions │ │ Measure improvement │ │ Loop back to step 4 if targets not met │ ├──────────────────────────────────────────────────────┤ │ 7. DOCUMENT │ │ Generate optimization report with metrics │ └──────────────────────────────────────────────────────┘
Step 1: Write Benchmark Questions
Generate benchmark questions that cover the domain's key use cases. Each question must have:
- •Natural language question (what a user would ask)
- •Expected SQL (tested, working SQL)
- •Expected asset (MV, TVF, or TABLE)
- •Category (aggregation, list, time-series, comparison, etc.)
Asset Routing Rules for Benchmarks
| Question Type | Expected Asset | Example |
|---|---|---|
| Total/average/overall | Metric View | "What is total spend?" |
| Top N / list / show me | TVF | "Show top 10 costliest jobs" |
| Date-range analysis | TVF | "Daily costs for last month" |
| KPI / single value | Metric View | "Average job duration" |
| Drill-down / detail | TVF | "Details for workspace X" |
See Benchmark Patterns for complete question-writing guide.
Quick Benchmark Template
benchmarks:
- id: "cost_001"
question: "What is total spend this month?"
expected_sql: "SELECT MEASURE(total_cost) FROM ${catalog}.${schema}.mv_cost_analytics WHERE ..."
expected_asset: "MV"
category: "aggregation"
- id: "cost_002"
question: "Show top 10 costliest workspaces"
expected_sql: "SELECT * FROM ${catalog}.${schema}.get_top_cost_contributors('7', 'workspace')"
expected_asset: "TVF"
category: "list"
See Golden Queries Template for the full YAML format.
Step 2: Query Genie via API
Core Query Function
import time
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
def run_genie_query(space_id: str, question: str, max_wait: int = 120) -> dict:
"""Execute a query against Genie and return SQL + status."""
try:
resp = w.genie.start_conversation(space_id=space_id, content=question)
conversation_id = resp.conversation_id
message_id = resp.message_id
poll_interval = 3
start = time.time()
while time.time() - start < max_wait:
time.sleep(poll_interval)
msg = w.genie.get_message(
space_id=space_id,
conversation_id=conversation_id,
message_id=message_id
)
status = str(msg.status) if hasattr(msg, 'status') else 'UNKNOWN'
if any(s in status for s in ['COMPLETED', 'FAILED', 'CANCELLED']):
break
poll_interval = min(poll_interval + 1, 10)
sql = None
if hasattr(msg, 'attachments') and msg.attachments:
for att in msg.attachments:
if hasattr(att, 'query') and att.query:
sql = att.query.query if hasattr(att.query, 'query') else str(att.query)
return {"status": status, "sql": sql}
except Exception as e:
return {"status": "ERROR", "sql": None, "error": str(e)}
Rate Limiting (CRITICAL)
Databricks enforces 5 POST requests/minute/workspace. Always wait 12+ seconds between queries:
time.sleep(12) # MANDATORY between every Genie API call
Step 3: Evaluate Results
Accuracy Evaluation
Compare generated SQL against expected SQL:
def evaluate_accuracy(result: dict, expected: dict) -> dict:
"""Evaluate if Genie returned correct SQL."""
generated_sql = (result.get("sql") or "").lower().strip()
expected_sql = expected.get("expected_sql", "").lower().strip()
# Check asset routing
expected_asset = expected.get("expected_asset", "").upper()
uses_mv = "mv_" in generated_sql or "measure(" in generated_sql
uses_tvf = "get_" in generated_sql
actual_asset = "MV" if uses_mv else ("TVF" if uses_tvf else "TABLE")
return {
"question": expected["question"],
"sql_generated": result.get("status") == "COMPLETED" and bool(generated_sql),
"correct_asset": actual_asset == expected_asset,
"actual_asset": actual_asset,
"expected_asset": expected_asset,
}
Repeatability Testing
Run each question 3 times and measure consistency:
import hashlib
from collections import Counter
def test_repeatability(space_id: str, question: str, iterations: int = 3) -> dict:
"""Test SQL consistency across multiple runs."""
hashes = []
for i in range(iterations):
sql = run_genie_query(space_id, question).get("sql", "")
sql_hash = hashlib.md5(sql.lower().encode()).hexdigest()[:8] if sql else "NONE"
hashes.append(sql_hash)
time.sleep(12)
most_common_count = Counter(hashes).most_common(1)[0][1]
return {
"question": question,
"repeatability_pct": (most_common_count / len(hashes)) * 100,
"unique_variants": len(set(hashes)),
}
Score Thresholds
| Score | Classification | Action Required |
|---|---|---|
| 100% | Identical | No action needed |
| 70-99% | Minor variance | Usually acceptable |
| 50-69% | Significant variance | Update instructions or add sample query |
| <50% | Critical variance | Must fix with explicit routing rules |
Step 4: Diagnose Failures
Common failure patterns and their root causes:
| Symptom | Root Cause | Lever to Fix |
|---|---|---|
| Wrong asset selected (MV vs TVF) | Ambiguous routing | Genie Instructions (lever 6) |
| Wrong columns in query | Poor column descriptions | UC Table/Column comments (lever 1) |
| Missing aggregation | Metric view not discoverable | Metric View metadata (lever 2) |
| Wrong parameters to TVF | TVF comment unclear | TVF COMMENT (lever 3) |
| Inconsistent SQL across runs | Ambiguous question mapping | Genie Instructions + sample query |
| Query returns no data | Wrong date range defaults | Genie Instructions (time defaults) |
Step 5: Apply Control Levers (Priority Order)
Six levers to fix Genie behavior, applied in priority order:
| Priority | Lever | Durability | Char Limit | When to Use |
|---|---|---|---|---|
| 1 | UC Tables & Columns | Highest | Unlimited | Column misunderstanding, wrong tables |
| 2 | Metric Views | High | N/A | Metric/aggregation questions wrong |
| 3 | TVFs (Functions) | High | N/A | Complex calculation errors |
| 4 | Monitoring Tables | Medium | Unlimited | Time-series queries wrong |
| 5 | ML Model Tables | Medium | Unlimited | Prediction queries wrong |
| 6 | Genie Instructions | Lowest | ~4000 chars | Asset routing, last resort |
Why this order? UC metadata survives Genie Space rebuilds. Genie Instructions have a ~4000 char limit and are the least durable.
Dual Persistence (CRITICAL)
Every optimization MUST be applied in TWO places:
| Step | Action | Why |
|---|---|---|
| 1. Direct Update | API call or ALTER TABLE | Effective immediately |
| 2. Repository Update | Update source files | Persists across deployments |
See Control Levers Reference for detailed per-lever update patterns.
Step 6: Re-Test and Iterate
After applying changes:
- •Wait 30 seconds for propagation
- •Re-run only failing/variable questions (not full suite)
- •Measure improvement (compare before/after)
- •Loop back to Step 4 if targets not met
# Re-test after optimization
time.sleep(30) # Wait for propagation
for question in failing_questions:
result = run_genie_query(space_id, question["question"])
evaluation = evaluate_accuracy(result, question)
print(f" {'PASS' if evaluation['correct_asset'] else 'FAIL'}: {question['question']}")
time.sleep(12)
Step 7: Document Results
Generate an optimization report using the Report Template.
Key sections:
- •Executive summary with before/after metrics
- •Per-question test results
- •Optimizations applied (which levers, what changed)
- •Dual persistence confirmation
- •Files updated
Scripts
genie_optimizer.py
Complete optimization loop: load benchmarks → query Genie → evaluate → report.
# Run from Databricks notebook or local with SDK configured python scripts/genie_optimizer.py --space-id <ID> --benchmarks golden-queries.yaml
repeatability_tester.py
Standalone repeatability testing with configurable iterations.
python scripts/repeatability_tester.py --space-id <ID> --iterations 3
Reference Files
- •Control Levers: Detailed per-lever update patterns, SQL commands, API calls, and repository file mappings for dual persistence
- •Optimization Workflow: Extended step-by-step workflow with decision trees, failure pattern analysis, and advanced techniques
- •Benchmark Patterns: Complete guide to writing effective benchmark questions including category coverage, SQL templates, and domain-specific patterns
Validation Checklist
Before Optimization
- • Correct Space ID identified (see domain reference table)
- • Benchmark questions loaded with expected SQL
- • Current Genie export config backed up
- • Rate limiting in place (12s between queries)
During Optimization
- • Accuracy tests run for all benchmark questions
- • Repeatability tests run (2-3 iterations per question)
- • Root causes identified for each failure
- • Control levers applied in priority order (1-6)
After Optimization
- • Direct update applied (API/ALTER TABLE)
- • Repository file updated (dual persistence)
- • Template variables preserved (
${catalog},${gold_schema}) - • All arrays sorted in Genie export JSON
- • Wait 30s, then re-test failing questions
- • Improvement measured and documented
- • Optimization report generated
Common Mistakes
Forgetting Array Sorting in API Updates
# WRONG: Unsorted → "data_sources.tables must be sorted"
payload = {"serialized_space": json.dumps(config)}
# CORRECT: Sort all arrays first
config = sort_genie_config(config)
payload = {"serialized_space": json.dumps(config)}
Skipping Dual Persistence
# WRONG: API-only update (lost on next deployment)
subprocess.run(["databricks", "api", "patch", ...])
# CORRECT: API + repository
subprocess.run(["databricks", "api", "patch", ...]) # Direct
with open("src/genie/config.json", "w") as f: json.dump(...) # Repository
Ignoring Rate Limits
# WRONG: Rapid-fire → queries fail silently
for q in questions: run_genie_query(space_id, q)
# CORRECT: 12s between each query
for q in questions:
run_genie_query(space_id, q)
time.sleep(12)
Putting Everything in Genie Instructions
Instructions have a ~4000 char limit. Fix column descriptions first (lever 1), then metric views (lever 2), then TVFs (lever 3). Use instructions only for routing rules.
Related Skills
- •
genie-space-patterns- Initial Genie Space creation and setup - •
genie-space-export-import-api- Programmatic deployment via REST API - •
metric-views-patterns- Metric view creation and validation - •
databricks-table-valued-functions- TVF creation for Genie
References
Official Databricks Documentation
Version History
- •v1.0 (Feb 2026) - Initial skill converted from
34-genie-space-optimization.mdc- •Interactive optimization loop (write → query → evaluate → adjust → re-test)
- •Six control levers with priority ordering
- •Dual persistence requirement for all changes
- •Repeatability testing methodology
- •Benchmark question writing patterns
- •Domain-specific Space ID reference
- •Rate limiting enforcement (12s between queries)
- •Benchmarked results: Quality 100%, Reliability 80%, Security 67%, Performance 47%
- •Key discovery: TVF-first routing improves repeatability