AgentSkillsCN

databricks-table-valued-functions

全面指导如何规划、创建、部署并验证适用于 Genie Space 自然语言查询的 Databricks 表值函数(TVF)。适用于为 Genie Spaces 创建 TVF、根据业务需求梳理 TVF 的各项需求、排查 TVF 编译错误,或确保 TVF 与 Genie 的兼容性。方案包含需求收集模板、Schema 校验模式、SQL 要求(STRING 参数、参数顺序、LIMIT 替代方案)、v3.0 项目符号注释格式、空值安全处理、SCD2 处理策略、避免笛卡尔积问题,以及 5 个完整且可适配各领域的示例、Asset Bundle 部署模式,还有部署后的验证查询。

SKILL.md
--- frontmatter
name: databricks-table-valued-functions
description: End-to-end guide for planning, creating, deploying, and validating Table-Valued Functions (TVFs) in Databricks optimized for Genie Space natural language queries. Use when creating TVFs for Genie Spaces, planning TVF requirements from business questions, troubleshooting TVF compilation errors, or ensuring Genie compatibility. Includes requirements gathering templates, schema validation patterns, SQL requirements (STRING parameters, parameter ordering, LIMIT workarounds), v3.0 bullet-point comment format, null safety, SCD2 handling, cartesian product prevention, 5 complete domain-adaptable examples, Asset Bundle deployment patterns, and post-deployment validation queries.
metadata:
  author: prashanth subrahmanyam
  version: "2.0"
  domain: semantic-layer
  role: worker
  pipeline_stage: 6
  pipeline_stage_name: semantic-layer
  called_by:
    - semantic-layer-setup
  standalone: true
  last_verified: "2026-02-07"
  volatility: medium
  upstream_sources: []  # Internal TVF patterns

Databricks Table-Valued Functions (TVFs) for Genie

Overview

Table-Valued Functions (TVFs) provide pre-built, parameterized queries that Genie can invoke for natural language access to your data. This skill covers the entire TVF lifecycle: planning, creation, deployment, and validation.

Why TVFs Matter:

  • Pre-built queries ensure consistent business logic across all consumers
  • LLM-friendly metadata helps Genie understand when to use each function
  • Proper SQL patterns prevent compile-time errors and data inflation bugs
  • Parameterized access gives users instant answers to common questions

Key Capabilities:

  • Plan TVFs from business questions using requirements gathering templates
  • Create TVFs with Genie-compatible parameter types (STRING for dates)
  • Validate schemas before writing SQL to prevent 100% of compilation errors
  • Structure comments (v3.0 bullet-point format) for optimal Genie query matching
  • Handle SCD2 dimensions with proper is_current filtering
  • Prevent cartesian products in aggregation CTEs
  • Use proper parameter ordering and LIMIT workarounds
  • Deploy TVFs via Asset Bundle jobs
  • Validate TVFs with post-deployment queries

What TVFs Provide:

  • ✅ Parameterized queries with input parameters
  • ✅ LLM-friendly metadata for Genie understanding
  • ✅ Genie-compatible types (STRING for dates, not DATE)
  • ✅ Top N patterns using WHERE rank <= param (not LIMIT param)
  • ✅ Null safety with NULLIF for all divisions
  • ❌ No DATE parameters (Genie doesn't support DATE type)
  • ❌ No LIMIT with parameters (use ROW_NUMBER + WHERE instead)

When to Use This Skill

Use this skill when:

  • Planning which TVFs to create from business questions
  • Creating TVFs for Genie Spaces
  • Troubleshooting TVF compilation errors
  • Ensuring Genie compatibility
  • Validating schemas before writing SQL
  • Deploying TVFs via Asset Bundles
  • Preventing common SQL errors (parameter types, LIMIT clauses, cartesian products)

Quick Start (2-3 hours)

Goal: Create 10-15 pre-built, parameterized SQL queries for common business questions.

What You'll Create:

  1. table_valued_functions.sql — SQL file with 10-15 TVF definitions
  2. tvf_job.yml — Asset Bundle job (SQL task) for deployment

Fast Track:

sql
CREATE OR REPLACE FUNCTION get_top_stores_by_revenue(
    start_date STRING COMMENT 'Start date (format: YYYY-MM-DD)',
    end_date STRING COMMENT 'End date (format: YYYY-MM-DD)',
    top_n INT DEFAULT 10 COMMENT 'Number of top stores to return'
)
RETURNS TABLE(
    rank INT COMMENT 'Store rank by revenue',
    store_name STRING COMMENT 'Store display name',
    total_revenue DECIMAL(18,2) COMMENT 'Total revenue for period'
)
COMMENT '
• PURPOSE: Returns top N stores ranked by revenue for a date range
• BEST FOR: "What are the top 10 stores by revenue?" | "Show me best performing stores"
• RETURNS: Individual store rows (rank, store_name, total_revenue)
• PARAMS: start_date, end_date, top_n (default: 10)
• SYNTAX: SELECT * FROM get_top_stores_by_revenue(''2024-01-01'', ''2024-12-31'', 10)
'
RETURN ...;

Common Business Naming Patterns:

PatternTemplateExample
Top Nget_top_{entity}_by_{metric}(start_date, end_date, top_n)get_top_stores_by_revenue(...)
Trendingget_{metric}_trend(start_date, end_date)get_daily_sales_trend(...)
Comparisonget_{metric}_by_{dimension}(start_date, end_date)get_sales_by_state(...)
Performanceget_{entity}_performance(entity_id, start_date, end_date)get_store_performance(...)

Critical SQL Rules:

  • STRING for date parameters (never DATE)
  • Required parameters first, DEFAULT parameters last
  • ROW_NUMBER + WHERE for Top N (never LIMIT with parameter)

Output: 10-15 TVFs callable by Genie and queryable via SQL

See references/tvf-planning-guide.md for question categorization and domain examples. See references/tvf-examples.md for 5 complete, production-ready TVF implementations. See assets/templates/tvf-requirements-template.md to plan your TVFs before coding.

Critical Rules

⚠️ CRITICAL: Schema Validation BEFORE Writing SQL

RULE #0: Always consult YAML schema definitions before writing any TVF SQL

100% of SQL compilation errors are caused by not consulting YAML schemas first.

Pre-Development Checklist:

  1. Read YAML schema files (5 minutes)
  2. Create SCHEMA_MAPPING.md (2 minutes)
  3. Write TVF SQL using documented schema
  4. Run validation script (30 sec)
  5. Deploy

ROI: 71% time reduction (45 min → 13 min)
First-Time Success Rate: 0% → 95%+

See references/tvf-patterns.md for detailed schema validation workflow.

⚠️ Issue 1: Parameter Types for Genie Compatibility

RULE: Use STRING for date parameters, not DATE

Genie Spaces do not support DATE type parameters. Always use STRING with explicit format documentation.

❌ DON'T:

sql
CREATE FUNCTION get_sales_by_date_range(
  start_date DATE COMMENT 'Start date',
  end_date DATE COMMENT 'End date'
)

✅ DO:

sql
CREATE FUNCTION get_sales_by_date_range(
  start_date STRING COMMENT 'Start date (format: YYYY-MM-DD)',
  end_date STRING COMMENT 'End date (format: YYYY-MM-DD)'
)
...
WHERE transaction_date BETWEEN CAST(start_date AS DATE) AND CAST(end_date AS DATE)

⚠️ Issue 2: Parameter Ordering with DEFAULT Values

RULE: Parameters with DEFAULT must come AFTER parameters without DEFAULT

❌ DON'T:

sql
CREATE FUNCTION get_top_stores(
  top_n INT DEFAULT 10,          -- ❌ DEFAULT parameter first
  start_date STRING,              -- ❌ Required parameter after DEFAULT
  end_date STRING
)

✅ DO:

sql
CREATE FUNCTION get_top_stores(
  start_date STRING,              -- ✅ Required parameter first
  end_date STRING,                -- ✅ Required parameter
  top_n INT DEFAULT 10            -- ✅ Optional parameter last
)

⚠️ Issue 3: LIMIT Clauses Cannot Use Parameters

RULE: Use WHERE rank <= parameter instead of LIMIT parameter

LIMIT clauses require compile-time constants. Use WHERE with ROW_NUMBER() instead.

❌ DON'T:

sql
SELECT * FROM store_metrics
ORDER BY total_revenue DESC
LIMIT top_n;  -- ❌ Cannot use parameter here

✅ DO:

sql
WITH ranked_stores AS (
  SELECT ...,
    ROW_NUMBER() OVER (ORDER BY total_revenue DESC) as rank
  FROM store_metrics
)
SELECT * FROM ranked_stores
WHERE rank <= top_n  -- ✅ Can use parameter in WHERE
ORDER BY rank;

⚠️ CRITICAL: Cartesian Product Bug in Aggregation CTEs

Never re-join a table that's already been aggregated in a CTE.

❌ BUGGY PATTERN:

sql
WITH period_data AS (
  SELECT SUM(revenue) as total_revenue
  FROM fact_table
  GROUP BY period
),
final AS (
  SELECT SUM(pd.total_revenue), SUM(ft.other_metric)  -- 🔥 CARTESIAN!
  FROM period_data pd
  LEFT JOIN fact_table ft ON ...  -- ❌ Re-joining source!
)

✅ CORRECT PATTERN:

sql
SELECT 
  period,
  SUM(revenue) as total_revenue,
  SUM(other_metric) as other_metric
FROM fact_table
GROUP BY period;  -- ✅ Single aggregation pass

See references/tvf-patterns.md for detailed cartesian product prevention patterns.

Quick Reference

Standardized TVF Comment Format (v3.0)

Use bullet-point format for ALL TVF comments:

sql
COMMENT '
• PURPOSE: [One-line description of what the TVF does]
• BEST FOR: [Example questions separated by |]
• NOT FOR: [What to avoid - redirect to correct TVF] (optional)
• RETURNS: [PRE-AGGREGATED rows or Individual rows] (exact column list)
• PARAMS: [Parameter names with defaults]
• SYNTAX: SELECT * FROM tvf_name(''param1'', ''param2'')
• NOTE: [Important caveats - DO NOT wrap in TABLE(), etc.] (optional)
'

Complete TVF Pattern

sql
CREATE OR REPLACE FUNCTION get_top_stores_by_revenue(
  -- Required parameters first (no DEFAULT)
  start_date STRING COMMENT 'Start date (format: YYYY-MM-DD)',
  end_date STRING COMMENT 'End date (format: YYYY-MM-DD)',
  -- Optional parameters last (with DEFAULT)
  top_n INT DEFAULT 10 COMMENT 'Number of top stores to return'
)
RETURNS TABLE(
  rank INT COMMENT 'Store rank by revenue',
  store_number STRING COMMENT 'Store identifier',
  store_name STRING COMMENT 'Store name',
  total_revenue DECIMAL(18,2) COMMENT 'Total revenue for period',
  total_units BIGINT COMMENT 'Total units sold'
)
COMMENT '
• PURPOSE: Returns the top N stores ranked by revenue for a date range
• BEST FOR: "What are the top 10 stores by revenue?" | "Show me best performing stores"
• RETURNS: Individual store rows (rank, store_number, store_name, total_revenue, total_units)
• PARAMS: start_date, end_date, top_n (default: 10)
• SYNTAX: SELECT * FROM get_top_stores_by_revenue(''2024-01-01'', ''2024-12-31'', 10)
• NOTE: Returns user_id for individual store analysis | Sorted by total_revenue DESC
'
RETURN
  WITH store_metrics AS (
    SELECT 
      store_number,
      store_name,
      SUM(net_revenue) as total_revenue,
      SUM(net_units) as total_units
    FROM fact_sales_daily
    WHERE transaction_date BETWEEN CAST(start_date AS DATE) AND CAST(end_date AS DATE)
    GROUP BY store_number, store_name
  ),
  ranked_stores AS (
    SELECT 
      ROW_NUMBER() OVER (ORDER BY total_revenue DESC) as rank,
      store_number,
      store_name,
      total_revenue,
      total_units
    FROM store_metrics
  )
  SELECT * FROM ranked_stores
  WHERE rank <= top_n  -- ✅ Use WHERE instead of LIMIT
  ORDER BY rank;

Core Patterns

Null Safety

Always use NULLIF() for division to prevent divide-by-zero errors:

sql
-- ✅ DO: Null-safe division
total_revenue / NULLIF(transaction_count, 0) as avg_transaction_value

SCD Type 2 Dimension Handling

Always filter for current records when joining SCD2 dimensions:

sql
-- ✅ Correct: Filter for current version
LEFT JOIN dim_store ds 
  ON fsd.store_number = ds.store_number 
  AND ds.is_current = true

Aggregate vs Individual Row TVFs

Aggregate TVF (Returns Pre-Aggregated Rows):

  • Returns fixed number of rows (e.g., 5 segment rows)
  • Data is PRE-AGGREGATED - no GROUP BY needed on top
  • Do NOT use in JOINs (no user_id or other keys to join on)

Individual Row TVF (Returns Detail Rows):

  • Returns variable number of rows based on data
  • Each row represents one entity (customer, property, host)
  • CAN be used in JOINs (has identifier columns)

See references/genie-integration.md for detailed examples.

TVF Creation Checklist

SQL Compliance

  • All date parameters are STRING type (not DATE)
  • Required parameters come before optional parameters
  • No parameters used in LIMIT clauses (use WHERE rank <= param)
  • All divisions use NULLIF to prevent divide-by-zero
  • SCD2 joins include is_current = true filter
  • No cartesian products: CTEs don't re-join tables already aggregated
  • Single aggregation pass: Each source table read and aggregated only once

Genie Optimization (Standardized Comment Format)

  • Function COMMENT uses bullet-point format (• PURPOSE, • BEST FOR, etc.)
  • PURPOSE: One-line description of what TVF does
  • BEST FOR: 2+ example questions (pipe-separated)
  • NOT FOR / PREFERRED OVER: Redirect to correct asset when applicable
  • RETURNS: Specifies PRE-AGGREGATED or Individual rows + exact column list
  • PARAMS: Parameter names with defaults
  • SYNTAX: Exact copyable example with proper date format
  • NOTE: Caveats (DO NOT wrap in TABLE(), DO NOT add GROUP BY, etc.)
  • All parameters have descriptive COMMENT with format
  • All returned columns have COMMENT
  • Professional language (no "metric view is broken" phrases)

Testing

  • Function compiles without errors
  • Function executes with valid parameters
  • Function handles edge cases (empty results, null values)
  • Function tested in Genie Space (if applicable)
  • Results validated against metric view (ratio ≈ 1.0, not 254x)

Implementation Workflow

Phase 1: Planning (30 min)

  • Fill out requirements template (assets/templates/tvf-requirements-template.md)
  • List 10-15 common business questions from stakeholders
  • Categorize questions (revenue, product, entity, trend)
  • Map questions to TVF names and parameters using naming patterns
  • Identify required vs optional parameters

See references/tvf-planning-guide.md for question categories and domain examples.

Phase 2: SQL Development (1-2 hours)

  • Consult YAML schemas FIRST (Rule #0 — see references/tvf-patterns.md)
  • Create table_valued_functions.sql following file organization pattern
  • Implement each TVF following the template (assets/templates/tvf-template.sql)
  • Verify all date parameters are STRING type
  • Verify parameter ordering (required first)
  • Verify Top N uses ROW_NUMBER + WHERE (not LIMIT)
  • Verify all divisions use NULLIF

See references/tvf-examples.md for 5 complete TVF examples covering different patterns.

Phase 3: Metadata (30 min)

  • Add function-level COMMENT using v3.0 bullet-point format
  • Add 2+ example questions per function (BEST FOR)
  • Add COMMENT to every parameter (include format for STRING dates)
  • Add COMMENT to every returned column
  • Add NOT FOR / PREFERRED OVER cross-references where applicable

See references/genie-integration.md for comment format details and Genie misuse prevention.

Phase 4: Testing (30 min)

  • Compile each function (no syntax errors)
  • Execute with valid parameters
  • Test edge cases (empty results, null values)
  • Validate results against metric view (ratio ≈ 1.0)
  • Verify results match expectations

See scripts/validate_tvfs.sql for ready-to-run validation queries.

Phase 5: Deployment

  • Add to Asset Bundle job (sql_task with warehouse_id)
  • Deploy: databricks bundle deploy -t dev
  • Run: databricks bundle run gold_setup_job -t dev
  • Test in Genie Space (if applicable)

See references/tvf-patterns.md (Asset Bundle Deployment section) for job YAML patterns.

Time Estimates:

PhaseDurationActivities
Phase 1: Planning30 minRequirements gathering, question mapping
Phase 2: SQL Development1-2 hoursWrite 10-15 TVFs with schema validation
Phase 3: Metadata30 minv3.0 comments, parameter/column documentation
Phase 4: Testing30 minCompile, execute, validate against metric views
Phase 5: Deployment30 minAsset Bundle deploy, Genie Space testing
Total2-3 hoursFor 10-15 TVFs

Common Mistakes to Avoid

Don't:

  • Use DATE parameters (use STRING with CAST)
  • Mix DEFAULT and non-DEFAULT parameters
  • Use parameters in LIMIT clauses
  • Re-join tables after aggregation (cartesian product)
  • Skip schema validation before writing SQL

Do:

  • Validate schemas from YAML before coding
  • Use STRING for date parameters
  • Put required parameters before optional ones
  • Use WHERE rank <= param instead of LIMIT param
  • Single aggregation pass, no self-joins

Reference Files

  • references/tvf-patterns.md — SQL patterns, parameter types, cartesian product prevention, schema validation, SQL file organization, Asset Bundle deployment
  • references/genie-integration.md — Genie compatibility, v3.0 comment format, misuse prevention, professional language standards
  • references/tvf-examples.md — 5 complete production-ready TVF examples (ranking, drilldown, product, geographic, temporal)
  • references/tvf-planning-guide.md — Question categorization, TVF planning tables, domain-specific examples (Retail, Healthcare, Finance, Hospitality)

Assets

  • assets/templates/tvf-template.sql — Starter SQL template for new TVFs with v3.0 comment format
  • assets/templates/tvf-requirements-template.md — Requirements gathering template (fill in before coding)

Scripts

  • scripts/validate_tvfs.sql — Post-deployment validation queries (list, describe, test, compare to metric views)

References

Official Documentation

Related Skills

  • metric-views-patterns - Metric view YAML structure
  • genie-space-patterns - Genie Space setup

Version History

  • v2.0 (Feb 2026) - Merged comprehensive TVF creation workflow: Quick Start, requirements gathering template, 5 complete examples, planning guide with domain-specific examples, implementation workflow (5 phases), Asset Bundle deployment patterns, validation queries, common business naming patterns
  • v1.3 (Dec 16, 2025) - Standardized TVF comment format v3.0 for Genie optimization
  • v1.2 (Dec 15, 2025) - Critical bug prevention: Cartesian product in aggregations
  • v1.1 (Dec 2025) - Major enhancement: Schema-first development patterns
  • v1.0 (Oct 2025) - Initial rule based on 15 TVF deployment learnings