AgentSkillsCN

cube-modeling

Cube数据建模语言专家,擅长使用YAML或JavaScript语法构建语义层Cube、视图、度量、维度、关联、分段、预聚合及访问策略。在创建或修改Cube数据模型(.yml/.js文件)、设计Cube架构、为提升性能配置预聚合、设置Cube间的关联、定义计算度量,或实现多态Cube、数据融合、多阶段计算以及日历Cube等高级模式时,均可借助此专家的专业知识。涵盖完整的Cube建模语言参考与最佳实践。

SKILL.md
--- frontmatter
name: cube-modeling
description: >
  Cube data modeling language expert for building semantic layer cubes, views, measures, dimensions,
  joins, segments, pre-aggregations, and access policies using YAML or JavaScript syntax. Use when
  creating or modifying Cube data models (.yml/.js files), designing cube schemas, configuring
  pre-aggregations for performance, setting up joins between cubes, defining calculated measures,
  or implementing advanced patterns like polymorphic cubes, data blending, multi-stage calculations,
  and calendar cubes. Covers the complete Cube modeling language reference and best practices.

Cube Data Modeling

Reference Index

Load the relevant reference file based on the task:

TaskReference File
Define cubes (name, sql_table, sql, extends, refresh_key, data_source) or views (cubes, join_path, includes/excludes, folders)cubes-and-views.md
Define measures (type, sql, filters, rolling_window, time_shift, drill_members, format) or dimensions (type, sql, primary_key, sub_query, granularities, case) or hierarchiesmeasures-and-dimensions.md
Configure joins (relationship, sql) or segments (sql)joins-and-segments.md
Configure pre-aggregations (type, measures, dimensions, time_dimension, granularity, partition_granularity, refresh_key, indexes)pre-aggregations.md
Implement calculated measures, proxy/subquery dimensions, multi-stage calculations, extends, polymorphic cubes, data blending, or calendar cubesadvanced-patterns.md
Look up measure/dimension types, formats, or context variables (CUBE, FILTER_PARAMS, SQL_UTILS, COMPILE_CONTEXT)types-formats-variables.md
Use cube Python package or cube_dbt for dbt integrationpackages.md

YAML Syntax Essentials

File Structure

code
model/
├── cubes/
│   ├── orders.yml
│   └── users.yml
└── views/
    └── orders_view.yml

References in YAML

yaml
# Column reference
sql: status

# Member reference (same cube)
sql: "{count} * 100.0 / {total}"

# Cross-cube member reference
sql: "{users.count}"

# Column with CUBE variable (required in extended cubes)
sql: "{CUBE}.column_name"

# Join path reference (disambiguate diamond joins)
sql: "{orders.users.country}"

# cube.sql() — reference another cube's SQL
sql: "SELECT * FROM {other_cube.sql()} WHERE type = 'active'"

# Escape literal braces in YAML with backslash
sql: "CAST('\\{\"key\":\"value\"\\}'::JSON AS TEXT)"

Naming Conventions

  • Start with a letter, use only letters/numbers/underscore
  • Use snake_case: order_count, created_at, total_revenue
  • Avoid Python reserved words: from, return, yield

Core Best Practices

1. Always Define Primary Keys

Required on any cube with joins to prevent fanout in aggregations:

yaml
dimensions:
  - name: id
    sql: id
    type: number
    primary_key: true

2. Always Use {CUBE} in Extended Cubes

yaml
# CORRECT
sql: "{CUBE}.column_name"

# WRONG — breaks when cube is extended
sql: "base_cube.column_name"

3. Time Dimensions Must Be TIMESTAMP

For BigQuery, always wrap DATE columns:

yaml
dimensions:
  - name: created_at
    sql: "TIMESTAMP({CUBE}.created_at)"
    type: time

4. Prefer sql_table Over sql

yaml
# Preferred for simple tables
cubes:
  - name: orders
    sql_table: public.orders

# Use sql only for complex queries
cubes:
  - name: active_orders
    sql: "SELECT * FROM orders WHERE status = 'active'"

5. Measure Additivity

Only these types are additive (can be rolled up in pre-aggregations):

  • count, count_distinct_approx, sum, min, max

Non-additive (cannot be reliably pre-aggregated alone):

  • avg, count_distinct, number

For non-additive measures, decompose into additive leaf measures:

yaml
measures:
  - name: total_amount
    sql: amount
    type: sum
  - name: count
    type: count
  - name: avg_amount
    sql: "{total_amount} / NULLIF({count}, 0)"
    type: number

6. Join Direction Matters

Joins are directed LEFT JOINs. The declaring cube is the base table:

yaml
# orders LEFT JOIN customers — all orders included
cubes:
  - name: orders
    joins:
      - name: customers
        relationship: many_to_one
        sql: "{CUBE}.customer_id = {customers.id}"

Avoid bidirectional joins (both cubes joining to each other) — causes ambiguity.

7. Views for API Consumers

Expose cubes via views for governance. Views don't define pre-aggregations — they reuse underlying cubes':

yaml
views:
  - name: orders_overview
    cubes:
      - join_path: orders
        includes:
          - count
          - total_amount
          - created_at
      - join_path: orders.customers
        prefix: true
        includes:
          - name
          - email

8. Pre-aggregation Essentials

Include ALL measures and dimensions that will be queried:

yaml
pre_aggregations:
  - name: main
    measures:
      - count
      - total_amount
    dimensions:
      - status
    time_dimension: created_at
    granularity: day
    partition_granularity: month
    refresh_key:
      every: 1 hour

Common Patterns Quick Reference

Filtered Measure

yaml
measures:
  - name: completed_count
    type: count
    filters:
      - sql: "{CUBE}.status = 'completed'"

Calculated Measure (ratio)

yaml
measures:
  - name: completion_rate
    sql: "100.0 * {completed_count} / NULLIF({count}, 0)"
    type: number
    format: percent

Subquery Dimension

yaml
dimensions:
  - name: order_count
    sql: "{orders.count}"
    type: number
    sub_query: true

Many-to-Many via Junction

yaml
cubes:
  - name: posts
    joins:
      - name: post_tags
        relationship: one_to_many
        sql: "{CUBE}.id = {post_tags.post_id}"
  - name: post_tags
    joins:
      - name: tags
        relationship: many_to_one
        sql: "{CUBE}.tag_id = {tags.id}"

Segment (reusable filter)

yaml
segments:
  - name: active_users
    sql: "{CUBE}.status = 'active' AND {CUBE}.last_login > CURRENT_DATE - 30"

Common Pitfalls

  1. Missing primary key on joined cube — causes incorrect aggregations
  2. Bare column name for time dimensions on BigQuery — causes Could not cast literal "UTC" error; wrap with TIMESTAMP()
  3. Using primary_key: true on time dimensions — DataFusion (SQL API) silently drops them
  4. Diamond joins without join paths — ambiguous results; use join_path in views or join paths in member references
  5. GROUP BY in cube SQL — cubes should return plain rows, not aggregated data
  6. Heavy use of FILTER_PARAMS — bad practice; use only for optimizer hints and partition filtering
  7. Non-additive measures in pre-aggregations — decompose into leaf measures first
  8. SQL reserved words in column names — DataFusion silently drops them; use SQL aliases