Cube Data Modeling
Reference Index
Load the relevant reference file based on the task:
| Task | Reference File |
|---|---|
Define cubes (name, sql_table, sql, extends, refresh_key, data_source) or views (cubes, join_path, includes/excludes, folders) | cubes-and-views.md |
Define measures (type, sql, filters, rolling_window, time_shift, drill_members, format) or dimensions (type, sql, primary_key, sub_query, granularities, case) or hierarchies | measures-and-dimensions.md |
Configure joins (relationship, sql) or segments (sql) | joins-and-segments.md |
Configure pre-aggregations (type, measures, dimensions, time_dimension, granularity, partition_granularity, refresh_key, indexes) | pre-aggregations.md |
Implement calculated measures, proxy/subquery dimensions, multi-stage calculations, extends, polymorphic cubes, data blending, or calendar cubes | advanced-patterns.md |
Look up measure/dimension types, formats, or context variables (CUBE, FILTER_PARAMS, SQL_UTILS, COMPILE_CONTEXT) | types-formats-variables.md |
Use cube Python package or cube_dbt for dbt integration | packages.md |
YAML Syntax Essentials
File Structure
model/
├── cubes/
│ ├── orders.yml
│ └── users.yml
└── views/
└── orders_view.yml
References in YAML
# Column reference
sql: status
# Member reference (same cube)
sql: "{count} * 100.0 / {total}"
# Cross-cube member reference
sql: "{users.count}"
# Column with CUBE variable (required in extended cubes)
sql: "{CUBE}.column_name"
# Join path reference (disambiguate diamond joins)
sql: "{orders.users.country}"
# cube.sql() — reference another cube's SQL
sql: "SELECT * FROM {other_cube.sql()} WHERE type = 'active'"
# Escape literal braces in YAML with backslash
sql: "CAST('\\{\"key\":\"value\"\\}'::JSON AS TEXT)"
Naming Conventions
- •Start with a letter, use only letters/numbers/underscore
- •Use snake_case:
order_count,created_at,total_revenue - •Avoid Python reserved words:
from,return,yield
Core Best Practices
1. Always Define Primary Keys
Required on any cube with joins to prevent fanout in aggregations:
dimensions:
- name: id
sql: id
type: number
primary_key: true
2. Always Use {CUBE} in Extended Cubes
# CORRECT
sql: "{CUBE}.column_name"
# WRONG — breaks when cube is extended
sql: "base_cube.column_name"
3. Time Dimensions Must Be TIMESTAMP
For BigQuery, always wrap DATE columns:
dimensions:
- name: created_at
sql: "TIMESTAMP({CUBE}.created_at)"
type: time
4. Prefer sql_table Over sql
# Preferred for simple tables
cubes:
- name: orders
sql_table: public.orders
# Use sql only for complex queries
cubes:
- name: active_orders
sql: "SELECT * FROM orders WHERE status = 'active'"
5. Measure Additivity
Only these types are additive (can be rolled up in pre-aggregations):
- •
count,count_distinct_approx,sum,min,max
Non-additive (cannot be reliably pre-aggregated alone):
- •
avg,count_distinct,number
For non-additive measures, decompose into additive leaf measures:
measures:
- name: total_amount
sql: amount
type: sum
- name: count
type: count
- name: avg_amount
sql: "{total_amount} / NULLIF({count}, 0)"
type: number
6. Join Direction Matters
Joins are directed LEFT JOINs. The declaring cube is the base table:
# orders LEFT JOIN customers — all orders included
cubes:
- name: orders
joins:
- name: customers
relationship: many_to_one
sql: "{CUBE}.customer_id = {customers.id}"
Avoid bidirectional joins (both cubes joining to each other) — causes ambiguity.
7. Views for API Consumers
Expose cubes via views for governance. Views don't define pre-aggregations — they reuse underlying cubes':
views:
- name: orders_overview
cubes:
- join_path: orders
includes:
- count
- total_amount
- created_at
- join_path: orders.customers
prefix: true
includes:
- name
- email
8. Pre-aggregation Essentials
Include ALL measures and dimensions that will be queried:
pre_aggregations:
- name: main
measures:
- count
- total_amount
dimensions:
- status
time_dimension: created_at
granularity: day
partition_granularity: month
refresh_key:
every: 1 hour
Common Patterns Quick Reference
Filtered Measure
measures:
- name: completed_count
type: count
filters:
- sql: "{CUBE}.status = 'completed'"
Calculated Measure (ratio)
measures:
- name: completion_rate
sql: "100.0 * {completed_count} / NULLIF({count}, 0)"
type: number
format: percent
Subquery Dimension
dimensions:
- name: order_count
sql: "{orders.count}"
type: number
sub_query: true
Many-to-Many via Junction
cubes:
- name: posts
joins:
- name: post_tags
relationship: one_to_many
sql: "{CUBE}.id = {post_tags.post_id}"
- name: post_tags
joins:
- name: tags
relationship: many_to_one
sql: "{CUBE}.tag_id = {tags.id}"
Segment (reusable filter)
segments:
- name: active_users
sql: "{CUBE}.status = 'active' AND {CUBE}.last_login > CURRENT_DATE - 30"
Common Pitfalls
- •Missing primary key on joined cube — causes incorrect aggregations
- •Bare column name for time dimensions on BigQuery — causes
Could not cast literal "UTC"error; wrap withTIMESTAMP() - •Using
primary_key: trueon time dimensions — DataFusion (SQL API) silently drops them - •Diamond joins without join paths — ambiguous results; use
join_pathin views or join paths in member references - •GROUP BY in cube SQL — cubes should return plain rows, not aggregated data
- •Heavy use of FILTER_PARAMS — bad practice; use only for optimizer hints and partition filtering
- •Non-additive measures in pre-aggregations — decompose into leaf measures first
- •SQL reserved words in column names — DataFusion silently drops them; use SQL aliases