CQL Type System & Schema Handling
This skill provides guidance on implementing the Cassandra CQL type system with schema-provided deserialization.
When to Use This Skill
- Implementing CQL type deserializers
- Parsing collection types (list, set, map)
- Handling User-Defined Types (UDTs)
- Working with frozen vs non-frozen types
- Tuple deserialization
- Schema validation
- Type-correct data generation
Core Principles
Schema-Provided Deserialization
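Per the PRD, the schema is passed in by the caller; it is never inferred from the data.
    // Schema provides type information
    fn deserialize_cell(
        data: &[u8],
        column_type: &CqlType, // From schema
    ) -> Result<CqlValue>
Never try to infer a type from the data alone; always use the schema. The bytes are not self-describing: an 8-byte cell could equally be a bigint, a timestamp, a double, or a time value.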
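To make the point concrete, the sketch below decodes the same 8 bytes under three different column types. `ColumnKind`, `CellValue`, and `decode_with_schema` are hypothetical names used only for this illustration; they stand in for the project's real `CqlType` and `CqlValue`.

```rust
use std::convert::TryFrom;

/// Hypothetical, trimmed-down stand-in for the schema's column type.
#[derive(Debug, Clone, Copy)]
pub enum ColumnKind {
    BigInt,
    Timestamp,
    Double,
}

/// Hypothetical, trimmed-down stand-in for a decoded value.
#[derive(Debug, PartialEq)]
pub enum CellValue {
    BigInt(i64),
    /// Milliseconds since the Unix epoch.
    Timestamp(i64),
    Double(f64),
}

/// Decode an 8-byte cell according to the type the schema supplies.
/// Returns None if the cell is not exactly 8 bytes long.
pub fn decode_with_schema(data: &[u8], kind: ColumnKind) -> Option<CellValue> {
    let raw = <[u8; 8]>::try_from(data).ok()?;
    Some(match kind {
        ColumnKind::BigInt => CellValue::BigInt(i64::from_be_bytes(raw)),
        ColumnKind::Timestamp => CellValue::Timestamp(i64::from_be_bytes(raw)),
        ColumnKind::Double => CellValue::Double(f64::from_be_bytes(raw)),
    })
}
```

The bytes of the double 1.0 decode as the double 1.0 under `ColumnKind::Double` but as a large integer under `ColumnKind::BigInt`; nothing in the bytes themselves says which reading is intended.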
CQL Type Categories
1. Primitive Types
Fixed-Size Primitives
- boolean - 1 byte (0x00 or 0x01)
- tinyint - 1 byte signed
- smallint - 2 bytes signed, big-endian
- int - 4 bytes signed, big-endian
- bigint - 8 bytes signed, big-endian
- float - 4 bytes IEEE 754, big-endian
- double - 8 bytes IEEE 754, big-endian
- date - 4 bytes unsigned, big-endian (days, with the Unix epoch at 2^31)
- time - 8 bytes signed, big-endian (nanoseconds since midnight)
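As a rough sketch of how the fixed-size encodings above can be decoded with an exact-length check, consider the helpers below. The function names are hypothetical, and `decode_date` assumes the native-protocol encoding of `date` as an unsigned 32-bit day count with the Unix epoch at 2^31.

```rust
use std::convert::TryFrom;

/// Decode a CQL `boolean`: exactly one byte, zero meaning false.
pub fn decode_boolean(data: &[u8]) -> Option<bool> {
    match data {
        [b] => Some(*b != 0),
        _ => None,
    }
}

/// Decode a CQL `smallint`: exactly two bytes, signed, big-endian.
pub fn decode_smallint(data: &[u8]) -> Option<i16> {
    <[u8; 2]>::try_from(data).ok().map(i16::from_be_bytes)
}

/// Decode a CQL `date` into days relative to 1970-01-01.
/// Assumption: the stored value is unsigned with the epoch at 2^31.
pub fn decode_date(data: &[u8]) -> Option<i64> {
    let raw = <[u8; 4]>::try_from(data).ok()?;
    Some(i64::from(u32::from_be_bytes(raw)) - (1i64 << 31))
}
```

Rejecting any cell whose length differs from the type's fixed width catches schema mismatches early instead of producing a plausible but wrong value.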
Variable-Size Primitives
- text/varchar - UTF-8 encoded string
- blob - raw bytes
- ascii - ASCII-only string
Special Primitives
- uuid/timeuuid - 16 bytes
- inet - 4 bytes (IPv4) or 16 bytes (IPv6)
- varint - variable-length big integer
- decimal - scale (4 bytes) + unscaled varint
- duration - months, days, nanoseconds (3 VInts)
- timestamp - 8 bytes (milliseconds since Unix epoch)
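The `varint` and `decimal` layouts can be sketched as follows, assuming `varint` is a big-endian two's-complement integer. Both function names are hypothetical, and the `i128` limit is a simplification for the example; a real implementation would likely fall back to a big-integer type for wider values.

```rust
/// Decode a CQL `varint` (big-endian two's complement) into an i128.
/// Returns None for empty input or for values wider than 16 bytes.
pub fn decode_varint_i128(data: &[u8]) -> Option<i128> {
    if data.is_empty() || data.len() > 16 {
        return None;
    }
    // Sign-extend: pad with 0xFF when the top bit of the first byte is set.
    let fill: u8 = if data[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    buf[16 - data.len()..].copy_from_slice(data);
    Some(i128::from_be_bytes(buf))
}

/// Decode a CQL `decimal` into (unscaled, scale).
/// The represented value is unscaled * 10^(-scale).
pub fn decode_decimal(data: &[u8]) -> Option<(i128, i32)> {
    // 4 bytes of scale plus at least one byte of unscaled value.
    if data.len() < 5 {
        return None;
    }
    let (scale_bytes, unscaled) = data.split_at(4);
    let scale = i32::from_be_bytes([
        scale_bytes[0],
        scale_bytes[1],
        scale_bytes[2],
        scale_bytes[3],
    ]);
    Some((decode_varint_i128(unscaled)?, scale))
}
```

For example, the bytes `00 00 00 02 30 39` carry scale 2 and unscaled value 12345, which represents 123.45.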
2. Collection Types
See collections-and-udts.md for detailed format.
Collection Format:
[4 bytes: element_count (big-endian)]
[for each element:]
  [4 bytes: element_size (big-endian)]
  [bytes: element_data]
Types:
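- list<T> - Ordered, allows duplicates
- set<T> - No duplicates; elements are kept in sorted order
- map<K,V> - Key-value pairs sorted by key; each entry is a length-prefixed key followed by a length-prefixed value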
3. Tuple Types
Format:
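[for each element in declaration order:]
  [4 bytes: element_size (big-endian, -1 for null)]
  [bytes: element_data]
There is no element count; the number of elements comes from the tuple type in the schema. Each element is length-prefixed, and its data uses its own type's serialization.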
4. User-Defined Types (UDTs)
Format:
[for each field in schema order:]
  [4 bytes: field_size (-1 for null, 0 for empty, >0 for data)]
  [if size > 0:]
    [bytes: field_data]
UDT schema defines field names and types.
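The UDT layout above can be walked with a small splitter like the one below. This is a minimal sketch: `split_udt_fields` is a hypothetical helper, errors are plain strings for brevity, and treating missing trailing fields as null is an assumption (it models values written before a field was added to the type).

```rust
/// Split a serialized UDT value into per-field byte slices.
/// `None` is a null field; `Some(&[])` is a present but empty field.
pub fn split_udt_fields(
    mut data: &[u8],
    field_count: usize,
) -> Result<Vec<Option<&[u8]>>, String> {
    let mut fields = Vec::with_capacity(field_count);
    for index in 0..field_count {
        // Assumption: fields missing from the end of the value read as null.
        if data.is_empty() {
            fields.push(None);
            continue;
        }
        if data.len() < 4 {
            return Err(format!("field {index}: truncated length prefix"));
        }
        let (len_bytes, rest) = data.split_at(4);
        let len = i32::from_be_bytes([len_bytes[0], len_bytes[1], len_bytes[2], len_bytes[3]]);
        if len < 0 {
            // Negative size prefix marks a null field with no payload.
            fields.push(None);
            data = rest;
            continue;
        }
        let len = len as usize;
        if rest.len() < len {
            return Err(format!("field {index}: expected {len} bytes, found {}", rest.len()));
        }
        let (value, rest) = rest.split_at(len);
        fields.push(Some(value));
        data = rest;
    }
    Ok(fields)
}
```

Each returned slice would then be handed to the deserializer for that field's type, taken from the UDT definition in the schema. Keeping null and empty as distinct cases matters because a zero-length text value is not the same as a missing one.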
5. Frozen vs Non-Frozen
Frozen types:
- Serialized as a single blob
- Individual elements cannot be updated; the whole value is replaced
- Required when a collection or UDT is part of a primary key
- Nested collections must be frozen
Non-frozen collections:
- Individual elements can be updated
- Only allowed at the top level (not nested)
- Element deletions are recorded as tombstones
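The nesting rule lends itself to a schema validation check. The sketch below uses a hypothetical `TypeNode` enum (a reduced stand-in for `CqlType`) and a hypothetical `validate_nesting` function; it assumes that everything beneath a `frozen<...>` wrapper counts as frozen.

```rust
/// Hypothetical, reduced stand-in for the schema type enum.
#[derive(Debug, Clone, PartialEq)]
pub enum TypeNode {
    Int,
    Text,
    List(Box<TypeNode>),
    Set(Box<TypeNode>),
    Map(Box<TypeNode>, Box<TypeNode>),
    Frozen(Box<TypeNode>),
}

fn is_collection(ty: &TypeNode) -> bool {
    matches!(ty, TypeNode::List(_) | TypeNode::Set(_) | TypeNode::Map(_, _))
}

/// An element of a non-frozen collection must not itself be a bare collection.
fn check_element(element: &TypeNode) -> Result<(), String> {
    if is_collection(element) {
        Err(format!("nested collection {element:?} must be wrapped in frozen<>"))
    } else {
        Ok(())
    }
}

/// Check the rule "nested collections must be frozen" for a column type.
pub fn validate_nesting(ty: &TypeNode) -> Result<(), String> {
    match ty {
        // Assumption: a frozen value is one blob, so its contents need no check.
        TypeNode::Frozen(_) => Ok(()),
        TypeNode::List(e) | TypeNode::Set(e) => check_element(e),
        TypeNode::Map(k, v) => {
            check_element(k)?;
            check_element(v)
        }
        TypeNode::Int | TypeNode::Text => Ok(()),
    }
}
```

Under this check `map<text, frozen<list<int>>>` passes, while `list<set<int>>` is rejected because the inner set is not frozen.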
Type Deserialization Patterns
Zero-Copy Pattern
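    use bytes::Bytes;

    fn deserialize_text(data: Bytes) -> Result<String> {
        // UTF-8 validation borrows the buffer without copying...
        let s = std::str::from_utf8(&data)?;
        // ...but building an owned String does copy. Keep the Bytes alive and
        // hand out borrowed text instead if the caller can work with that.
        Ok(s.to_string())
    }

    fn deserialize_blob(data: Bytes) -> Result<Bytes> {
        // Zero-copy: Bytes is reference-counted, so returning it moves a
        // handle rather than copying the payload.
        Ok(data)
    }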
Length-Prefixed Pattern
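    fn deserialize_length_prefixed(data: &[u8]) -> Result<(Bytes, &[u8])> {
        // The full 4-byte length prefix must be present.
        if data.len() < 4 {
            return Err(Error::NotEnoughBytes);
        }
        let size = i32::from_be_bytes([data[0], data[1], data[2], data[3]]);
        if size < 0 {
            // Null. This returns the same empty Bytes as a zero-length value;
            // use Option<Bytes> where the caller must tell the two apart.
            return Ok((Bytes::new(), &data[4..]));
        }
        let size = size as usize;
        // Compare against the bytes left after the prefix to avoid overflow.
        if data.len() - 4 < size {
            return Err(Error::NotEnoughBytes);
        }
        // copy_from_slice copies; slicing an owned Bytes would avoid that.
        let value = Bytes::copy_from_slice(&data[4..4 + size]);
        let remaining = &data[4 + size..];
        Ok((value, remaining))
    }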
Collection Pattern
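    fn deserialize_list(
        data: &[u8],
        element_type: &CqlType,
    ) -> Result<Vec<CqlValue>> {
        // The 4-byte element count must be present before it is indexed.
        if data.len() < 4 {
            return Err(Error::NotEnoughBytes);
        }
        let count = i32::from_be_bytes([data[0], data[1], data[2], data[3]]);
        // A negative count would wrap to a huge usize; reject it. NotEnoughBytes
        // is reused here, though a dedicated error variant would be clearer.
        if count < 0 {
            return Err(Error::NotEnoughBytes);
        }
        let count = count as usize;
        let mut offset = 4;
        // Every element carries a 4-byte prefix, so cap the allocation by what
        // the buffer could actually hold rather than trusting the count.
        let mut elements = Vec::with_capacity(count.min(data.len() / 4));
        for _ in 0..count {
            let (element_data, remaining) = deserialize_length_prefixed(&data[offset..])?;
            let element = deserialize_value(&element_data, element_type)?;
            elements.push(element);
            // `remaining` is a suffix of `data`, so this is the absolute offset.
            offset = data.len() - remaining.len();
        }
        Ok(elements)
    }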
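Maps follow the same pattern with two length-prefixed elements per entry. The sketch below is self-contained and uses only the standard library; `read_element` and `split_map_entries` are hypothetical helpers, errors are plain strings, and rejecting negative lengths rests on the assumption that collection elements are never null.

```rust
use std::convert::TryFrom;

/// Read one `[i32 length][bytes]` element and return (value, rest).
fn read_element(data: &[u8]) -> Result<(&[u8], &[u8]), String> {
    if data.len() < 4 {
        return Err("truncated length prefix".to_string());
    }
    let (len_bytes, rest) = data.split_at(4);
    let len = i32::from_be_bytes([len_bytes[0], len_bytes[1], len_bytes[2], len_bytes[3]]);
    // Assumption: null (negative length) is not valid inside a collection.
    let len = usize::try_from(len).map_err(|_| "negative element length".to_string())?;
    if rest.len() < len {
        return Err(format!("expected {len} bytes, found {}", rest.len()));
    }
    Ok(rest.split_at(len))
}

/// Split a serialized map into raw (key, value) byte slices.
pub fn split_map_entries(data: &[u8]) -> Result<Vec<(&[u8], &[u8])>, String> {
    if data.len() < 4 {
        return Err("truncated entry count".to_string());
    }
    let (count_bytes, mut rest) = data.split_at(4);
    let count = i32::from_be_bytes([count_bytes[0], count_bytes[1], count_bytes[2], count_bytes[3]]);
    let count = usize::try_from(count).map_err(|_| "negative entry count".to_string())?;
    // Each entry needs at least two 4-byte prefixes, so cap the allocation.
    let mut entries = Vec::with_capacity(count.min(rest.len() / 8));
    for _ in 0..count {
        let (key, after_key) = read_element(rest)?;
        let (value, after_value) = read_element(after_key)?;
        entries.push((key, value));
        rest = after_value;
    }
    Ok(entries)
}
```

The key and value slices would then be deserialized with the key type and value type from the schema's `map<K,V>` definition.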
Schema Handling
Schema Sources
- Statistics.db: Serialization header with column definitions
- System tables: system_schema.tables, system_schema.columns
- CQL schema file: For test data generation
Schema Representation
    struct TableSchema {
        keyspace: String,
        table: String,
        partition_keys: Vec<ColumnDef>,
        clustering_keys: Vec<ColumnDef>,
        regular_columns: Vec<ColumnDef>,
        static_columns: Vec<ColumnDef>,
    }

    struct ColumnDef {
        name: String,
        cql_type: CqlType,
    }

    enum CqlType {
        // Primitives
        Boolean,
        Int,
        BigInt,
        Text,
        Uuid,
        Timestamp,
        // ... more primitives
        // Collections
        List(Box<CqlType>),
        Set(Box<CqlType>),
        Map(Box<CqlType>, Box<CqlType>),
        // Complex
        Tuple(Vec<CqlType>),
        Udt(UdtDef),
        // Modifiers
        Frozen(Box<CqlType>),
    }
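When the schema comes from a CQL schema file, type strings such as `map<text, frozen<list<int>>>` have to be turned into a type tree like the one above. One possible approach is a small recursive-descent parser, sketched here. `ParsedType`, `parse_type`, and the helper names are hypothetical, and only a handful of type names are handled to keep the example short.

```rust
/// Hypothetical, reduced stand-in for the schema type enum.
#[derive(Debug, Clone, PartialEq)]
pub enum ParsedType {
    Boolean,
    Int,
    BigInt,
    Text,
    List(Box<ParsedType>),
    Set(Box<ParsedType>),
    Map(Box<ParsedType>, Box<ParsedType>),
    Frozen(Box<ParsedType>),
}

/// Parse a complete CQL type string, rejecting trailing input.
pub fn parse_type(input: &str) -> Result<ParsedType, String> {
    let (ty, rest) = parse_inner(input)?;
    if rest.trim().is_empty() {
        Ok(ty)
    } else {
        Err(format!("unexpected trailing input: {rest:?}"))
    }
}

/// Consume one expected delimiter, skipping leading whitespace.
fn expect_char(input: &str, ch: char) -> Result<&str, String> {
    input
        .trim_start()
        .strip_prefix(ch)
        .ok_or_else(|| format!("expected {ch:?} in {input:?}"))
}

/// Parse one type and return it with the unconsumed remainder.
fn parse_inner(input: &str) -> Result<(ParsedType, &str), String> {
    let input = input.trim_start();
    // The type name runs up to the first delimiter.
    let end = input
        .find(|c: char| matches!(c, '<' | '>' | ','))
        .unwrap_or(input.len());
    let (name, rest) = input.split_at(end);
    let name = name.trim().to_ascii_lowercase();
    match name.as_str() {
        "boolean" => Ok((ParsedType::Boolean, rest)),
        "int" => Ok((ParsedType::Int, rest)),
        "bigint" => Ok((ParsedType::BigInt, rest)),
        "text" | "varchar" => Ok((ParsedType::Text, rest)),
        "list" | "set" | "frozen" => {
            let rest = expect_char(rest, '<')?;
            let (inner, rest) = parse_inner(rest)?;
            let rest = expect_char(rest, '>')?;
            let inner = Box::new(inner);
            let ty = match name.as_str() {
                "list" => ParsedType::List(inner),
                "set" => ParsedType::Set(inner),
                _ => ParsedType::Frozen(inner),
            };
            Ok((ty, rest))
        }
        "map" => {
            let rest = expect_char(rest, '<')?;
            let (key, rest) = parse_inner(rest)?;
            let rest = expect_char(rest, ',')?;
            let (value, rest) = parse_inner(rest)?;
            let rest = expect_char(rest, '>')?;
            Ok((ParsedType::Map(Box::new(key), Box::new(value)), rest))
        }
        other => Err(format!("unsupported type name: {other:?}")),
    }
}
```

Keeping `Frozen` as an explicit wrapper node, as the enum above does, means the parser does not need to carry a separate "frozen" flag through the recursion.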
PRD Alignment
Supports Milestone M1 (Core Reading Library):
- All CQL types including collections & UDTs
- Schema-provided deserialization (not inferred)
- Zero-copy patterns where possible
Supports Milestone M5 (Write Support):
- Type-correct serialization
- Schema validation
Common Pitfalls
1. Inferring Types
❌ Wrong: Look at data to guess type
✅ Right: Use schema to know type
2. Copying Unnecessarily
❌ Wrong: Vec<u8> for every field
✅ Right: Bytes with zero-copy slicing
3. Ignoring Null Handling
❌ Wrong: Assume all fields present
✅ Right: Check for null (-1 size prefix)
4. Frozen Semantics
❌ Wrong: Try to update frozen collection elements
✅ Right: Replace entire frozen value
5. Nested Collections
❌ Wrong: Allow non-frozen nested collections
✅ Right: Nested collections must be frozen
Type System References
Detailed specifications in:
- cql-types-reference.md - Complete type catalog
- collections-and-udts.md - Collection and UDT formats
Testing
Generate type-correct test data:
    # Use test-data-management skill for Docker-based generation
    cd test-data
    ./scripts/start-clean.sh
    ./scripts/generate.sh
Validate parsing against sstabledump:
sstabledump test-data/datasets/sstables/keyspace/table/*.db
Next Steps
When adding new type support:
- Add to CqlType enum
- Implement deserializer with zero-copy where possible
- Add serializer (for M5 write support)
- Create property tests with edge cases
- Generate test data with type
- Validate against sstabledump
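A round-trip check is one simple shape such tests can take: serialize a value, deserialize it, and compare. The sketch below does this for `list<int>` using only the standard library. `encode_int_list`, `decode_int_list`, and `round_trip_holds` are hypothetical names; a property-testing crate could generate the inputs, but fixed edge cases are used here to keep the example self-contained.

```rust
use std::convert::TryFrom;

/// Serialize a list<int>: [i32 count], then [i32 len = 4][4-byte value] per element.
/// Note: the count is truncated for lists longer than i32::MAX elements.
pub fn encode_int_list(values: &[i32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + values.len() * 8);
    out.extend_from_slice(&(values.len() as i32).to_be_bytes());
    for v in values {
        out.extend_from_slice(&4i32.to_be_bytes());
        out.extend_from_slice(&v.to_be_bytes());
    }
    out
}

/// Deserialize a list<int>; returns None on any malformed input.
pub fn decode_int_list(data: &[u8]) -> Option<Vec<i32>> {
    let count_raw = <[u8; 4]>::try_from(data.get(..4)?).ok()?;
    let count = usize::try_from(i32::from_be_bytes(count_raw)).ok()?;
    let mut rest = data.get(4..)?;
    // Each element occupies 8 bytes, so cap the allocation by the buffer size.
    let mut values = Vec::with_capacity(count.min(rest.len() / 8));
    for _ in 0..count {
        let len = i32::from_be_bytes(<[u8; 4]>::try_from(rest.get(..4)?).ok()?);
        if len != 4 {
            return None;
        }
        let raw = <[u8; 4]>::try_from(rest.get(4..8)?).ok()?;
        values.push(i32::from_be_bytes(raw));
        rest = &rest[8..];
    }
    // Trailing bytes after the last element indicate corruption.
    if rest.is_empty() { Some(values) } else { None }
}

/// True when encoding then decoding returns the original values.
pub fn round_trip_holds(values: &[i32]) -> bool {
    decode_int_list(&encode_int_list(values)).as_deref() == Some(values)
}
```

Useful edge cases include the empty list and the extremes `i32::MIN` and `i32::MAX`, along with truncated buffers that must be rejected rather than panic.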