CQL Type System & Schema Handling

This skill provides guidance on implementing the Cassandra CQL type system with schema-provided deserialization.

When to Use This Skill

•Implementing CQL type deserializers
•Parsing collection types (list, set, map)
•Handling User-Defined Types (UDTs)
•Working with frozen vs non-frozen types
•Tuple deserialization
•Schema validation
•Type-correct data generation

Core Principles

Schema-Provided Deserialization

Per the PRD, the schema is passed in by the caller; it is never inferred from the data.

rust

// Schema provides type information
fn deserialize_cell(
    data: &[u8],
    column_type: &CqlType,  // From schema
) -> Result<CqlValue>

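Never try to infer a type from the data alone; always use the schema. The same bytes can be valid under several types: four bytes could be an int, a float, a date, or a four-character string.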
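The principle above can be illustrated with a small dispatcher. This is a minimal sketch, not the project's implementation: the trimmed-down `CqlType`, `CqlValue`, and `DecodeError` types below are hypothetical stand-ins that cover only three primitives.

```rust
// Hypothetical, trimmed-down stand-ins for the real CqlType / CqlValue.
#[derive(Debug, Clone, PartialEq)]
enum CqlType {
    Boolean,
    Int,
    Text,
}

#[derive(Debug, Clone, PartialEq)]
enum CqlValue {
    Boolean(bool),
    Int(i32),
    Text(String),
}

#[derive(Debug, PartialEq)]
enum DecodeError {
    WrongLength { expected: usize, actual: usize },
    InvalidUtf8,
}

/// Decode one cell using the type declared in the schema.
/// The bytes alone never decide which branch runs.
fn deserialize_cell(data: &[u8], column_type: &CqlType) -> Result<CqlValue, DecodeError> {
    match column_type {
        CqlType::Boolean => match data {
            [b] => Ok(CqlValue::Boolean(*b != 0)),
            _ => Err(DecodeError::WrongLength { expected: 1, actual: data.len() }),
        },
        CqlType::Int => match data {
            [a, b, c, d] => Ok(CqlValue::Int(i32::from_be_bytes([*a, *b, *c, *d]))),
            _ => Err(DecodeError::WrongLength { expected: 4, actual: data.len() }),
        },
        CqlType::Text => std::str::from_utf8(data)
            .map(|s| CqlValue::Text(s.to_owned()))
            .map_err(|_| DecodeError::InvalidUtf8),
    }
}
```

Feeding the bytes of "abcd" through this function yields a text value under `CqlType::Text` and a large integer under `CqlType::Int`, which is why guessing from the data is unreliable.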

CQL Type Categories

1. Primitive Types

Fixed-Size Primitives

•boolean - 1 byte (0x00 or 0x01)
•tinyint - 1 byte signed
•smallint - 2 bytes signed, big-endian
•int - 4 bytes signed, big-endian
•bigint - 8 bytes signed, big-endian
•float - 4 bytes IEEE 754, big-endian
•double - 8 bytes IEEE 754, big-endian
•date - 4 bytes unsigned, big-endian (day number, with 1970-01-01 stored as 2^31)
•time - 8 bytes signed, big-endian (nanoseconds since midnight)

Variable-Size Primitives

•text/varchar - UTF-8 encoded string
•blob - raw bytes
•ascii - ASCII-only string

Special Primitives

•uuid/timeuuid - 16 bytes
•inet - 4 bytes (IPv4) or 16 bytes (IPv6)
•varint - variable-length big integer
•decimal - scale (4 bytes) + unscaled varint
•duration - months, days, nanoseconds (3 VInts)
•timestamp - 8 bytes (milliseconds since Unix epoch)
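The varint and decimal layouts listed above can be sketched as follows. This is a minimal sketch under two assumptions: the varint is big-endian two's complement, and the value fits in an `i128` (wider values would need a big-integer type). The function names are hypothetical.

```rust
/// Decode a CQL varint (big-endian two's complement) into an i128.
/// Inputs that are empty or wider than 16 bytes are rejected in this sketch.
fn decode_varint_i128(data: &[u8]) -> Option<i128> {
    if data.is_empty() || data.len() > 16 {
        return None;
    }
    // Sign-extend: start from all ones if the top bit is set, otherwise zero.
    let mut acc: i128 = if data[0] & 0x80 != 0 { -1 } else { 0 };
    for &b in data {
        acc = (acc << 8) | i128::from(b);
    }
    Some(acc)
}

/// Decode a CQL decimal into (unscaled, scale).
/// The numeric value is unscaled * 10^(-scale).
fn decode_decimal(data: &[u8]) -> Option<(i128, i32)> {
    // 4 bytes of scale plus at least one byte of unscaled value.
    if data.len() < 5 {
        return None;
    }
    let scale = i32::from_be_bytes([data[0], data[1], data[2], data[3]]);
    let unscaled = decode_varint_i128(&data[4..])?;
    Some((unscaled, scale))
}
```

For example, 12.34 is stored as scale 2 followed by the varint 1234 (bytes 0x04 0xD2).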

2. Collection Types

See collections-and-udts.md for the detailed format.

Collection Format:

code

[4 bytes: element_count (big-endian)]
[for each element:]
    [4 bytes: element_size (big-endian)]
    [bytes: element_data]

Types:

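•list<T> - Ordered by position, allows duplicates
•set<T> - Unique elements, kept in sorted order
•map<K,V> - Key-value pairs with unique keys, sorted by key; each entry is serialized as a key element followed by a value element, and element_count counts entries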
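A map uses the same length-prefixed layout, with each entry written as a key element followed by a value element. The sketch below splits a serialized map into raw key/value slices and leaves decoding of each slice to the schema's key and value types. It assumes the count is the number of entries and that collection elements are never null; `DecodeError` and the helper names are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    NotEnoughBytes,
    NegativeLength,
}

/// Read a 4-byte big-endian i32 and return it with the rest of the input.
fn read_i32(data: &[u8]) -> Result<(i32, &[u8]), DecodeError> {
    match data {
        [a, b, c, d, rest @ ..] => Ok((i32::from_be_bytes([*a, *b, *c, *d]), rest)),
        _ => Err(DecodeError::NotEnoughBytes),
    }
}

/// Read one length-prefixed element; a negative length is treated as an error
/// because this sketch assumes collection elements cannot be null.
fn read_element(data: &[u8]) -> Result<(&[u8], &[u8]), DecodeError> {
    let (len, rest) = read_i32(data)?;
    if len < 0 {
        return Err(DecodeError::NegativeLength);
    }
    let len = len as usize;
    if rest.len() < len {
        return Err(DecodeError::NotEnoughBytes);
    }
    Ok(rest.split_at(len))
}

/// Split a serialized map into borrowed (key, value) byte slices.
fn deserialize_map_raw(data: &[u8]) -> Result<Vec<(&[u8], &[u8])>, DecodeError> {
    let (count, mut rest) = read_i32(data)?;
    if count < 0 {
        return Err(DecodeError::NegativeLength);
    }
    let mut entries = Vec::new();
    for _ in 0..count {
        let (key, after_key) = read_element(rest)?;
        let (value, after_value) = read_element(after_key)?;
        entries.push((key, value));
        rest = after_value;
    }
    Ok(entries)
}
```

The returned slices borrow from the input buffer, so no element bytes are copied until a typed value is built from them.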

3. Tuple Types

Format:

code

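[for each element in declaration order:]
    [4 bytes: element_size (-1 for null)]
    [if size > 0:]
        [bytes: element_data]

No element count is stored; the number and types of elements come from the schema. Each element is length-prefixed and uses its own type's serialization, the same layout as UDT fields.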

4. User-Defined Types (UDTs)

Format:

code

[for each field in schema order:]
    [4 bytes: field_size (-1 for null, 0 for empty, >0 for data)]
    [if size > 0:]
        [bytes: field_data]

The UDT schema defines field names, types, and order; the serialized value carries none of them.
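The field layout above can be walked as shown below. This is a hedged sketch: `split_udt_fields` and `DecodeError` are hypothetical names, and field names are passed in as a plain slice standing in for the UDT schema. Null fields come back as `None`, while empty-but-present fields come back as `Some` of an empty slice.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    NotEnoughBytes,
}

/// Split a serialized UDT into per-field byte slices, paired with the
/// field names supplied by the schema (in schema order).
fn split_udt_fields<'a, 'n>(
    mut data: &'a [u8],
    field_names: &[&'n str],
) -> Result<Vec<(&'n str, Option<&'a [u8]>)>, DecodeError> {
    let mut fields = Vec::with_capacity(field_names.len());
    for &name in field_names {
        // Every field starts with a 4-byte big-endian size.
        let (size, rest) = match data {
            [a, b, c, d, rest @ ..] => (i32::from_be_bytes([*a, *b, *c, *d]), rest),
            _ => return Err(DecodeError::NotEnoughBytes),
        };
        if size < 0 {
            // Null field: no payload bytes follow.
            fields.push((name, None));
            data = rest;
            continue;
        }
        let size = size as usize;
        if rest.len() < size {
            return Err(DecodeError::NotEnoughBytes);
        }
        let (value, after) = rest.split_at(size);
        fields.push((name, Some(value)));
        data = after;
    }
    Ok(fields)
}
```

This sketch is strict: input that ends before every schema field has been read is reported as an error.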

5. Frozen vs Non-Frozen

Frozen types:

•Serialized as single blob
•Cannot update individual elements
•Required when a collection or UDT is part of a primary key
•Nested collections must be frozen

Non-frozen collections:

•Can update individual elements
•Only allowed at top level (not nested)
•Use tombstones for deletions
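The nesting rule can be enforced during schema validation. The sketch below uses a hypothetical, trimmed-down `CqlType` and assumes that everything beneath a `Frozen` wrapper counts as frozen; the function names are illustrative only.

```rust
// Hypothetical subset of CqlType, enough to express nesting.
#[derive(Debug, Clone, PartialEq)]
enum CqlType {
    Int,
    Text,
    List(Box<CqlType>),
    Set(Box<CqlType>),
    Map(Box<CqlType>, Box<CqlType>),
    Frozen(Box<CqlType>),
}

/// Returns true if no non-frozen collection appears inside another collection.
fn nesting_is_valid(ty: &CqlType) -> bool {
    check(ty, false)
}

fn check(ty: &CqlType, inside_collection: bool) -> bool {
    match ty {
        CqlType::Int | CqlType::Text => true,
        // Assumption: a frozen wrapper freezes everything beneath it,
        // so nothing below it needs further checking.
        CqlType::Frozen(_) => true,
        // A bare collection is only acceptable at the top level.
        CqlType::List(elem) | CqlType::Set(elem) => !inside_collection && check(elem, true),
        CqlType::Map(key, value) => {
            !inside_collection && check(key, true) && check(value, true)
        }
    }
}
```

With this check, `list<frozen<list<int>>>` passes while `list<list<int>>` is rejected.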

Type Deserialization Patterns

Zero-Copy Pattern

rust

use bytes::Bytes;

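fn deserialize_text(data: Bytes) -> Result<String> {
    // Validate UTF-8 in place; the check itself copies nothing.
    let s = std::str::from_utf8(&data)?;
    // Returning an owned String allocates and copies. To stay zero-copy,
    // return the validated Bytes (or a &str borrowed from the input) instead.
    Ok(s.to_string())
}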

fn deserialize_blob(data: Bytes) -> Result<Bytes> {
    // Zero-copy: just return the slice
    Ok(data)
}
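Where the `bytes` crate is not needed, the same idea works with plain borrowed slices from the standard library. The following is a minimal sketch with hypothetical function names; both functions return views into the caller's buffer and allocate nothing.

```rust
use std::str::Utf8Error;

/// Validate UTF-8 and return a view into the input; no bytes are copied.
fn text_view(data: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(data)
}

/// Split a buffer into two borrowed fields without allocating.
/// Returns None if the first field would run past the end of the buffer.
fn split_fields(buf: &[u8], first_len: usize) -> Option<(&[u8], &[u8])> {
    if first_len > buf.len() {
        return None;
    }
    Some(buf.split_at(first_len))
}
```

The trade-off is lifetime coupling: the returned `&str` and slices cannot outlive the input buffer, whereas `Bytes` shares ownership through reference counting.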

Length-Prefixed Pattern

rust

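fn deserialize_length_prefixed(data: &[u8]) -> Result<(Bytes, &[u8])> {
    if data.len() < 4 {
        return Err(Error::NotEnoughBytes);
    }

    let size = i32::from_be_bytes([data[0], data[1], data[2], data[3]]);
    let body = &data[4..];

    if size < 0 {
        // Null. NOTE: an empty Bytes is indistinguishable from a zero-length
        // value; return Option<Bytes> if callers need to tell them apart.
        return Ok((Bytes::new(), body));
    }

    let size = size as usize;
    // Compare against the remaining length so `4 + size` cannot overflow.
    if body.len() < size {
        return Err(Error::NotEnoughBytes);
    }

    // copy_from_slice copies the payload; to avoid the copy, take the input
    // as Bytes and use Bytes::slice.
    let value = Bytes::copy_from_slice(&body[..size]);
    Ok((value, &body[size..]))
}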

Collection Pattern

rust

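fn deserialize_list(
    data: &[u8],
    element_type: &CqlType,
) -> Result<Vec<CqlValue>> {
    // Guard the count read; indexing a short buffer would panic.
    if data.len() < 4 {
        return Err(Error::NotEnoughBytes);
    }
    let count = i32::from_be_bytes([data[0], data[1], data[2], data[3]]);
    // A negative count is corrupt input; `as usize` would turn it into a huge value.
    if count < 0 {
        return Err(Error::NotEnoughBytes);
    }
    let count = count as usize;
    let mut offset = 4;
    // Each element needs at least a 4-byte length, so cap the pre-allocation.
    let mut elements = Vec::with_capacity(count.min((data.len() - 4) / 4));

    for _ in 0..count {
        let (element_data, remaining) = deserialize_length_prefixed(&data[offset..])?;
        let element = deserialize_value(&element_data, element_type)?;
        elements.push(element);
        offset = data.len() - remaining.len();
    }

    Ok(elements)
}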

Schema Handling

Schema Sources

•Statistics.db: Serialization header with column definitions
•System tables: system_schema.tables, system_schema.columns
•CQL schema file: For test data generation

Schema Representation

rust

struct TableSchema {
    keyspace: String,
    table: String,
    partition_keys: Vec<ColumnDef>,
    clustering_keys: Vec<ColumnDef>,
    regular_columns: Vec<ColumnDef>,
    static_columns: Vec<ColumnDef>,
}

struct ColumnDef {
    name: String,
    cql_type: CqlType,
}

enum CqlType {
    // Primitives
    Boolean,
    Int,
    BigInt,
    Text,
    Uuid,
    Timestamp,
    // ... more primitives
    
    // Collections
    List(Box<CqlType>),
    Set(Box<CqlType>),
    Map(Box<CqlType>, Box<CqlType>),
    
    // Complex
    Tuple(Vec<CqlType>),
    Udt(UdtDef),
    
    // Modifiers
    Frozen(Box<CqlType>),
}
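When the schema comes from system tables or a CQL schema file, column types typically arrive as text and have to be turned into a `CqlType` tree. The parser below is a minimal sketch, assuming type strings in the CQL-style form `map<text, frozen<list<int>>>`. It covers only the primitives named in the enum above, omits UDTs, and all function names are hypothetical.

```rust
#[derive(Debug, Clone, PartialEq)]
enum CqlType {
    Boolean, Int, BigInt, Text, Uuid, Timestamp,
    List(Box<CqlType>),
    Set(Box<CqlType>),
    Map(Box<CqlType>, Box<CqlType>),
    Tuple(Vec<CqlType>),
    Frozen(Box<CqlType>),
}

/// Parse a full type string; None on unknown names or malformed input.
fn parse_cql_type(input: &str) -> Option<CqlType> {
    let (ty, rest) = parse_type(input.trim())?;
    if rest.trim().is_empty() { Some(ty) } else { None }
}

/// Parse one type from the front of `s`, returning it and the unparsed rest.
fn parse_type(s: &str) -> Option<(CqlType, &str)> {
    let s = s.trim_start();
    let end = s.find(|c: char| c == '<' || c == '>' || c == ',').unwrap_or(s.len());
    let name = s[..end].trim().to_ascii_lowercase();
    let rest = &s[end..];
    if !rest.starts_with('<') {
        // No type arguments: must be a known primitive.
        let ty = match name.as_str() {
            "boolean" => CqlType::Boolean,
            "int" => CqlType::Int,
            "bigint" => CqlType::BigInt,
            "text" | "varchar" => CqlType::Text,
            "uuid" => CqlType::Uuid,
            "timestamp" => CqlType::Timestamp,
            _ => return None,
        };
        return Some((ty, rest));
    }
    let (mut args, rest) = parse_args(&rest[1..])?;
    let ty = match (name.as_str(), args.len()) {
        ("list", 1) => CqlType::List(Box::new(args.remove(0))),
        ("set", 1) => CqlType::Set(Box::new(args.remove(0))),
        ("frozen", 1) => CqlType::Frozen(Box::new(args.remove(0))),
        ("map", 2) => {
            let value = args.pop()?;
            let key = args.pop()?;
            CqlType::Map(Box::new(key), Box::new(value))
        }
        ("tuple", n) if n > 0 => CqlType::Tuple(args),
        _ => return None,
    };
    Some((ty, rest))
}

/// Parse comma-separated type arguments up to and including the closing '>'.
fn parse_args(mut s: &str) -> Option<(Vec<CqlType>, &str)> {
    let mut args = Vec::new();
    loop {
        let (ty, rest) = parse_type(s)?;
        args.push(ty);
        let rest = rest.trim_start();
        if let Some(after) = rest.strip_prefix(',') {
            s = after;
        } else if let Some(after) = rest.strip_prefix('>') {
            return Some((args, after));
        } else {
            return None;
        }
    }
}
```

Returning `None` for unknown names keeps the sketch short; a real implementation would likely resolve unrecognized names against the keyspace's UDT definitions and report a descriptive error.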

PRD Alignment

Supports Milestone M1 (Core Reading Library):

•All CQL types including collections & UDTs
•Schema-provided deserialization (not inferred)
•Zero-copy patterns where possible

Supports Milestone M5 (Write Support):

•Type-correct serialization
•Schema validation

Common Pitfalls

1. Inferring Types

❌ Wrong: Look at data to guess type
✅ Right: Use schema to know type

2. Copying Unnecessarily

❌ Wrong: Vec<u8> for every field
✅ Right: Bytes with zero-copy slicing

3. Ignoring Null Handling

❌ Wrong: Assume all fields present
✅ Right: Check for null (-1 size prefix)

4. Frozen Semantics

❌ Wrong: Try to update frozen collection elements
✅ Right: Replace entire frozen value

5. Nested Collections

❌ Wrong: Allow non-frozen nested collections
✅ Right: Nested collections must be frozen

Type System References

Detailed specifications in:

•cql-types-reference.md - Complete type catalog
•collections-and-udts.md - Collection and UDT formats

Testing

Generate type-correct test data:

bash

# Use test-data-management skill for Docker-based generation
cd test-data
./scripts/start-clean.sh
./scripts/generate.sh

Validate parsing against sstabledump:

bash

sstabledump test-data/datasets/sstables/keyspace/table/*.db

Next Steps

When adding new type support:

•Add to CqlType enum
•Implement deserializer with zero-copy where possible
•Add serializer (for M5 write support)
•Create property tests with edge cases
•Generate test data that uses the new type
•Validate against sstabledump
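A round-trip check is one simple shape for the edge-case tests in the list above. The sketch below pairs a serializer and deserializer for `list<int>` using the collection layout described earlier. All function names are hypothetical, and a property-testing crate could generate the inputs instead of the hand-picked cases.

```rust
/// Serialize a list<int>: 4-byte count, then a 4-byte length and payload per element.
fn serialize_int_list(values: &[i32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + values.len() * 8);
    out.extend_from_slice(&(values.len() as i32).to_be_bytes());
    for v in values {
        out.extend_from_slice(&4i32.to_be_bytes());
        out.extend_from_slice(&v.to_be_bytes());
    }
    out
}

/// Read a 4-byte big-endian i32 and return it with the rest of the input.
fn read_i32(data: &[u8]) -> Option<(i32, &[u8])> {
    match data {
        [a, b, c, d, rest @ ..] => Some((i32::from_be_bytes([*a, *b, *c, *d]), rest)),
        _ => None,
    }
}

/// Inverse of `serialize_int_list`; returns None on any malformed input,
/// including trailing bytes after the last element.
fn deserialize_int_list(data: &[u8]) -> Option<Vec<i32>> {
    let (count, mut rest) = read_i32(data)?;
    if count < 0 {
        return None;
    }
    let mut values = Vec::new();
    for _ in 0..count {
        let (len, after_len) = read_i32(rest)?;
        if len != 4 {
            return None;
        }
        let (value, after_value) = read_i32(after_len)?;
        values.push(value);
        rest = after_value;
    }
    if rest.is_empty() { Some(values) } else { None }
}

/// True if serializing and then deserializing returns the original values.
fn round_trips(values: &[i32]) -> bool {
    deserialize_int_list(&serialize_int_list(values)).as_deref() == Some(values)
}
```

Useful edge cases include the empty list, `i32::MIN`, `i32::MAX`, and buffers truncated by a single byte.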