Test Data Generation & Validation

This skill covers generating test data from a real Cassandra 5.0 instance and validating that CQLite parses it correctly.

When to Use This Skill

•Generating test data with specific schemas
•Creating test fixtures for property tests
•Exporting SSTables from Cassandra
•Validating parsed data against sstabledump
•Managing test datasets
•Creating reproducible test scenarios

Overview

CQLite uses real Cassandra 5.0 instances to generate test data, ensuring:

•Format correctness (real Cassandra writes)
•Edge case coverage (nulls, empty values, large values)
•Compression validation (actual compressed SSTables)
•Schema variety (all CQL types)

Test Data Workflow

See dataset-generation.md for complete workflow details.

Quick Start

bash

cd test-data

# 1. Start clean Cassandra 5 with schemas
./scripts/start-clean.sh

# 2. Generate data (N rows per table)
ROWS=1000 ./scripts/generate.sh

# 3. Export SSTables
./scripts/export.sh

# 4. Shutdown and clean volumes
./scripts/shutdown-clean.sh

Generation Scripts

start-clean.sh

Starts a Cassandra 5.0 container and applies the schemas.

What it does:

•Starts cassandra-5-0 container via docker-compose
•Waits for Cassandra to be healthy
•Applies schemas from schemas/core.list
•Verifies keyspaces and tables created
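
The "waits for Cassandra to be healthy" step amounts to polling a readiness check until it passes or a timeout expires. The following is a minimal Python sketch of that idea, not the script's actual logic: `wait_until_healthy` and `cql_port_open` are hypothetical names, and the timeout values are assumptions. Note that an open CQL port (9042) is only a rough signal; a stricter check would run a real query.

```python
import socket
import time
from typing import Callable


def wait_until_healthy(check: Callable[[], bool],
                       timeout_s: float = 120.0,
                       interval_s: float = 2.0) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def cql_port_open(host: str = "127.0.0.1", port: int = 9042) -> bool:
    """Hypothetical readiness check: can a TCP connection be opened to the CQL port?"""
    try:
        with socket.create_connection((host, port), timeout=1.0):
            return True
    except OSError:
        return False
```

A caller would use `wait_until_healthy(cql_port_open)` and abort the run if it returns `False`.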

Environment variables:

•SCHEMA_SET=core - Use curated schema list (default)
•SCHEMA_SET=all - Use all *.cql files

Example:

bash

# Use default core schemas
./scripts/start-clean.sh

# Use all schemas
SCHEMA_SET=all ./scripts/start-clean.sh
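
To illustrate how the two SCHEMA_SET values could map to files, here is a small Python sketch. It assumes core.list holds one schema filename per line (as the "Custom Schemas" example suggests) and that blank lines and `#` comments are skipped; `resolve_schema_files` is a hypothetical helper, not part of the scripts.

```python
from pathlib import Path
from typing import List


def resolve_schema_files(schema_dir: Path, schema_set: str = "core") -> List[Path]:
    """Return the .cql files to apply for the given SCHEMA_SET value."""
    if schema_set == "all":
        # Every *.cql file in the schema directory, in a stable order.
        return sorted(schema_dir.glob("*.cql"))
    if schema_set == "core":
        lines = (schema_dir / "core.list").read_text().splitlines()
        # Assumption: blank lines and '#' comments in core.list are ignored.
        names = [ln.strip() for ln in lines
                 if ln.strip() and not ln.strip().startswith("#")]
        return [schema_dir / name for name in names]
    raise ValueError(f"unknown SCHEMA_SET: {schema_set!r}")
```

Rejecting unknown values up front avoids silently applying no schemas when the variable is misspelled.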

generate.sh

Generates test data using the Python data generator.

What it does:

•Connects to running Cassandra container
•Generates type-correct data for each table
•Inserts rows using prepared statements
•Flushes memtables to SSTables
•Produces metadata.yml with row counts
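
"Type-correct data" means each generated value must be a Python object the driver can bind to the column's CQL type. The sketch below shows one way to do that for a handful of primitive types. It is an illustration only: `random_value` is a hypothetical helper, the set of types and value ranges are assumptions, and the real generator may work differently. Passing a seeded `random.Random` keeps datasets reproducible.

```python
import random
import uuid
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def random_value(cql_type: str, rng: random.Random):
    """Return a Python value suitable for binding to a column of `cql_type`."""
    if cql_type == "int":        # 32-bit signed
        return rng.randint(-2**31, 2**31 - 1)
    if cql_type == "bigint":     # 64-bit signed
        return rng.randint(-2**63, 2**63 - 1)
    if cql_type == "boolean":
        return rng.random() < 0.5
    if cql_type == "text":
        length = rng.randint(0, 16)  # includes the empty string
        return "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(length))
    if cql_type == "uuid":
        # Build a version-4 UUID from the seeded RNG so runs are repeatable.
        return uuid.UUID(int=rng.getrandbits(128), version=4)
    if cql_type == "timestamp":  # millisecond precision, like CQL timestamps
        return EPOCH + timedelta(milliseconds=rng.randint(0, 2_000_000_000_000))
    raise NotImplementedError(f"no generator for CQL type {cql_type!r}")
```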

Environment variables:

•ROWS=N - Rows per table (default: varies by SCALE)
•TABLES=table1,table2 - Generate for specific tables only
•SCALE=SMALL|MEDIUM|LARGE - Preset sizes

Example:

bash

# Generate 1000 rows per table
ROWS=1000 ./scripts/generate.sh

# Generate only for specific tables
TABLES=simple_table,collection_table ROWS=500 ./scripts/generate.sh

# Use LARGE scale preset
SCALE=LARGE ./scripts/generate.sh
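
For readers wiring these variables into their own tooling, the following Python sketch shows one plausible way to resolve them. The preset row counts, the SMALL default, and the rule that an explicit ROWS overrides the preset are all assumptions made for illustration; `GenConfig`, `SCALE_ROWS`, and `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional

# Hypothetical preset sizes; the real values are defined by the generator.
SCALE_ROWS = {"SMALL": 100, "MEDIUM": 10_000, "LARGE": 1_000_000}


@dataclass
class GenConfig:
    rows: int
    tables: Optional[List[str]]  # None means "generate for all tables"
    scale: str


def load_config(env: Mapping[str, str] = os.environ) -> GenConfig:
    """Resolve ROWS, TABLES and SCALE from an environment-like mapping."""
    scale = env.get("SCALE", "SMALL").upper()  # assumed default
    if scale not in SCALE_ROWS:
        raise ValueError(f"SCALE must be one of {sorted(SCALE_ROWS)}, got {scale!r}")
    # Assumption: an explicit ROWS wins over the preset's row count.
    rows = int(env["ROWS"]) if env.get("ROWS") else SCALE_ROWS[scale]
    if rows <= 0:
        raise ValueError("ROWS must be a positive integer")
    tables = [t.strip() for t in env.get("TABLES", "").split(",") if t.strip()]
    return GenConfig(rows=rows, tables=tables or None, scale=scale)
```

Taking the environment as a parameter keeps the function easy to test without mutating `os.environ`.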

export.sh

Exports SSTables from Cassandra data directory.

What it does:

•Stops Cassandra to ensure consistent snapshot
•Copies SSTables from container to datasets/sstables/
•Preserves directory structure (keyspace/table/files)
•Copies metadata.yml
•Creates metadata about dataset

Output structure:

code

test-data/datasets/
├── metadata.yml          # Generated by generate.sh
├── sstables/
│   ├── test_basic/
│   │   └── simple_table/
│   │       ├── *-Data.db
│   │       ├── *-Index.db
│   │       ├── *-Statistics.db
│   │       ├── *-Summary.db
│   │       └── *-TOC.txt
│   ├── test_collections/
│   └── test_timeseries/
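
Test code often needs to discover which tables an export actually contains. Assuming the keyspace/table/files layout shown above, a short Python sketch (with a hypothetical `find_data_files` helper) could index the Data.db files like this:

```python
from pathlib import Path
from typing import Dict, List, Tuple


def find_data_files(sstables_root: Path) -> Dict[Tuple[str, str], List[Path]]:
    """Map (keyspace, table) to the sorted *-Data.db files found beneath it."""
    found: Dict[Tuple[str, str], List[Path]] = {}
    # Layout assumption: <root>/<keyspace>/<table>/<sstable components>
    for data_file in sorted(sstables_root.glob("*/*/*-Data.db")):
        table_dir = data_file.parent
        key = (table_dir.parent.name, table_dir.name)
        found.setdefault(key, []).append(data_file)
    return found
```

An empty result for `find_data_files(Path("test-data/datasets/sstables"))` is a quick signal that the export step copied nothing.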

shutdown-clean.sh

Stops Cassandra and removes Docker volumes.

What it does:

•Stops all containers
•Removes Docker volumes (clean slate)
•Prepares for next generation cycle

Use when:

•Done with current dataset
•Want to regenerate from scratch
•Cleaning up after tests

Test Schemas

Schemas in test-data/schemas/:

basic-types.cql

Simple table with all primitive types:

•Partition key: uuid
•No clustering
•Columns: int, text, timestamp, boolean, etc.

collections.cql

Collection types:

•list<int>
•set<text>
•map<text, int>
•Nested frozen collections

time-series.cql

Time-series pattern:

•Partition key: sensor_id
•Clustering: timestamp (DESC)
•Columns: temperature, humidity, pressure

wide-rows.cql

Wide partition testing:

•Single partition key
•Many clustering rows (1000+)
•Tests pagination and offset handling

Custom Schemas

Add your own:

bash

# Create schema
echo "CREATE TABLE test_keyspace.my_table (...);" > schemas/my-schema.cql

# Add to core.list
echo "my-schema.cql" >> schemas/core.list

# Generate
./scripts/start-clean.sh
./scripts/generate.sh

Validation Workflow

See validation-workflow.md for complete validation process.

Validate Against sstabledump

bash

# 1. Generate sstabledump reference (sstabledump accepts exactly one SSTable,
#    so the glob must match a single Data.db file)
sstabledump test-data/datasets/sstables/keyspace/table/*-Data.db \
    > reference.json

# 2. Parse with cqlite
cargo run --bin cqlite -- \
    --data-dir test-data/datasets/sstables/keyspace/table \
    --schema test-data/schemas/schema.cql \
    --out json > cqlite.json

# 3. Compare (ignoring formatting)
jq -S '.' reference.json > ref-sorted.json
jq -S '.' cqlite.json > cql-sorted.json
diff ref-sorted.json cql-sorted.json
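
When the plain diff is hard to read, a structural comparison can point at the exact field that differs. The Python sketch below is one way to do this under the same premise as the jq step (object key order is ignored, array order is significant). `json_diff` and `compare_files` are hypothetical helpers, and the comparison is deliberately strict: `1` and `1.0` count as different.

```python
import json
from typing import Any, List


def json_diff(a: Any, b: Any, path: str = "$") -> List[str]:
    """List the paths where two parsed JSON values differ (key order ignored)."""
    if type(a) is not type(b):
        return [f"{path}: type {type(a).__name__} != {type(b).__name__}"]
    if isinstance(a, dict):
        out: List[str] = []
        for key in sorted(set(a) | set(b)):
            if key not in a:
                out.append(f"{path}.{key}: missing in first")
            elif key not in b:
                out.append(f"{path}.{key}: missing in second")
            else:
                out.extend(json_diff(a[key], b[key], f"{path}.{key}"))
        return out
    if isinstance(a, list):
        if len(a) != len(b):
            return [f"{path}: length {len(a)} != {len(b)}"]
        out = []
        for i, (x, y) in enumerate(zip(a, b)):
            out.extend(json_diff(x, y, f"{path}[{i}]"))
        return out
    return [] if a == b else [f"{path}: {a!r} != {b!r}"]


def compare_files(first_path: str, second_path: str) -> List[str]:
    """Load two JSON files and return their differences."""
    with open(first_path) as f1, open(second_path) as f2:
        return json_diff(json.load(f1), json.load(f2))
```

Calling `compare_files("reference.json", "cqlite.json")` returns an empty list when the two outputs match.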

Automated Validation

Run the validation test suite:

bash

# Validate all test tables
cargo test --test sstable_validation

# Validate specific table
cargo test --test sstable_validation -- simple_table

Property Testing

Generate random data for property tests:

rust

use proptest::prelude::*;

proptest! {
    #[test]
    fn test_row_parsing_roundtrip(
        partition_key in any::<i32>(),
        text_value in "\\PC*",  // Any string of non-control Unicode characters
        int_value in any::<i32>(),
    ) {
        // Generate test data in Cassandra
        insert_test_row(partition_key, &text_value, int_value)?;
        flush_memtable()?;
        
        // Parse with cqlite
        let parsed = parse_sstable()?;
        
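        // Validate roundtrip; prop_assert_eq! reports a failure as a test-case
        // error instead of panicking inside the proptest body
        prop_assert_eq!(parsed.get_int("partition_key"), partition_key);
        prop_assert_eq!(parsed.get_text("text_col"), text_value);
        prop_assert_eq!(parsed.get_int("int_col"), int_value);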
    }
}

Dataset Packaging

Package datasets for CI or distribution:

bash

# Package current dataset
./scripts/package_datasets.sh

# Output: test-data/cqlite-test-data-v5.0-<date>.tar.gz

Contents:

•All SSTables
•metadata.yml
•Schema files
•README with generation parameters

CI Integration

Smoke Test

Quick validation in CI:

bash

# Use packaged dataset (substitute the archive name produced by package_datasets.sh)
tar xzf cqlite-test-data-v5.0.tar.gz

# Run core tests
./scripts/ci-one-shot-smoke.sh

# Validates:
# - Basic parsing
# - All CQL types
# - Compression
# - Collections

See test-data/scripts/CI_SMOKE_TEST_USAGE.md for details.

Common Scenarios

Scenario 1: Test New CQL Type

bash

# 1. Add column to schema
echo "ALTER TABLE test_basic.simple_table ADD duration_col duration;" \
    >> schemas/basic-types.cql

# 2. Regenerate data
./scripts/start-clean.sh
./scripts/generate.sh
./scripts/export.sh

# 3. Validate parsing
cargo test --test sstable_validation

Scenario 2: Test Large Values

bash

# Generate 100 rows per table using the LARGE scale preset
ROWS=100 SCALE=LARGE ./scripts/generate.sh

# Validates:
# - Large text values (1MB+)
# - Large blob values
# - Large collections (1000+ elements)

Scenario 3: Test Edge Cases

python

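# Modify generate_comprehensive_test_data.py
import uuid

def generate_edge_cases(session):
    # Unprepared statements in the Python driver use %s placeholders ("?" is
    # only valid in prepared statements). "table" is a reserved word in CQL,
    # so use a real keyspace.table name; the columns here are illustrative.

    # Null values: columns that are not set stay null
    session.execute("INSERT INTO test_keyspace.my_table (pk) VALUES (%s)",
                    [uuid.uuid4()])

    # Empty collections (an empty non-frozen collection is stored as null)
    session.execute("INSERT INTO test_keyspace.my_table (pk, tags) VALUES (%s, [])",
                    [uuid.uuid4()])

    # Empty strings
    session.execute("INSERT INTO test_keyspace.my_table (pk, name) VALUES (%s, '')",
                    [uuid.uuid4()])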

PRD Alignment

Supports Milestone M1 (Core Reading Library):

•95% test coverage goal
•All CQL types validated
•Real Cassandra data ensures format correctness

Supports All Milestones:

•Regression testing with frozen datasets
•Property-based testing for edge cases
•CI integration for PR validation

Troubleshooting

Cassandra Won't Start

bash

# Check logs
docker logs cassandra-5-0

# Common issue: Port 9042 in use
lsof -i :9042
# Kill process or change port in docker-compose-cassandra5.yml

Generation Fails

bash

# Check generator logs
cat test-data/logs/data_generation.log

# Verify schema applied
docker exec cassandra-5-0 cqlsh -e "DESCRIBE KEYSPACES;"

Export Produces No Files

bash

# Verify data exists in container
docker exec cassandra-5-0 ls -la /var/lib/cassandra/data/

# Check if flush happened (case-insensitive, including stderr output)
docker logs cassandra-5-0 2>&1 | grep -i flush

Dataset Repository

Packaged datasets available at:

code

https://github.com/pmcfadin/cqlite/releases/tag/test-data-v5.0

Download for:

•CI without Docker
•Reproducible benchmarks
•Offline development

Next Steps

When creating new tests:

•Design schema in schemas/
•Generate data with generate.sh
•Export SSTables with export.sh
•Write parser test
•Validate with sstabledump
•Add to CI smoke test suite

See documentation:

•dataset-generation.md - Full workflow
•validation-workflow.md - Validation process