Test Data Generation & Validation
This skill provides guidance on generating real Cassandra 5.0 test data and validating parsing correctness.
When to Use This Skill
- Generating test data with specific schemas
- Creating test fixtures for property tests
- Exporting SSTables from Cassandra
- Validating parsed data against sstabledump
- Managing test datasets
- Creating reproducible test scenarios
Overview
CQLite uses real Cassandra 5.0 instances to generate test data, ensuring:
- Format correctness (real Cassandra writes)
- Edge case coverage (nulls, empty values, large values)
- Compression validation (actual compressed SSTables)
- Schema variety (all CQL types)
Test Data Workflow
See dataset-generation.md for complete workflow details.
Quick Start
cd test-data
# 1. Start clean Cassandra 5 with schemas
./scripts/start-clean.sh
# 2. Generate data (N rows per table)
ROWS=1000 ./scripts/generate.sh
# 3. Export SSTables
./scripts/export.sh
# 4. Shutdown and clean volumes
./scripts/shutdown-clean.sh
Generation Scripts
start-clean.sh
Starts the Cassandra 5.0 container and applies the schemas.
What it does:
- Starts the cassandra-5-0 container via docker-compose
- Waits for Cassandra to be healthy
- Applies schemas from schemas/core.list
- Verifies keyspaces and tables created
Environment variables:
- SCHEMA_SET=core - Use curated schema list (default)
- SCHEMA_SET=all - Use all *.cql files
Example:
# Use default core schemas
./scripts/start-clean.sh
# Use all schemas
SCHEMA_SET=all ./scripts/start-clean.sh
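The SCHEMA_SET selection described above could be sketched as follows. This is a minimal illustration, not the script's actual logic: the function name resolve_schema_files is hypothetical, and the core.list format (one file name per line, blank lines and # comments skipped) is an assumption.

```python
from pathlib import Path


def resolve_schema_files(schema_dir, schema_set="core"):
    """Return the .cql files selected by SCHEMA_SET (hypothetical helper)."""
    schema_dir = Path(schema_dir)
    if schema_set == "all":
        # Every *.cql file in the directory, in a stable sorted order.
        return sorted(schema_dir.glob("*.cql"))
    if schema_set == "core":
        # Assumed format: one file name per line; skip blanks and '#' comments.
        lines = (schema_dir / "core.list").read_text().splitlines()
        names = [ln.strip() for ln in lines
                 if ln.strip() and not ln.strip().startswith("#")]
        return [schema_dir / name for name in names]
    raise ValueError(f"unknown SCHEMA_SET: {schema_set!r}")
```

Calling resolve_schema_files("schemas") would return the curated list, while resolve_schema_files("schemas", "all") would pick up every schema file, including ones not yet listed in core.list.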
generate.sh
Generates test data using the Python data generator.
What it does:
- Connects to running Cassandra container
- Generates type-correct data for each table
- Inserts rows using prepared statements
- Flushes memtables to SSTables
- Produces metadata.yml with row counts
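To illustrate what "type-correct data" means in the steps above, here is a minimal sketch that maps a few CQL type names to Python value generators. The GENERATORS table, the generate_row helper, and the chosen value ranges are all assumptions made for illustration; the real generator may cover more types and work differently.

```python
import random
import string
import uuid
from datetime import datetime, timedelta, timezone

BASE_TIME = datetime(2020, 1, 1, tzinfo=timezone.utc)

# Hypothetical mapping: CQL type name -> function producing a matching Python value.
GENERATORS = {
    "int": lambda rng: rng.randint(-2**31, 2**31 - 1),      # 32-bit signed
    "bigint": lambda rng: rng.randint(-2**63, 2**63 - 1),   # 64-bit signed
    "boolean": lambda rng: rng.random() < 0.5,
    "text": lambda rng: "".join(
        rng.choices(string.ascii_letters, k=rng.randint(0, 16))),
    "uuid": lambda rng: uuid.UUID(int=rng.getrandbits(128), version=4),
    "timestamp": lambda rng: BASE_TIME + timedelta(seconds=rng.randint(0, 10**8)),
}


def generate_row(columns, rng):
    """Build one row as {column name: value} from {column name: CQL type}."""
    return {name: GENERATORS[cql_type](rng) for name, cql_type in columns.items()}


rng = random.Random(42)  # a fixed seed keeps generated rows reproducible
row = generate_row({"id": "uuid", "count": "int", "name": "text"}, rng)
print(row)
```

Seeding the random generator is one way to get the reproducible scenarios this skill aims for: the same seed and schema yield the same rows.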
Environment variables:
- ROWS=N - Rows per table (default: varies by SCALE)
- TABLES=table1,table2 - Generate for specific tables only
- SCALE=SMALL|MEDIUM|LARGE - Preset sizes
Example:
# Generate 1000 rows per table
ROWS=1000 ./scripts/generate.sh
# Generate only for specific tables
TABLES=simple_table,collection_table ROWS=500 ./scripts/generate.sh
# Use LARGE scale preset
SCALE=LARGE ./scripts/generate.sh
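One plausible way these variables interact is sketched below, assuming an explicit ROWS overrides the SCALE preset. The function name, the default scale, and the preset row counts are placeholders rather than the generator's real values.

```python
import os

# Placeholder preset sizes (assumption); the real values may differ.
SCALE_PRESETS = {"SMALL": 100, "MEDIUM": 10_000, "LARGE": 1_000_000}


def resolve_generation_config(env=None):
    """Resolve ROWS, TABLES, and SCALE from an environment mapping."""
    env = os.environ if env is None else env
    scale = env.get("SCALE", "SMALL").upper()  # default scale is an assumption
    if scale not in SCALE_PRESETS:
        raise ValueError(f"SCALE must be one of {sorted(SCALE_PRESETS)}")
    # Assumed precedence: an explicit ROWS wins over the preset row count.
    rows = int(env["ROWS"]) if env.get("ROWS") else SCALE_PRESETS[scale]
    if rows <= 0:
        raise ValueError("ROWS must be a positive integer")
    # TABLES is a comma-separated list; None means "all tables".
    tables = [t.strip() for t in env.get("TABLES", "").split(",") if t.strip()]
    return {"rows": rows, "scale": scale, "tables": tables or None}
```

Passing a plain dict instead of reading os.environ directly keeps the logic easy to unit test.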
export.sh
Exports SSTables from Cassandra data directory.
What it does:
- Stops Cassandra to ensure consistent snapshot
- Copies SSTables from container to datasets/sstables/
- Preserves directory structure (keyspace/table/files)
- Copies metadata.yml
- Creates metadata about dataset
Output structure:
test-data/datasets/
├── metadata.yml          # Generated by generate.sh
├── sstables/
│   ├── test_basic/
│   │   └── simple_table/
│   │       ├── *-Data.db
│   │       ├── *-Index.db
│   │       ├── *-Statistics.db
│   │       ├── *-Summary.db
│   │       └── *-TOC.txt
│   ├── test_collections/
│   └── test_timeseries/
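A quick sanity check on an exported dataset is to confirm that each table directory contains the components listed in the structure above. The sketch below assumes the keyspace/table/files layout shown and that the component name is whatever follows the last hyphen in the file name; inventory_sstables and missing_components are hypothetical helpers.

```python
from collections import defaultdict
from pathlib import Path

# Components taken from the output structure above.
REQUIRED = {"Data.db", "Index.db", "Statistics.db", "Summary.db", "TOC.txt"}


def inventory_sstables(root):
    """Map (keyspace, table) to the set of component names found on disk."""
    found = defaultdict(set)
    for path in Path(root).glob("*/*/*"):
        if not path.is_file():
            continue
        keyspace, table = path.parts[-3], path.parts[-2]
        # Component is the text after the last '-', e.g. '...-Data.db' -> 'Data.db'.
        found[(keyspace, table)].add(path.name.rsplit("-", 1)[-1])
    return dict(found)


def missing_components(root):
    """Return only the tables that lack one or more required components."""
    return {key: REQUIRED - comps
            for key, comps in inventory_sstables(root).items()
            if REQUIRED - comps}
```

An empty result from missing_components("test-data/datasets/sstables") would indicate that every exported table has the full component set.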
shutdown-clean.sh
Stops Cassandra and removes Docker volumes.
What it does:
- Stops all containers
- Removes Docker volumes (clean slate)
- Prepares for next generation cycle
Use when:
- Done with current dataset
- Want to regenerate from scratch
- Cleaning up after tests
Test Schemas
Schemas in test-data/schemas/:
basic-types.cql
Simple table with all primitive types:
- Partition key: uuid
- No clustering
- Columns: int, text, timestamp, boolean, etc.
collections.cql
Collection types:
- list<int>
- set<text>
- map<text, int>
- Nested frozen collections
time-series.cql
Time-series pattern:
- Partition key: sensor_id
- Clustering: timestamp (DESC)
- Columns: temperature, humidity, pressure
wide-rows.cql
Wide partition testing:
- Single partition key
- Many clustering rows (1000+)
- Tests pagination and offset handling
Custom Schemas
Add your own:
# Create schema
echo "CREATE TABLE test_keyspace.my_table (...);" > schemas/my-schema.cql
# Add to core.list
echo "my-schema.cql" >> schemas/core.list
# Generate
./scripts/start-clean.sh
./scripts/generate.sh
Validation Workflow
See validation-workflow.md for complete validation process.
Validate Against sstabledump
# 1. Generate sstabledump reference
sstabledump test-data/datasets/sstables/keyspace/table/*-Data.db \
> reference.json
# 2. Parse with cqlite
cargo run --bin cqlite -- \
--data-dir test-data/datasets/sstables/keyspace/table \
--schema test-data/schemas/schema.cql \
--out json > cqlite.json
# 3. Compare (ignoring formatting)
jq -S '.' reference.json > ref-sorted.json
jq -S '.' cqlite.json > cql-sorted.json
diff ref-sorted.json cql-sorted.json
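Note that jq -S sorts object keys but leaves array order untouched, so the diff can report spurious differences if the two tools emit entries in a different order. A minimal order-insensitive comparison is sketched below; it assumes both files hold a top-level JSON array (or a single object), and the helper names are hypothetical.

```python
import json
from collections import Counter


def canonical(value):
    """Serialize with sorted keys so formatting and key order do not matter."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


def diff_documents(reference, candidate):
    """Return (missing, extra): entries only in reference / only in candidate.

    Top-level arrays are compared as multisets, so entry order is ignored
    while duplicate entries are still counted.
    """
    ref_items = reference if isinstance(reference, list) else [reference]
    cand_items = candidate if isinstance(candidate, list) else [candidate]
    ref = Counter(canonical(item) for item in ref_items)
    cand = Counter(canonical(item) for item in cand_items)
    return sorted((ref - cand).elements()), sorted((cand - ref).elements())


def compare_files(reference_path, candidate_path):
    """Load two JSON files and diff them with diff_documents."""
    with open(reference_path) as ref_file, open(candidate_path) as cand_file:
        return diff_documents(json.load(ref_file), json.load(cand_file))
```

With the files produced above, compare_files("reference.json", "cqlite.json") returning two empty lists would mean the outputs agree. Order inside nested arrays is still significant in this sketch.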
Automated Validation
Run the automated validation tests:
# Validate all test tables
cargo test --test sstable_validation
# Validate specific table
cargo test --test sstable_validation -- simple_table
Property Testing
Generate random data for property tests:
use proptest::prelude::*;
proptest! {
    #[test]
    fn test_row_parsing_roundtrip(
        partition_key in any::<i32>(),
        text_value in "\\PC*", // Any string of non-control Unicode characters
        int_value in any::<i32>(),
    ) {
        // insert_test_row, flush_memtable, and parse_sstable are placeholder
        // helpers, assumed here to return Result<_, TestCaseError>.
        // Generate test data in Cassandra
        insert_test_row(partition_key, &text_value, int_value)?;
        flush_memtable()?;
        // Parse with cqlite
        let parsed = parse_sstable()?;
        // Validate roundtrip; prop_assert_eq! reports failures to proptest
        // as errors instead of panicking
        prop_assert_eq!(parsed.get_int("partition_key"), partition_key);
        prop_assert_eq!(parsed.get_text("text_col"), text_value);
        prop_assert_eq!(parsed.get_int("int_col"), int_value);
    }
}
Dataset Packaging
Package datasets for CI or distribution:
# Package current dataset
./scripts/package_datasets.sh
# Output: test-data/cqlite-test-data-v5.0-<date>.tar.gz
Contents:
- All SSTables
- metadata.yml
- Schema files
- README with generation parameters
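For readers who want to see roughly what such a packaging step involves, here is a minimal sketch using Python's standard tarfile module. It is not the contents of package_datasets.sh: the package_dataset function is hypothetical, and the caller supplies the date stamp because the exact format used in the archive name is not specified here.

```python
import tarfile
from pathlib import Path


def package_dataset(dataset_dir, out_dir, stamp):
    """Create cqlite-test-data-v5.0-<stamp>.tar.gz from dataset_dir.

    `stamp` is supplied by the caller; its format is an assumption.
    """
    dataset_dir, out_dir = Path(dataset_dir), Path(out_dir)
    archive = out_dir / f"cqlite-test-data-v5.0-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname makes the archive unpack into a single top-level folder
        # instead of reproducing the absolute path of dataset_dir.
        tar.add(dataset_dir, arcname=dataset_dir.name)
    return archive
```

A call such as package_dataset("test-data/datasets", "test-data", "2024-01-01") would write the archive next to the dataset directory; the stamp value here is only an example.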
CI Integration
Smoke Test
Quick validation in CI:
# Use packaged dataset
tar xzf cqlite-test-data-v5.0.tar.gz
# Run core tests
./scripts/ci-one-shot-smoke.sh
# Validates:
# - Basic parsing
# - All CQL types
# - Compression
# - Collections
See test-data/scripts/CI_SMOKE_TEST_USAGE.md for details.
Common Scenarios
Scenario 1: Test New CQL Type
# 1. Add column to schema
echo "ALTER TABLE test_basic.simple_table ADD duration_col duration;" \
>> schemas/basic-types.cql
# 2. Regenerate data
./scripts/start-clean.sh
./scripts/generate.sh
./scripts/export.sh
# 3. Validate parsing
cargo test --test sstable_validation
Scenario 2: Test Large Values
# Generate with specific row size
ROWS=100 SCALE=LARGE ./scripts/generate.sh
# Validates:
# - Large text values (1MB+)
# - Large blob values
# - Large collections (1000+ elements)
Scenario 3: Test Edge Cases
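# Modify generate_comprehensive_test_data.py
import uuid

def generate_edge_cases(session):
    # Simple (non-prepared) statements in the Python driver take %s
    # placeholders; "table" stands in for a real table name.
    # Null values (columns left out of the INSERT are null)
    session.execute("INSERT INTO table (pk) VALUES (%s)", [uuid.uuid4()])
    # Empty collections
    session.execute("INSERT INTO table (pk, tags) VALUES (%s, [])",
                    [uuid.uuid4()])
    # Empty strings
    session.execute("INSERT INTO table (pk, name) VALUES (%s, '')",
                    [uuid.uuid4()])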
PRD Alignment
Supports Milestone M1 (Core Reading Library):
- 95% test coverage goal
- All CQL types validated
- Real Cassandra data ensures format correctness
Supports All Milestones:
- Regression testing with frozen datasets
- Property-based testing for edge cases
- CI integration for PR validation
Troubleshooting
Cassandra Won't Start
# Check logs
docker logs cassandra-5-0
# Common issue: Port 9042 in use
lsof -i :9042
# Kill process or change port in docker-compose-cassandra5.yml
Generation Fails
# Check generator logs
cat test-data/logs/data_generation.log
# Verify schema applied
docker exec cassandra-5-0 cqlsh -e "DESCRIBE KEYSPACES;"
Export Produces No Files
# Verify data exists in container
docker exec cassandra-5-0 ls -la /var/lib/cassandra/data/
# Check if flush happened
docker logs cassandra-5-0 | grep flush
Dataset Repository
Packaged datasets available at:
https://github.com/pmcfadin/cqlite/releases/tag/test-data-v5.0
Download for:
- CI without Docker
- Reproducible benchmarks
- Offline development
Next Steps
When creating new tests:
- Design schema in schemas/
- Generate data with generate.sh
- Export SSTables with export.sh
- Write parser test
- Validate with sstabledump
- Add to CI smoke test suite
See documentation:
- dataset-generation.md - Full workflow
- validation-workflow.md - Validation process