Designing Data-Intensive Applications Skill

Reference for distributed systems and data architecture concepts from Martin Kleppmann's "Designing Data-Intensive Applications."

Activation Triggers

Use this skill when discussing:

  • Database selection and data modeling
  • Replication and high availability
  • Partitioning/sharding strategies
  • Distributed transactions
  • Consistency models and guarantees
  • Stream vs batch processing
  • Event sourcing and CQRS

Quick Reference

Data Models

| Model | Best For | Trade-offs |
| --- | --- | --- |
| Relational | Complex queries, joins, ACID | Schema rigidity, scaling writes |
| Document | Hierarchical data, flexibility | Poor joins, denormalization |
| Graph | Highly connected data | Specialized queries, complexity |
| Wide-Column | Time series, analytics | Limited query patterns |
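
The relational and document rows can be contrasted with a small sketch. It is only an illustration using plain Python structures; the names `users`, `positions`, `user_doc`, and both helper functions are hypothetical and not tied to any database API.

```python
from typing import Dict, List

# Relational shape: two "tables" linked by user_id; reading titles needs a join-like lookup.
users: List[Dict] = [{"user_id": 1, "name": "Ada"}]
positions: List[Dict] = [
    {"user_id": 1, "title": "Engineer"},
    {"user_id": 1, "title": "Analyst"},
]


def titles_relational(user_id: int) -> List[str]:
    """Collect titles by matching rows on user_id (a hand-written join)."""
    return [p["title"] for p in positions if p["user_id"] == user_id]


# Document shape: the hierarchy is stored together, so one lookup returns everything.
user_doc: Dict = {
    "user_id": 1,
    "name": "Ada",
    "positions": [{"title": "Engineer"}, {"title": "Analyst"}],
}


def titles_document(doc: Dict) -> List[str]:
    """Read titles straight from the nested list inside the document."""
    return [p["title"] for p in doc["positions"]]
```

Both functions return the same titles. The difference is where the relationship lives: in a shared key that must be joined, or in the nesting of a single record.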

Storage Engines

| Engine | Optimized For | Examples |
| --- | --- | --- |
| B-Tree | Read-heavy, random access | PostgreSQL, MySQL |
| LSM-Tree | Write-heavy, sequential | Cassandra, RocksDB, LevelDB |
| Column Store | Analytics, aggregations | ClickHouse, Parquet |
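
To make the LSM-Tree row more concrete, here is a toy sketch of the write path: writes go to an in-memory table, which is flushed to an immutable sorted segment once it reaches a size limit, and reads check the newest data first. The class name `TinyLSM` and the `memtable_limit` parameter are assumptions for illustration; real engines add a write-ahead log, compaction, and indexes, none of which are modeled here.

```python
from typing import Dict, List, Optional, Tuple


class TinyLSM:
    """Toy LSM-style store: in-memory memtable plus sorted, immutable segments."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: Dict[str, str] = {}
        # Segments are ordered oldest first; each one is a list sorted by key.
        self.segments: List[List[Tuple[str, str]]] = []
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        """Writes only touch memory until the memtable fills up."""
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        """Write the memtable out as one sorted segment (a sequential write on disk)."""
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key: str) -> Optional[str]:
        """Check the memtable first, then segments from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:  # linear scan keeps the sketch short
                if k == key:
                    return v
        return None
```

For example, after `put("a", "1")`, `put("b", "2")`, and `put("a", "3")` with the default limit, `get("a")` returns `"3"` from the memtable while the older value remains in the first segment.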

Replication Strategies

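| Strategy | Consistency | Availability | Use Case |
| --- | --- | --- | --- |
| Single Leader | Strong when reading from the leader; followers may lag | Medium | Traditional RDBMS |
| Multi-Leader | Eventual | High | Multi-datacenter |
| Leaderless | Eventual | Highest | High availability |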
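
The single-leader row can be illustrated with a minimal sketch of asynchronous replication, where a follower serves stale data until it applies the leader's log. The `Leader` and `Follower` classes and the explicit `replicate()` call are hypothetical simplifications; in a real system replication runs continuously in the background.

```python
from typing import Dict, List, Optional, Tuple


class Follower:
    """Applies entries from the leader's log; may lag behind."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.applied = 0  # number of log entries applied so far

    def catch_up(self, log: List[Tuple[str, str]]) -> None:
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)

    def read(self, key: str) -> Optional[str]:
        return self.data.get(key)


class Leader:
    """Accepts all writes and records them in an ordered replication log."""

    def __init__(self, followers: List[Follower]) -> None:
        self.data: Dict[str, str] = {}
        self.log: List[Tuple[str, str]] = []
        self.followers = followers

    def write(self, key: str, value: str) -> None:
        self.data[key] = value
        self.log.append((key, value))  # followers apply this later (asynchronous)

    def read(self, key: str) -> Optional[str]:
        return self.data.get(key)

    def replicate(self) -> None:
        """Stand-in for background replication: ship the log to every follower."""
        for follower in self.followers:
            follower.catch_up(self.log)
```

After a write, `leader.read` returns the new value immediately, while `follower.read` returns `None` until `replicate()` has run. That gap is the replication lag behind the consistency caveat for single-leader setups.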

Partitioning Strategies

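| Strategy | Description | Pros | Cons |
| --- | --- | --- | --- |
| Range | Partition by key ranges | Efficient range queries | Hot spots when access is skewed toward one range |
| Hash | Partition by hash of key | Even distribution | Inefficient range queries (must query all partitions) |
| Composite | Combine range + hash | Balanced | Complexity |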
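
The three strategies in the table can be sketched as small routing functions. This is a minimal illustration, assuming string keys, the hypothetical boundaries in `RANGE_BOUNDS`, and a fixed partition count; it ignores rebalancing entirely.

```python
import bisect
import hashlib
from typing import Tuple

# Hypothetical split points: keys below "g" go to partition 0, below "n" to 1,
# below "t" to 2, and everything else to partition 3.
RANGE_BOUNDS = ["g", "n", "t"]
NUM_HASH_PARTITIONS = 4


def range_partition(key: str) -> int:
    """Route by sort order, so adjacent keys land in the same partition."""
    return bisect.bisect_right(RANGE_BOUNDS, key)


def hash_partition(key: str) -> int:
    """Route by a stable hash, so placement is unrelated to sort order."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_HASH_PARTITIONS


def composite_partition(partition_key: str, sort_key: str) -> Tuple[int, str]:
    """Hash the first part to pick a partition; keep the second part for ordering within it."""
    return hash_partition(partition_key), sort_key
```

With these boundaries, `range_partition("apple")` is `0` and `range_partition("zebra")` is `3`, so a scan over nearby keys touches one partition. With `hash_partition`, neighbouring keys are placed independently of sort order, which spreads load but forces range scans to query every partition.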

Consistency Models

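| Model | Guarantee | Performance |
| --- | --- | --- |
| Linearizable | Strongest: behaves as if there were a single copy of the data, and every read sees the most recent write | Slowest |
| Sequential | All clients observe one total order that respects each client's own operation order | Medium |
| Causal | Cause-effect order is preserved; concurrent operations may be seen in different orders | Good |
| Eventual | Replicas converge once writes stop | Fastest |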

Transaction Isolation Levels

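| Level | Dirty Read | Non-Repeatable Read | Phantom Read |
| --- | --- | --- | --- |
| Read Uncommitted | Possible | Possible | Possible |
| Read Committed | Prevented | Possible | Possible |
| Repeatable Read | Prevented | Prevented | Possible |
| Serializable | Prevented | Prevented | Prevented |

These are the guarantees as defined by the SQL standard; actual behavior varies between databases.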

CAP Theorem

"In the presence of a network partition, choose Consistency OR Availability."

| Choice | Behavior | Examples |
| --- | --- | --- |
| CP | Reject requests when consistency cannot be guaranteed | ZooKeeper, HBase |
| AP | Accept requests, allow inconsistency | Cassandra, DynamoDB |
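
The CP and AP behaviors in the table can be sketched with a single toy replica. The `Replica` class, its `mode` argument, and the `partitioned` flag are hypothetical names for illustration; the sketch models only the accept-or-reject decision, not how replicas reconcile afterwards.

```python
from typing import Optional


class Replica:
    """Toy replica that behaves as CP or AP while cut off from its peers."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value: Optional[str] = None
        self.partitioned = False  # True when this replica cannot reach its peers

    def _check_available(self) -> None:
        # CP: refuse to answer rather than risk an inconsistent result.
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot coordinate with peers")

    def write(self, value: str) -> str:
        self._check_available()
        self.value = value  # AP: accepted locally, to be reconciled after the partition heals
        return "ok"

    def read(self) -> Optional[str]:
        self._check_available()
        return self.value  # AP: may be stale during a partition
```

Setting `partitioned = True` on a `Replica("CP")` makes both `read` and `write` raise, while a `Replica("AP")` keeps answering with whatever it has locally.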

Batch vs Stream Processing

| Aspect | Batch | Stream |
| --- | --- | --- |
| Latency | High (hours/days) | Low (seconds/minutes) |
| Data | Bounded, complete | Unbounded, continuous |
| Frameworks | MapReduce, Spark | Kafka Streams, Flink, Storm |
| Use Case | Analytics, ETL | Real-time alerts, dashboards |
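
The difference between bounded and unbounded input can be shown with the same counting job written both ways. This is a minimal sketch, assuming events are `(timestamp_seconds, kind)` tuples that arrive in timestamp order; the function names are hypothetical, and the stream version uses fixed-size (tumbling) windows with no handling of late events.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, event kind); a hypothetical shape


def batch_counts(events: List[Event]) -> Dict[str, int]:
    """Batch: the whole bounded input is available, so compute one final answer."""
    return dict(Counter(kind for _, kind in events))


def stream_counts(
    events: Iterable[Event], window_seconds: int
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Stream: emit counts per tumbling window as soon as each window closes."""
    current_window: Optional[int] = None
    counts: Counter = Counter()
    for ts, kind in events:
        window = ts - ts % window_seconds  # start time of the window this event falls in
        if current_window is not None and window != current_window:
            yield current_window, dict(counts)
            counts = Counter()
        current_window = window
        counts[kind] += 1
    if current_window is not None:
        yield current_window, dict(counts)  # flush the last open window
```

The batch function produces a single result after reading everything. The stream function yields a partial result each time a window closes, which is what keeps latency low on input that never ends.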

Directory Structure

```
ddia/
├── SKILL.md
├── data-models/
│   ├── relational.md
│   ├── document.md
│   └── graph.md
├── storage/
│   ├── b-trees.md
│   ├── lsm-trees.md
│   └── column-storage.md
├── replication/
│   ├── leader-follower.md
│   ├── multi-leader.md
│   └── leaderless.md
├── partitioning/
│   ├── strategies.md
│   └── rebalancing.md
├── transactions/
│   ├── acid.md
│   ├── isolation-levels.md
│   └── distributed-transactions.md
├── consistency/
│   ├── models.md
│   └── linearizability.md
├── consensus/
│   └── algorithms.md
└── processing/
    ├── batch.md
    ├── stream.md
    └── event-sourcing.md
```

Usage Examples

Choosing a Database

```
Question: "Should I use PostgreSQL or MongoDB?"

Consider:
- Data relationships → See data-models/
- Query patterns → See storage/
- Scale requirements → See partitioning/
- Consistency needs → See consistency/
```

Designing for Scale

```
Question: "How do I handle millions of users?"

Consider:
- Read scaling → See replication/leader-follower.md
- Write scaling → See partitioning/strategies.md
- Geographic distribution → See replication/multi-leader.md
```

Handling Failures

```
Question: "What happens when a node fails?"

Consider:
- Data durability → See replication/
- Consistency trade-offs → See consistency/models.md
- Recovery → See consensus/algorithms.md
```

Based on concepts from "Designing Data-Intensive Applications" by Martin Kleppmann.