NoSQL Data Modeling

Database Selection Table

Factor	MongoDB	DynamoDB	Redis	Firestore
Data model	Document (JSON)	Key-value + document	Key-value + data structures	Document (nested)
Query flexibility	High (ad-hoc queries)	Low (key-based only)	Low (key-based)	Medium (indexed fields)
Scale model	Sharded clusters	Fully managed, infinite	Single-node or cluster	Fully managed
Consistency	Tunable (strong or eventual)	Tunable per-request	Strong (single node)	Strong within entity group
Cost model	Self-host or Atlas	Pay per RCU/WCU	Memory-based	Pay per read/write ops
Best for	General purpose, flexible schemas	Predictable high-scale workloads	Caching, sessions, leaderboards	Mobile/web apps, real-time sync
Avoid when	Need ACID joins	Need ad-hoc queries	Data > memory	Need complex queries
Max item size	16 MB	400 KB	512 MB (value)	1 MB

Access-Pattern-Driven Design

NoSQL design is backwards from relational. Start with queries, not entities.

Step-by-Step Process

code

1. List ALL access patterns (reads and writes)
2. Estimate frequency and latency requirements per pattern
3. Choose primary key to satisfy the most critical patterns
4. Design secondary indexes for remaining patterns
5. Denormalize data to avoid joins
6. Accept data duplication as a tradeoff for read performance

Example: E-Commerce

code

Access Patterns:
  1. Get order by order_id                    (100k/day, <10ms)
  2. Get all orders for a user                (50k/day, <50ms)
  3. Get all orders in date range for a user  (10k/day, <100ms)
  4. Get order items for an order             (100k/day, <10ms)
  5. Get user profile                         (200k/day, <10ms)

Design decisions:
  - Pattern 1,4: embed items IN the order document (no join needed)
  - Pattern 2,3: user_id as partition key, order_date as sort key
  - Pattern 5: separate user collection/table
  - Denormalize: store user_name in order (avoid lookup for display)

MongoDB Patterns

Embed vs Reference Decision

Factor	Embed	Reference
Read together?	Always	Sometimes
Array growth	Bounded (<100)	Unbounded
Update frequency	Low	High (independent updates)
Document size	Fits in 16MB	Would exceed limit
Data duplication OK?	Yes	No (single source of truth)

Schema Design Patterns

python

from pymongo import MongoClient
from datetime import datetime

db = MongoClient()["ecommerce"]

# Pattern 1: Embedded (1:few, always read together)
order = {
    "_id": "ord_abc123",
    "user_id": "usr_456",
    "user_name": "Alice",          # Denormalized from users collection
    "created_at": datetime.utcnow(),
    "status": "shipped",
    "items": [                     # Embedded -- always fetched with order
        {"sku": "WIDGET-1", "name": "Blue Widget", "qty": 2, "price": 9.99},
        {"sku": "GADGET-3", "name": "Red Gadget", "qty": 1, "price": 24.99},
    ],
    "total": 44.97,
}
db.orders.insert_one(order)

# Pattern 2: Reference (1:many, unbounded growth)
# Blog post with comments -- comments can grow to thousands
post = {
    "_id": "post_789",
    "title": "NoSQL Modeling",
    "body": "...",
    "comment_count": 0,  # Cached count to avoid counting query
}

comment = {
    "_id": "cmt_001",
    "post_id": "post_789",    # Reference to parent
    "author": "Bob",
    "text": "Great post!",
    "created_at": datetime.utcnow(),
}

# Pattern 3: Bucket pattern (time-series, IoT)
# Instead of one doc per measurement, bucket by hour
sensor_bucket = {
    "_id": "sensor_1_2024010112",  # sensor_id + YYYYMMDDHH
    "sensor_id": "sensor_1",
    "start": datetime(2024, 1, 1, 12, 0),
    "count": 60,
    "measurements": [
        {"ts": datetime(2024, 1, 1, 12, 0, 0), "temp": 22.1, "humidity": 45},
        {"ts": datetime(2024, 1, 1, 12, 1, 0), "temp": 22.3, "humidity": 44},
        # ... up to 60 per hour
    ],
    "avg_temp": 22.2,  # Pre-computed aggregates
    "max_temp": 23.1,
}

MongoDB Indexes

python

# Compound index for user orders by date (covers patterns 2 and 3)
db.orders.create_index([("user_id", 1), ("created_at", -1)])

# Text index for search
db.posts.create_index([("title", "text"), ("body", "text")])

# TTL index for auto-expiring documents
db.sessions.create_index("expires_at", expireAfterSeconds=0)

# Partial index (only index active orders -- saves space)
db.orders.create_index(
    [("user_id", 1), ("created_at", -1)],
    partialFilterExpression={"status": {"$ne": "cancelled"}},
)

DynamoDB Patterns

Single-Table Design

python

import boto3
from datetime import datetime

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("app-data")

# All entities in ONE table with overloaded PK/SK

# User entity
table.put_item(Item={
    "PK": "USER#usr_456",
    "SK": "PROFILE",
    "name": "Alice",
    "email": "alice@example.com",
    "created_at": "2024-01-01T00:00:00Z",
    "entity_type": "User",
})

# Order entity (under user partition)
table.put_item(Item={
    "PK": "USER#usr_456",
    "SK": "ORDER#2024-01-15#ord_abc123",  # Date prefix for range queries
    "order_id": "ord_abc123",
    "status": "shipped",
    "total": "44.97",
    "entity_type": "Order",
})

# Order items (under order partition for direct lookup)
table.put_item(Item={
    "PK": "ORDER#ord_abc123",
    "SK": "ITEM#WIDGET-1",
    "sku": "WIDGET-1",
    "name": "Blue Widget",
    "qty": 2,
    "price": "9.99",
    "entity_type": "OrderItem",
})

# Query: Get user profile
resp = table.get_item(Key={"PK": "USER#usr_456", "SK": "PROFILE"})

# Query: Get all orders for user (sorted by date)
resp = table.query(
    KeyConditionExpression="PK = :pk AND begins_with(SK, :sk)",
    ExpressionAttributeValues={":pk": "USER#usr_456", ":sk": "ORDER#"},
    ScanIndexForward=False,  # Newest first
)

# Query: Get orders in date range
resp = table.query(
    KeyConditionExpression="PK = :pk AND SK BETWEEN :start AND :end",
    ExpressionAttributeValues={
        ":pk": "USER#usr_456",
        ":start": "ORDER#2024-01-01",
        ":end": "ORDER#2024-01-31",
    },
)

GSI Overloading

python

# GSI1: Inverted index (access order by order_id directly)
# Main table: PK=USER#id, SK=ORDER#date#id
# GSI1:       PK=ORDER#id, SK=USER#id
table.put_item(Item={
    "PK": "USER#usr_456",
    "SK": "ORDER#2024-01-15#ord_abc123",
    "GSI1PK": "ORDER#ord_abc123",     # GSI partition key
    "GSI1SK": "USER#usr_456",          # GSI sort key
    "order_id": "ord_abc123",
    "status": "shipped",
    "entity_type": "Order",
})

# Query GSI: Get order by order_id
resp = table.query(
    IndexName="GSI1",
    KeyConditionExpression="GSI1PK = :pk",
    ExpressionAttributeValues={":pk": "ORDER#ord_abc123"},
)

# GSI2: Status index (get all orders by status)
# GSI2PK = STATUS#shipped, GSI2SK = date

Partition Key Selection

Pattern	Key Design	Rationale
User data	`USER#{user_id}`	Natural partition, bounded size
Time-series	`SENSOR#{id}#YYYY-MM-DD`	Prevent hot partition; shard by day
High-write	`ITEM#{id}#SHARD#{0-9}`	Write sharding for hot keys
Global config	`CONFIG#GLOBAL`	Single item, cache it

Redis Data Structures

Structure	Use When	Example
String	Simple key-value, counters, cache	Session data, feature flags
Hash	Object with fields	User profile fields
List	Ordered collection, queue	Job queue, recent items
Set	Unique members, intersections	Tags, online users
Sorted Set	Ranked data, range queries	Leaderboards, rate limiting
Stream	Event log, pub/sub with history	Activity feed, event sourcing

python

import redis

r = redis.Redis(decode_responses=True)

# Hash: user profile (better than serialized JSON -- update fields individually)
r.hset("user:456", mapping={"name": "Alice", "email": "alice@example.com", "login_count": "0"})
r.hincrby("user:456", "login_count", 1)
profile = r.hgetall("user:456")

# Sorted set: leaderboard
r.zadd("leaderboard:weekly", {"alice": 2500, "bob": 1800, "carol": 3200})
top_10 = r.zrevrange("leaderboard:weekly", 0, 9, withscores=True)
alice_rank = r.zrevrank("leaderboard:weekly", "alice")  # 0-indexed

# Sorted set: rate limiting (sliding window)
import time

def is_rate_limited(user_id: str, limit: int = 100, window_s: int = 60) -> bool:
    key = f"rate:{user_id}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_s)  # Remove old entries
    pipe.zadd(key, {f"{now}": now})                  # Add current request
    pipe.zcard(key)                                   # Count in window
    pipe.expire(key, window_s)                        # TTL cleanup
    _, _, count, _ = pipe.execute()
    return count > limit

# Stream: event log
r.xadd("events:orders", {"type": "created", "order_id": "ord_123", "user_id": "usr_456"})
# Read latest events
events = r.xrevrange("events:orders", count=10)

# Cache with TTL
r.setex("cache:product:789", 300, '{"name": "Widget", "price": 9.99}')  # 5min TTL

Consistency Patterns

Pattern	Consistency	Use When
Read-your-writes	Session-level	User sees their own updates immediately
Eventual consistency	None guaranteed	Analytics, feeds, non-critical reads
Strong consistency	Immediate	Financial data, inventory counts
Write-behind cache	Eventual	High-read, tolerate stale

python

# DynamoDB: strong consistency per-read
resp = table.get_item(
    Key={"PK": "USER#456", "SK": "BALANCE"},
    ConsistentRead=True,  # Costs 2x RCU but guarantees latest
)

# MongoDB: read concern + write concern
from pymongo import ReadPreference, WriteConcern

# Strong: write to majority, read from primary
collection = db.get_collection(
    "orders",
    write_concern=WriteConcern(w="majority"),
    read_preference=ReadPreference.PRIMARY,
)

# Eventual: read from secondaries (lower latency, possibly stale)
collection_eventual = db.get_collection(
    "orders",
    read_preference=ReadPreference.SECONDARY_PREFERRED,
)

Migration from Relational

Step-by-Step

code

1. Map access patterns (not tables)
   - List every SQL query your app runs
   - Group by frequency and latency requirement

2. Denormalize JOIN results
   - If you always JOIN orders + users: embed user_name in order
   - If you sometimes JOIN: reference with user_id

3. Handle relationships
   - 1:1 → embed
   - 1:few (bounded) → embed array
   - 1:many (unbounded) → reference (separate collection/item)
   - many:many → reference array on one side, or adjacency list

4. Replace transactions
   - Single-document operations are atomic in MongoDB
   - Use DynamoDB TransactWriteItems for multi-item
   - Redesign to minimize multi-document transactions

5. Migrate incrementally
   - Dual-write to both databases during transition
   - Shadow-read from NoSQL, compare with SQL results
   - Switch reads to NoSQL once validated
   - Remove SQL writes last

Relational to MongoDB Example

sql

-- Relational
SELECT o.id, o.total, u.name, u.email,
       oi.sku, oi.qty, oi.price
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN order_items oi ON oi.order_id = o.id
WHERE o.user_id = 456
ORDER BY o.created_at DESC;

python

# MongoDB: single query, no joins needed
orders = db.orders.find(
    {"user_id": "usr_456"},
    sort=[("created_at", -1)],
)
# Each order already contains:
#   user_name (denormalized)
#   items[] (embedded)

Relational to DynamoDB Example

sql

-- Relational: 3 tables, 2 joins
SELECT * FROM orders WHERE user_id = 456 AND created_at > '2024-01-01';
SELECT * FROM order_items WHERE order_id = 'abc123';

python

# DynamoDB: 2 queries, no joins
# Query 1: user's orders in date range
orders = table.query(
    KeyConditionExpression="PK = :pk AND SK BETWEEN :start AND :end",
    ExpressionAttributeValues={
        ":pk": "USER#usr_456",
        ":start": "ORDER#2024-01-01",
        ":end": "ORDER#2024-12-31",
    },
)

# Query 2: order items (if not embedded)
items = table.query(
    KeyConditionExpression="PK = :pk AND begins_with(SK, :sk)",
    ExpressionAttributeValues={":pk": "ORDER#ord_abc123", ":sk": "ITEM#"},
)

Gotchas

•Modeling entities before access patterns: NoSQL design starts with queries, not ER diagrams; design for reads, not normalization
•Unbounded arrays in MongoDB: embedding 10k comments in a post hits the 16MB limit; reference instead and paginate
•Hot partitions in DynamoDB: a single PK receiving disproportionate traffic throttles; add write sharding for hot keys
•DynamoDB 400KB item limit: embed carefully; large items hit the limit fast; store blobs in S3, reference by key
•Scanning instead of querying: DynamoDB full table scans are expensive and slow; if you need one, your key design is wrong
•Redis as primary database: Redis is a cache/data-structure server; data loss on restart unless using AOF persistence; always have a source of truth elsewhere
•Ignoring GSI costs in DynamoDB: every GSI duplicates data and consumes its own capacity; 5 GSIs on a hot table = 6x write cost
•MongoDB without indexes: queries without index support cause collection scans; use explain() to verify index usage
•Eventual consistency surprises: write then immediately read may return stale data; use strong consistency for read-after-write patterns
•Over-denormalization: duplicating user email in 10 collections means updating 10 places when it changes; denormalize what's read-heavy and rarely updated
•Forgetting TTL: cache entries and session data without expiry grow forever; set TTLs on everything temporal