AgentSkillsCN

Data Structure Selection

基于问题需求、性能约束与运行特性,系统性地选择最优的数据结构,助力决策更加高效、精准

SKILL.md
--- frontmatter
name: Data Structure Selection
description: Systematic decision-making process for choosing optimal data structure based on problem requirements, performance constraints, and operational characteristics

Data Structure Selection Framework

What It Is

A systematic decision-making process for choosing the optimal data structure based on problem requirements, performance constraints, and operational characteristics. Rather than defaulting to familiar structures, this framework evaluates trade-offs across time complexity, space usage, and modification patterns to select the structure that best matches actual access patterns and constraints.

When to Use It

  • Designing a new system component with performance requirements
  • Operations have known frequency distribution (e.g., 90% reads, 10% writes)
  • Memory constraints exist (embedded systems, mobile, large-scale systems)
  • Multiple data structures could work but unclear which is optimal
  • Need to justify data structure choice in code review or design doc
  • Migrating from one structure to another requires analysis

Execution Steps

1. Characterize the Data

Identify what you're storing and its intrinsic properties. Size matters: 100 items behave differently than 1 million. Relationships matter: hierarchical (tree), connected (graph), sequential (array), or key-value pairs (map)?

Action: Answer these questions: What is n (data size)? Are elements homogeneous or heterogeneous? Are there natural relationships between elements? Is data sparse or dense? Write down: "Storing N={size} {type} with {relationship}."

2. Profile the Operations

List all operations the data structure must support and estimate their relative frequency. The operation mix dramatically affects the optimal choice. Don't optimize for rare operations at the expense of common ones.

Action: Create a table: [Operation | Frequency % | Required Complexity]. Common operations: insert, delete, search, access by index, iterate, find min/max. Mark the 2-3 most frequent operations - these drive the decision.

3. Map Requirements to Complexity Classes

For each critical operation, determine acceptable time complexity. Some operations might have hard constraints (e.g., lookup must be O(1)), while others can tolerate O(log n) or O(n).

Action: For each frequent operation, write target complexity. Example: "Search: O(log n) required, O(1) ideal" or "Iterate: O(n) acceptable." Consider amortized vs worst-case guarantees based on use case.

4. Evaluate Candidate Structures

Generate 2-4 candidate data structures that plausibly meet requirements. For each candidate, document actual complexities for your operations and space overhead.

Action: Create comparison matrix with columns: [Structure | Insert | Delete | Search | Space | Notes]. Include both theoretical complexity and practical constants (hash table lookup is O(1) but with high constant factor).

5. Apply Decision Heuristics

Use problem-specific heuristics to narrow choices. For graphs: dense (>50% edges) → adjacency matrix, sparse → adjacency list. For repeated queries: precompute with extra space. For write-heavy: simpler structures with lower overhead. For read-heavy: complex structures with better query performance.

Action: Apply these filters: "If graph density > 50% → matrix. If repeated deletions → avoid arrays. If need ordering + fast lookup → balanced BST or skip list. If memory critical → compact structures (arrays over linked lists)."

6. Validate with Prototyping

Build a minimal prototype with the top candidate(s) using representative data. Measure actual performance, not just theoretical complexity. Memory profiling often reveals surprises (e.g., pointer overhead in linked structures).

Action: Implement core operations for top 2 candidates. Run benchmarks with realistic data size and access patterns. Profile memory usage. If performance is within 2x and code is simpler, choose the simpler structure.

Real-World Applications

Database Systems

  • B-tree indexes for disk-based storage: logarithmic search with high fanout
  • Hash indexes for in-memory databases: O(1) lookups for exact matches
  • LSM trees for write-heavy workloads: append-only structure with periodic compaction

Compilers & Interpreters

  • Symbol tables: hash tables for name resolution, scoped hashmaps for lexical environments
  • Abstract syntax trees: tree structures for representing parsed code
  • Live variable analysis: bitmaps or sets for tracking variable liveness

Network Systems

  • Routing tables: prefix trees (tries) for longest prefix matching
  • LRU caches: hash table + doubly linked list for O(1) access and eviction
  • Connection pools: circular buffers for fixed-size resource management

Game Development

  • Spatial partitioning: quadtrees/octrees for collision detection
  • Entity component systems: parallel arrays for cache-friendly iteration
  • Pathfinding: priority queues (heaps) for A* algorithm

Operating Systems

  • Process scheduling: multilevel feedback queues, red-black trees for fair scheduling
  • Memory management: buddy allocator with free lists, slab allocator for fixed-size objects
  • File systems: B-trees for directory structures, extent trees for file allocation

Anti-Patterns

Using linked lists by default → Poor cache locality and high pointer overhead (24 bytes/node on 64-bit); use dynamic arrays unless insertions/deletions in middle are frequent.

Hash tables for small datasets → Overhead exceeds benefit for n<100; linear search in array is often faster due to cache effects.

Balanced BST when ordering isn't needed → Unnecessary complexity; use hash table for unordered lookups.

Array when frequent insertions/deletions at arbitrary positions → O(n) operations dominate; consider linked structures or gap buffers.

Graph adjacency matrix for sparse graphs → Wastes Θ(V²) space when edges << V²; use adjacency lists.

Single structure for all operations → Sometimes two structures in tandem (e.g., hash + heap for O(1) lookup + O(log n) priority) outperform one compromise structure.

Success Metrics

  • Actual benchmark performance meets requirements (not just theoretical complexity)
  • Memory footprint within acceptable bounds (measure with profiler)
  • Code clarity and maintainability (simpler structure preferred if performance similar)
  • Graceful scaling to expected maximum data size
  • Low latency variance for latency-critical applications (consider worst-case, not just average)

Related Frameworks

  • Algorithm Selection: Choosing algorithms often depends on underlying data structure
  • Space-Time Tradeoff: Explicit framework for trading memory for speed (memoization, indexing)
  • Cache-Oblivious Algorithms: Designs that perform well regardless of cache parameters
  • Amortized Analysis: Understanding amortized costs for dynamic structures (dynamic arrays, splay trees)

Common Pitfalls

  • Optimizing for worst-case when average-case matters (or vice versa)
  • Ignoring cache effects and memory hierarchy (theory says O(1), practice says cache miss)
  • Not considering growth patterns (static vs dynamic sizing)
  • Choosing structure based on familiarity rather than requirements
  • Forgetting to account for overhead (pointers, alignment, metadata)
  • Not testing with realistic data distributions (sorted, random, adversarial)

Tools & Resources

  • Profiling: Valgrind (memory), perf (CPU cache misses), language-specific profilers
  • Benchmarking: Google Benchmark (C++), JMH (Java), Criterion (Rust)
  • Visualization: Data Structure Visualizations (cs.usfca.edu), VisuAlgo
  • References: "Algorithm Design Manual" (Skiena Ch. 3), "Introduction to Algorithms" (CLRS Part III)
  • Interactive: Big-O Cheat Sheet (bigocheatsheet.com), Data Structure Complexity Reference

Decision Matrix Example

Problem: Design a cache with 10K entries, 95% reads, 5% writes, must support O(1) lookup and LRU eviction.

StructureLookupInsertDeleteEvict LRUSpaceChoice
Hash TableO(1)O(1)O(1)O(n)ModerateNo - eviction too slow
Balanced BSTO(log n)O(log n)O(log n)O(1)*ModerateNo - lookup not O(1)
Hash + DLLO(1)O(1)O(1)O(1)HighYES - meets all requirements

*With additional tracking; final choice is hash table + doubly linked list (standard LRU implementation).


Framework Type: Data Structure Design Domain: Computer Science, Software Engineering Practitioner Score: 9/10 - Critical for performance optimization, rarely taught systematically Complexity: Medium - Requires understanding of complexity analysis and system constraints Prerequisites: Big-O notation, basic data structures (arrays, lists, trees, hash tables), profiling tools