AgentSkillsCN

diagnose

对Apache Cassandra集群进行系统化的故障排查。在诊断性能问题、延迟问题、节点故障,或处理异常行为时使用此功能。

SKILL.md
--- frontmatter
name: diagnose
description: Systematic troubleshooting for Apache Cassandra clusters. Use when diagnosing performance issues, latency problems, node failures, or unexpected behavior.
argument-hint: [symptom, error message, or problem description]

Cassandra Diagnostics

You are an expert Cassandra troubleshooter applying systematic diagnostic methodologies.

Problem Solving Strategy

Double Loop Learning Approach

When troubleshooting Cassandra issues, apply double loop learning:

Single Loop (Immediate Fix):

  • Identify the immediate symptom
  • Apply a quick fix to restore service
  • Document what was done

Double Loop (Root Cause & Prevention):

  • Question the underlying assumptions that led to the problem
  • Analyze why the system design allowed this issue to occur
  • Challenge existing mental models about how the cluster should behave
  • Implement systemic changes to prevent recurrence
  • Update monitoring and alerting to catch early warning signs

Always ask: "Why did our existing approach fail to prevent this?"

USE Method for System Resource Analysis

Apply the USE Method (Utilization, Saturation, Errors) systematically to each resource:

CPU:

  • Utilization: top, mpstat, nodetool tpstats for thread pool usage
  • Saturation: Run queue length, thread pool pending tasks
  • Errors: Check for compaction/repair failures due to CPU constraints

Memory:

  • Utilization: Heap usage, off-heap (memtables, bloom filters, compression metadata)
  • Saturation: GC frequency and duration, OOM errors
  • Errors: java.lang.OutOfMemoryError, allocation failures

Disk I/O:

  • Utilization: iostat %util, read/write throughput
  • Saturation: await latency, queue depth
  • Errors: Disk errors in system logs, failed compactions

Network:

  • Utilization: Bandwidth usage between nodes, client connections
  • Saturation: Dropped messages (nodetool tpstats), connection timeouts
  • Errors: Gossip failures, streaming errors, hinted handoff backlog

Storage:

  • Utilization: Disk space per node, data directory usage
  • Saturation: Approaching disk limits, compaction unable to keep up
  • Errors: Write failures due to full disk

Thread Pools:

  • Not a hardware resource, but has similar characteristics
  • Utilization: Active Tasks
  • Saturation: Pending Tasks
  • Errors: Error Rate

Outlier Analysis

When diagnosing issues, always compare nodes to identify outliers:

Key Questions:

  • Are some nodes showing different latency profiles?
  • Is one node handling disproportionate traffic (hot partition)?
  • Do some nodes have different resource utilization patterns?
  • Are there nodes with more tombstones or larger SSTables?
  • Is gossip healthy across all nodes?

Comparison Points:

  • Read/write latencies per node (nodetool tablehistograms)
  • Compaction pending per node
  • Thread pool statistics across nodes
  • GC behavior differences
  • Disk utilization variance
  • Network latency between specific node pairs

Tools:

  • nodetool status - basic health overview
  • nodetool netstats - streaming and network state
  • nodetool tpstats - thread pool comparison
  • Metrics systems (Prometheus/Grafana) for time-series comparison
  • Profiling: Async Profiler. See Jon's video: https://www.youtube.com/watch?v=yNZtnzjyJRI

Common Issues Checklist

Latency Issues

  • Check GC logs for long pauses
  • Review thread pool saturation (nodetool tpstats)
  • Look for disk I/O saturation (iostat)
  • Check for tombstone-heavy reads
  • Verify network latency between nodes
  • Review consistency level vs replication factor

Node Failures

  • Check gossip status (nodetool gossipinfo)
  • Review system logs for errors
  • Verify disk space and health
  • Check for network partitions
  • Review hinted handoff status

Compaction Issues

  • Check pending compaction tasks
  • Review compaction throughput settings
  • Look for large partitions blocking compaction
  • Verify disk space for compaction headroom

Memory Pressure

  • Review heap usage and GC frequency
  • Check off-heap memory (memtables, bloom filters)
  • Look for memory leaks in native memory
  • Verify JVM settings match workload

Diagnostic Commands

bash
# Overall status
nodetool status
nodetool info

# Thread pools
nodetool tpstats

# Table statistics
nodetool tablestats <keyspace>.<table>
nodetool tablehistograms <keyspace>.<table>

# Compaction
nodetool compactionstats
nodetool compactionhistory

# Network
nodetool netstats
nodetool gossipinfo

# Ring and token distribution
nodetool ring
nodetool describecluster

Guidelines

  1. Start with USE method - systematically check each resource
  2. Compare nodes to find outliers
  3. Correlate symptoms with recent changes (deployments, traffic patterns, config changes)
  4. Check the simple things first (disk space, network connectivity)
  5. Use double loop learning to prevent recurrence