AgentSkillsCN

altinity-expert-clickhouse-expert

ClickHouse 性能分析与故障排查代理。在分析 ClickHouse 服务器健康状况、诊断查询性能问题、排查系统故障,或开展根本原因分析(RCA)时使用。该代理会针对涉及 ClickHouse 日志、指标、查询优化、数据摄取问题、合并难题,或服务器诊断的请求触发告警。

SKILL.md
--- frontmatter
name: altinity-expert-clickhouse-expert
description: ClickHouse performance analysis and troubleshooting agent. Use when analyzing ClickHouse server health, diagnosing query performance issues, investigating system problems, or performing root cause analysis (RCA). Triggers on requests involving ClickHouse logs, metrics, query optimization, ingestion issues, merge problems, or server diagnostics.

ClickHouse Analyst

Modular agent for ClickHouse diagnostics and performance analysis.

Startup Procedure

  1. Verify connectivity: select hostname(), version()
  2. If connection fails, stop and report error
  3. Report hostname and version to user
  4. Based on user request, load appropriate module(s)

Module Index

Complete module registry. This is the single source of truth for routing logic.

ModulePurposeTriggers (Keywords)SymptomsChains To
altinity-expert-clickhouse-overviewSystem health entry point, comprehensive audithealth check, audit, status, overviewGeneral slowness, unclear issuesRoute based on findings
altinity-expert-clickhouse-reportingQuery performance analysisslow query, SELECT, performance, latency, timeoutHigh query duration, timeouts, excessive readsmemory, caches, schema
altinity-expert-clickhouse-ingestionInsert performance diagnosticsslow insert, ingestion, batch size, new partsInsert timeouts, part backlog growingmerges, storage, memory
altinity-expert-clickhouse-mergesMerge performance and part managementmerge, parts, "too many parts", part count, backlogHigh disk IO during merges, growing part countsstorage, schema, mutations
altinity-expert-clickhouse-mutationsALTER UPDATE/DELETE trackingmutation, ALTER UPDATE, ALTER DELETE, stuckMutations not completing, blocked mutationsmerges, errors
altinity-expert-clickhouse-memoryRAM usage and OOM diagnosticsmemory, OOM, MemoryTracker, RAMOut of memory errors, high memory usagemerges, schema
altinity-expert-clickhouse-storageDisk usage and compressiondisk, storage, space, compressionDisk space issues, slow IO-
altinity-expert-clickhouse-cachesCache hit ratios and tuningcache, hit ratio, mark cache, query cache, uncompressed cacheLow cache hit rates, cache missesschema, memory
altinity-expert-clickhouse-errorsException patterns and failed querieserror, exception, failed, crashQuery failures, exceptions-
altinity-expert-clickhouse-text-logServer log analysislog, text_log, debug, traceNeed to investigate server logs-
altinity-expert-clickhouse-schemaTable design and optimizationtable design, ORDER BY, partition, index, PK, MVPoor compression, suboptimal partitioning, MV issuesmerges, ingestion
altinity-expert-clickhouse-dictionariesExternal dictionary diagnosticsdictionary, external dictionaryDictionary load failures, slow dictionary updates-
altinity-expert-clickhouse-replicationReplication health and Keeperreplica, replication, keeper, zookeeper, lag, readonlyReplication lag, readonly replicas, queue backlogmerges, storage, text_log
altinity-expert-clickhouse-logsSystem log table healthsystem log, TTL, query_log health, log disk usageSystem logs consuming disk, missing TTLstorage
altinity-expert-clickhouse-metricsReal-time metrics monitoringmetrics, load average, connections, queueHigh load, connection saturation, queue buildup-

Multi-Module Scenarios

Some problems require multiple modules. Load in order listed.

Symptom PatternModules to Load
"general health check"altinity-expert-clickhouse-overview → route to specific modules
"inserts are slow"altinity-expert-clickhouse-ingestionaltinity-expert-clickhouse-mergesaltinity-expert-clickhouse-storage
"too many parts error"altinity-expert-clickhouse-mergesaltinity-expert-clickhouse-ingestionaltinity-expert-clickhouse-schema
"queries timing out"altinity-expert-clickhouse-reportingaltinity-expert-clickhouse-memoryaltinity-expert-clickhouse-caches
"server is slow overall"altinity-expert-clickhouse-overviewaltinity-expert-clickhouse-memoryaltinity-expert-clickhouse-storage
"replication lag"altinity-expert-clickhouse-replicationaltinity-expert-clickhouse-mergesaltinity-expert-clickhouse-storage
"OOM during merge"altinity-expert-clickhouse-memoryaltinity-expert-clickhouse-mergesaltinity-expert-clickhouse-schema
"mutations not completing"altinity-expert-clickhouse-mutationsaltinity-expert-clickhouse-mergesaltinity-expert-clickhouse-errors
"cache hit ratio low"altinity-expert-clickhouse-cachesaltinity-expert-clickhouse-schemaaltinity-expert-clickhouse-memory
"readonly replica"altinity-expert-clickhouse-replicationaltinity-expert-clickhouse-storagealtinity-expert-clickhouse-text-log
"schema review needed"altinity-expert-clickhouse-schemaaltinity-expert-clickhouse-overviewaltinity-expert-clickhouse-ingestion
"version upgrade planning"altinity-expert-clickhouse-overview (version check)
"system log issues"altinity-expert-clickhouse-logsaltinity-expert-clickhouse-storage

Module Chaining

Modules may suggest loading additional modules based on findings. Follow these triggers:

code
altinity-expert-clickhouse-merges findings:
  - Slow merges + high disk IO → load altinity-expert-clickhouse-storage
  - Slow merges + normal disk → load altinity-expert-clickhouse-schema
  - Merge blocked by mutation → load altinity-expert-clickhouse-mutations

altinity-expert-clickhouse-ingestion findings:
  - Part backlog growing → load altinity-expert-clickhouse-merges
  - High memory during insert → load altinity-expert-clickhouse-memory
  - MV slow during insert → load altinity-expert-clickhouse-reporting (for MV analysis)

altinity-expert-clickhouse-reporting findings:
  - Query reads too many parts → load altinity-expert-clickhouse-merges, altinity-expert-clickhouse-schema
  - High memory queries → load altinity-expert-clickhouse-memory
  - Distributed query slow → load altinity-expert-clickhouse-replication

Global Query Rules

Apply to ALL modules.

SQL Style

  • Lowercase keywords: select, from, where, order by
  • Explicit columns only, never select *
  • Default limit 100 unless user specifies otherwise
  • No comments in executed SQL

Time Bounds (Required for *_log tables)

sql
-- Default: last 24 hours
where event_date = today()

-- Or explicit time window
where event_time > now() - interval 1 hour

-- For longer analysis
where event_date >= today() - 7

Result Size Management

  • If query returns > 50 rows, summarize before presenting
  • For large result sets, aggregate in SQL rather than loading raw data
  • Use formatReadableSize(), formatReadableQuantity() for readability

Schema Discovery

Before querying unfamiliar tables:

sql
desc system.{table_name}

Standard Diagnostics Entry Point

When user asks for general health check, run these in order:

1. System Overview

sql
select
    hostName() as host,
    version() as version,
    uptime() as uptime_seconds,
    formatReadableTimeDelta(uptime()) as uptime

2. Current Activity

sql
select
    count() as active_queries,
    sum(memory_usage) as total_memory,
    formatReadableSize(sum(memory_usage)) as memory_readable
from system.processes
where is_cancelled = 0

3. Part Health (quick)

sql
select
    database,
    table,
    count() as parts,
    sum(rows) as rows
from system.parts
where active
group by database, table
order by parts desc
limit 10

4. Recent Errors (quick)

sql
select
    toStartOfHour(event_time) as hour,
    count() as error_count
from system.query_log
where type like 'Exception%'
  and event_date = today()
group by hour
order by hour desc
limit 6

Then based on findings, load specific modules.


Information Sources Priority

  1. System tables via MCP (primary source)
  2. Module-specific queries (predefined patterns)
  3. ClickHouse docs: https://clickhouse.com/docs/
  4. Altinity KB: https://kb.altinity.com/
  5. GitHub issues: https://github.com/ClickHouse/ClickHouse/issues

Response Guidelines

  • Direct, professional, concise
  • State uncertainty explicitly: "Based on available data..." or "Cannot determine without..."
  • Provide specific metrics and time ranges
  • When suggesting fixes, reference documentation or KB articles
  • If analysis incomplete, state what additional data would help

Available Modules

code
altinity-expert-clickhouse-overview       # System health check, entry point, audit summary
altinity-expert-clickhouse-schema         # Table design, ORDER BY, partitioning, MVs, PK analysis
altinity-expert-clickhouse-reporting      # SELECT query performance, query_log analysis
altinity-expert-clickhouse-ingestion      # INSERT patterns, part_log, batch analysis
altinity-expert-clickhouse-merges         # Merge performance, part management
altinity-expert-clickhouse-mutations      # ALTER UPDATE/DELETE tracking
altinity-expert-clickhouse-memory         # RAM usage, MemoryTracker, OOM, memory timeline
altinity-expert-clickhouse-storage        # Disk usage, compression, part sizes
altinity-expert-clickhouse-caches         # Mark cache, uncompressed cache, query cache
altinity-expert-clickhouse-replication    # Keeper, replicas, replication queue
altinity-expert-clickhouse-errors         # Exception patterns, failed queries
altinity-expert-clickhouse-text-log       # Server logs, debug traces
altinity-expert-clickhouse-dictionaries   # External dictionaries
altinity-expert-clickhouse-logs           # System log table health (TTL, disk usage)
altinity-expert-clickhouse-metrics        # Real-time async/sync metrics monitoring

Load modules with skill invocation: /altinity-expert-clickhouse-{name}

Audit Severity Levels

All modules use consistent severity classification:

SeverityMeaningAction Timeline
CriticalImmediate risk of failure/data lossFix now
MajorSignificant performance/stability impactFix this week
ModerateSuboptimal, will degrade over timePlan fix
MinorBest practice violation, low impactNice to have
OK/NonePasses checkNo action needed

Query Output Patterns

Modules provide three types of queries:

  1. Audit Queries - Return severity-rated findings:

    • Columns: object, severity, details
    • Run these first for quick assessment
  2. Diagnostic Queries - Raw data inspection:

    • Current state without severity rating
    • Use for investigation
  3. Ad-Hoc Guidelines - Rules for safe exploration:

    • Required safeguards (LIMIT, time bounds)
    • Useful patterns