AgentSkillsCN

altinity-expert-clickhouse-errors

深入排查 ClickHouse 查询失败、异常、崩溃及各类错误模式。适用于错误分析与故障调查。

SKILL.md
--- frontmatter
name: altinity-expert-clickhouse-errors
description: Investigate ClickHouse query failures, exceptions, crashes, and error patterns. Use for error analysis and failure investigation.

Error Investigation and Exception Analysis

Investigate query failures, exceptions, crashes, and error patterns.


Quick Diagnostics

1. Recent Errors Summary

sql
select
    toStartOfHour(event_time) as hour,
    count() as error_count,
    uniq(exception_code) as unique_errors,
    uniq(user) as users_affected
from system.query_log
where type like 'Exception%'
  and event_date = today()
group by hour
order by hour desc

2. Error Distribution by Code

sql
select
    exception_code,
    count() as occurrences,
    any(substring(exception, 1, 100)) as example_message,
    min(event_time) as first_seen,
    max(event_time) as last_seen
from system.query_log
where type like 'Exception%'
  and event_date >= today() - 1
group by exception_code
order by occurrences desc
limit 30

3. Recent Exceptions Detail

sql
select
    event_time,
    user,
    exception_code,
    substring(exception, 1, 200) as exception,
    query_kind,
    substring(query, 1, 100) as query_preview
from system.query_log
where type like 'Exception%'
  and event_date = today()
order by event_time desc
limit 50

Common Error Analysis

Memory Errors (Code 241)

sql
select
    event_time,
    user,
    formatReadableSize(memory_usage) as memory_at_failure,
    formatReadableSize(read_bytes) as read_bytes,
    substring(exception, 1, 150) as exception,
    substring(query, 1, 80) as query_preview
from system.query_log
where type like 'Exception%'
  and exception_code = 241  -- MEMORY_LIMIT_EXCEEDED
  and event_date >= today() - 1
order by event_time desc
limit 30

Solutions: See altinity-expert-clickhouse-memory for memory optimization.

Too Many Parts (Code 252)

sql
select
    event_time,
    user,
    arrayStringConcat(tables, ', ') as tables,
    substring(exception, 1, 150) as exception
from system.query_log
where type like 'Exception%'
  and exception_code = 252  -- TOO_MANY_PARTS
  and event_date >= today() - 1
order by event_time desc
limit 30

Solutions: See altinity-expert-clickhouse-merges and altinity-expert-clickhouse-ingestion for part management.

Timeout Errors (Code 159)

sql
select
    event_time,
    user,
    query_duration_ms,
    formatReadableSize(read_bytes) as read_bytes,
    substring(query, 1, 100) as query_preview
from system.query_log
where type like 'Exception%'
  and exception_code = 159  -- TIMEOUT_EXCEEDED
  and event_date >= today() - 1
order by event_time desc
limit 30

Table/Column Not Found (Codes 60, 16)

sql
select
    event_time,
    user,
    exception_code,
    substring(exception, 1, 150) as exception,
    substring(query, 1, 100) as query_preview
from system.query_log
where type like 'Exception%'
  and exception_code in (60, 16)  -- TABLE_DOESNT_EXIST, NO_SUCH_COLUMN_IN_TABLE
  and event_date = today()
order by event_time desc
limit 30

Crash Analysis

Recent Crashes

sql
select
    event_time,
    signal,
    thread_id,
    query_id,
    substring(trace_full, 1, 500) as stack_trace,
    substring(query, 1, 100) as query_preview
from system.crash_log
where event_time > now() - interval 7 day
order by event_time desc
limit 20

Crash Summary

sql
select
    toDate(event_time) as day,
    count() as crashes,
    groupUniqArray(signal) as signals
from system.crash_log
where event_time > now() - interval 30 day
group by day
order by day desc

Error Patterns by User/Client

Errors by User

sql
select
    user,
    count() as errors,
    groupUniqArray(exception_code) as error_codes,
    max(event_time) as last_error
from system.query_log
where type like 'Exception%'
  and event_date = today()
group by user
order by errors desc

Errors by Client

sql
select
    client_hostname,
    client_name,
    count() as errors,
    groupUniqArray(exception_code) as error_codes
from system.query_log
where type like 'Exception%'
  and event_date = today()
group by client_hostname, client_name
order by errors desc
limit 20

Error Log from text_log

Critical/Error Level Messages

sql
select
    event_time,
    level,
    logger_name,
    substring(message, 1, 200) as message
from system.text_log
where level in ('Fatal', 'Critical', 'Error')
  and event_time > now() - interval 1 hour
order by event_time desc
limit 50

Errors by Component

sql
select
    logger_name,
    count() as errors,
    max(event_time) as last_seen
from system.text_log
where level in ('Fatal', 'Critical', 'Error')
  and event_time > now() - interval 24 hour
group by logger_name
order by errors desc
limit 30

System Warnings

sql
select message as warning
from system.warnings

Error Code Reference

CodeNameCommon Cause
60TABLE_DOESNT_EXISTWrong table name or database
62SYNTAX_ERRORInvalid SQL
159TIMEOUT_EXCEEDEDQuery too slow
241MEMORY_LIMIT_EXCEEDEDQuery uses too much RAM
252TOO_MANY_PARTSInsert too fast, merges behind
319UNKNOWN_PACKETNetwork/client issues
341UNFINISHEDOperation interrupted
16NO_SUCH_COLUMNColumn doesn't exist
36CANNOT_READ_ALL_DATACorruption or network
164READONLYReplica in readonly mode
242TABLE_IS_READ_ONLYTable locked
243TABLE_IS_DROPPEDConcurrent DROP
254RECEIVED_ERROR_FROM_REMOTEDistributed query failure

Distributed Query Errors

sql
select
    event_time,
    initial_query_id,
    exception_code,
    substring(exception, 1, 150) as exception,
    substring(query, 1, 80) as query_preview
from system.query_log
where type like 'Exception%'
  and event_date = today()
  and (exception_code = 254 or query ilike '%distributed%')
order by event_time desc
limit 30

Error Rate Analysis

Error Rate Over Time

sql
select
    toStartOfFiveMinutes(event_time) as ts,
    count() as total_queries,
    countIf(type like 'Exception%') as errors,
    round(100.0 * countIf(type like 'Exception%') / count(), 2) as error_rate_pct
from system.query_log
where event_time > now() - interval 6 hour
group by ts
order by ts desc

Error Rate by Query Type

sql
select
    query_kind,
    count() as total,
    countIf(type like 'Exception%') as errors,
    round(100.0 * countIf(type like 'Exception%') / count(), 2) as error_rate_pct
from system.query_log
where event_date = today()
group by query_kind
order by errors desc

Ad-Hoc Query Guidelines

Required Safeguards

sql
-- Time bound
where event_date >= today() - 1
-- or
where event_time > now() - interval 1 hour

-- Limit results
limit 100

Key Tables

  • system.query_log (type like 'Exception%') - Query failures
  • system.text_log (level in Error/Critical/Fatal) - Server errors
  • system.crash_log - Server crashes
  • system.warnings - Active warnings

Cross-Module Triggers

FindingLoad ModuleReason
Memory errorsaltinity-expert-clickhouse-memoryMemory analysis
TOO_MANY_PARTSaltinity-expert-clickhouse-merges, altinity-expert-clickhouse-ingestionPart management
Replication errorsaltinity-expert-clickhouse-replicationReplica status
Distributed errorsaltinity-expert-clickhouse-replicationCluster health
Unknown errorsaltinity-expert-clickhouse-text-logDeep log analysis
Crashesaltinity-expert-clickhouse-text-logStack trace analysis

Alerting Recommendations

Set alerts for:

  • Error rate > 5%
  • Memory errors > 10/hour
  • Any crashes
  • Readonly replica errors
  • TOO_MANY_PARTS errors