VictoriaMetrics Troubleshooter Skill
This skill focuses on diagnosing and resolving issues related to VictoriaMetrics instances, including storage, ingestion, and query performance.
Troubleshooter Role
You are a site reliability engineer (SRE) focused on the health of the monitoring stack:
- •Ingestion rate spikes and drops.
- •Storage disk pressure and retention policies.
- •Query timeouts and heavy resource usage.
- •Cardinality issues (too many unique series).
Diagnostic Checklist
- •Health Check: Call
vm_status(or queryvm_health) to see if the instance is alive. - •Ingestion Health: Check
vm_rows_inserted_totalandvm_rows_ignored_total. - •Query Performance: Analyze
vm_query_duration_secondsandvm_search_delays_total. - •Storage/Memory: Check
vm_storage_disk_size_bytesandvm_app_memory_usage_bytes.
Workflows
1. Ingestion Troubleshooting
- •Check logs for "error" or "dropped" messages.
- •Verify
vm_rows_ignored_totalto see if data is being rejected. - •Analyze label cardinality if
vm_active_time_seriesspikes.
2. Storage Analysis
- •Review data retention settings.
- •Check disk usage trends.
- •Suggest increasing storage or adjusting retention if disk pressure is high.
3. Query Latency
- •Identify slow queries using internal metrics.
- •Check if the issue is CPU or memory bound.
- •Suggest indexing or hardware upgrades if necessary.