platxa-monitoring

Observability guide for the Platxa platform using Prometheus metrics and Loki logs. Query metrics, analyze logs, configure alerts, and troubleshoot issues.

SKILL.md
--- frontmatter
name: platxa-monitoring
description: Observability guide for Platxa platform using Prometheus metrics and Loki logs. Query metrics, analyze logs, configure alerts, and troubleshoot issues.
allowed-tools:
  - Read
  - Bash
  - Glob
  - Grep
metadata:
  version: "1.0.0"
  tags:
    - guide
    - monitoring
    - prometheus
    - loki
    - observability

Platxa Monitoring

Guide for observability in the Platxa platform using Prometheus metrics and Loki logs.

Overview

This skill covers the complete observability stack:

| Component | Purpose | Access |
| --- | --- | --- |
| Prometheus | Metrics collection and alerting | Port 9090 |
| Loki | Log aggregation and querying | Port 3100 |
| Grafana | Visualization dashboards | Port 3000 |
| Alertmanager | Alert routing and notification | Port 9093 |
| Fluent Bit | Log collection from pods | DaemonSet |

Prerequisites

Verify that the monitoring stack is running:

bash
kubectl get pods -n monitoring
# Expect pods named prometheus-*, loki-*, grafana-*, fluent-bit-*

# Access Grafana locally at http://localhost:3000
kubectl port-forward svc/grafana 3000:80 -n monitoring

Prometheus Metrics

Common PromQL Queries

Instance Resource Metrics

promql
# Memory usage (bytes)
container_memory_working_set_bytes{
  namespace="instance-{name}",
  container="odoo"
}

# CPU usage (millicores)
sum(rate(container_cpu_usage_seconds_total{
  namespace="instance-{name}",
  container="odoo"
}[5m])) * 1000

# Storage usage (ratio)
kubelet_volume_stats_used_bytes{namespace="instance-{name}"}
/
kubelet_volume_stats_capacity_bytes{namespace="instance-{name}"}
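
The `{name}` placeholder in the queries above stands for an instance identifier. A minimal Python sketch of filling it in follows; the helper `instance_queries` and the dictionary keys are hypothetical names for illustration, not part of the skill.

```python
# Hypothetical helper: fill an instance name into the PromQL templates above.
# Doubled braces ({{ }}) are literal braces in str.format templates.
QUERY_TEMPLATES = {
    "memory_bytes": (
        'container_memory_working_set_bytes'
        '{{namespace="instance-{name}", container="odoo"}}'
    ),
    "cpu_millicores": (
        'sum(rate(container_cpu_usage_seconds_total'
        '{{namespace="instance-{name}", container="odoo"}}[5m])) * 1000'
    ),
    "storage_ratio": (
        'kubelet_volume_stats_used_bytes{{namespace="instance-{name}"}}'
        ' / kubelet_volume_stats_capacity_bytes{{namespace="instance-{name}"}}'
    ),
}


def instance_queries(name: str) -> dict:
    """Return the three resource queries for one instance name."""
    return {key: tpl.format(name=name) for key, tpl in QUERY_TEMPLATES.items()}


queries = instance_queries("abc123xy")
print(queries["memory_bytes"])
```

Keeping the templates in one place makes it harder for the namespace label to drift between the memory, CPU, and storage queries.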

Waking Service Metrics

promql
# Instance state distribution
waking_instances_by_state{state="running"}
waking_instances_by_state{state="sleeping"}
waking_instances_by_state{state="waking"}
waking_instances_by_state{state="error"}

# Total tracked instances
waking_instances_total

PostgreSQL Metrics

promql
# Database size
pg_database_size{datname=~"instance_.*"}

# Active connections per database
sum by (datname) (pg_stat_activity_count{state="active"})

# Slow queries (>30s)
pg_slow_queries_count

Recording Rules

Pre-computed metrics for dashboard efficiency:

| Recording Rule | Description |
| --- | --- |
| instance:memory_usage:ratio | Memory usage as a ratio of the limit |
| instance:cpu_usage:millicores | CPU usage in millicores |
| instance:storage_usage:ratio | Storage usage as a ratio of capacity |
| instance:restarts:1h | Restart count over the last hour |
| postgresql:connections:by_database | Connections per database |

ServiceMonitors

Automatic scrape targets via Prometheus Operator:

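| Target | Namespace | Port | Interval |
| --- | --- | --- | --- |
| postgres-exporter | postgres-system | 9187 | 30s |
| traefik | traefik-system | 8082 | 30s |
| waking-service | traefik-system | 9100 | 30s |
| loki | monitoring | 3100 | 30s |
| cert-manager | cert-manager | 9402 | 60s |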

Loki Logs

LogQL Query Patterns

Basic Label Filtering

logql
# All logs from an instance
{namespace="instance-abc123xy"}

# Specific container
{namespace="instance-abc123xy", container="odoo"}

# Multiple namespaces (regex)
{namespace=~"instance-.*"}

Pattern Matching

logql
# Contains error (case insensitive)
{namespace=~"instance-.*"} |~ "(?i)error"

# Contains the literal string (case sensitive)
{namespace=~"instance-.*"} |= "FATAL"

# Exclude pattern
{namespace=~"instance-.*"} != "healthcheck"

# Regex pattern
{namespace=~"instance-.*"} |~ "connection refused|timeout"

Aggregations

logql
# Error count over time
count_over_time({namespace="instance-abc123xy"} |~ "ERROR" [5m])

# Error rate (matching lines per second, averaged over 1m)
rate({namespace=~"instance-.*"} |~ "ERROR" [1m])

# Top namespaces by log volume
topk(10, sum by (namespace) (rate({namespace=~"instance-.*"}[5m])))
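
The patterns above all share one shape: a label selector, optional line filters, then an optional range aggregation. A small Python sketch of composing them follows. The function names are hypothetical, and `json.dumps` is used only as an approximation of LogQL's double-quoted string escaping.

```python
import json
from typing import Optional


def quote(value: str) -> str:
    # json.dumps yields a double-quoted string with " and \ escaped;
    # treated here as an approximation of LogQL string quoting.
    return json.dumps(value)


def build_logql(namespace: str, container: Optional[str] = None,
                contains: Optional[str] = None, regex: Optional[str] = None,
                exclude: Optional[str] = None) -> str:
    """Compose a stream selector followed by optional line filters."""
    selector = [f"namespace={quote(namespace)}"]
    if container:
        selector.append(f"container={quote(container)}")
    query = "{" + ", ".join(selector) + "}"
    if contains:
        query += f" |= {quote(contains)}"   # line contains literal string
    if regex:
        query += f" |~ {quote(regex)}"      # line matches regex
    if exclude:
        query += f" != {quote(exclude)}"    # line does not contain string
    return query


def count_over_time(query: str, window: str = "5m") -> str:
    """Wrap a log query in a count_over_time range aggregation."""
    return f"count_over_time({query} [{window}])"


print(count_over_time(build_logql("instance-abc123xy", regex="ERROR")))
```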

Log Labels

Fluent Bit enriches logs with Kubernetes metadata:

| Label | Source | Example |
| --- | --- | --- |
| namespace | Pod namespace | instance-abc123xy |
| container | Container name | odoo |
| pod | Pod name | odoo-abc123xy-7f8b9c |
| app | Pod label | odoo |
| job | Static label | fluentbit |

Multiline Log Handling

Python stack traces are automatically combined into a single log entry:

code
# Fluent Bit parser detects:
# - "Traceback (most recent call last):"
# - Indented continuation lines
# - "Error:", "Exception:", "Warning:"
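
To illustrate the grouping idea only, here is a Python sketch that folds a traceback into one record using the three cues listed above. It is not Fluent Bit's parser and does not reflect its configuration; the function name and the exact regular expression are assumptions.

```python
import re

TRACEBACK_START = "Traceback (most recent call last):"
# Final traceback line, e.g. "ValueError: bad input" or "UserWarning: ...".
FINAL_LINE = re.compile(r"^[\w.]*(Error|Exception|Warning):")


def combine_tracebacks(lines):
    """Merge each traceback (start line, indented lines, final line) into one record."""
    records, current = [], None
    for line in lines:
        if current is None:
            if line.startswith(TRACEBACK_START):
                current = [line]            # open a new multiline record
            else:
                records.append(line)        # ordinary single-line record
        elif line.startswith((" ", "\t")):
            current.append(line)            # indented continuation line
        else:
            if FINAL_LINE.match(line):
                current.append(line)        # exception line closes the record
                records.append("\n".join(current))
                current = None
            else:
                # Traceback ended without a recognised final line.
                records.append("\n".join(current))
                current = [line] if line.startswith(TRACEBACK_START) else None
                if current is None:
                    records.append(line)
    if current is not None:
        records.append("\n".join(current))  # flush an unterminated traceback
    return records


sample_lines = [
    "INFO start",
    "Traceback (most recent call last):",
    '  File "a.py", line 1, in <module>',
    "    run()",
    "ValueError: bad input",
    "INFO done",
]
combined = combine_tracebacks(sample_lines)
print(len(combined))
```

With this input the four traceback lines end up in a single record, so a search for `ValueError` in Loki returns the full stack trace rather than one isolated line.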

Alerting

Alert Categories

Infrastructure Alerts

| Alert | Condition | Severity |
| --- | --- | --- |
| PostgreSQLDown | Target unreachable | critical |
| TraefikDown | Target unreachable | critical |
| WakingServiceDown | Target unreachable | critical |
| CertificateExpiringSoon | <14 days | warning |
| CertificateExpiringCritical | <3 days | critical |

Instance Alerts

| Alert | Condition | Severity |
| --- | --- | --- |
| OdooStorageHigh | >90% used | warning |
| OdooStorageCritical | >95% used | critical |
| OdooHighMemory | >85% used | warning |
| OdooOOMKilled | Container killed | critical |
| OdooPodRestartLoop | >3 restarts/hour | warning |
| OdooWakeFailed | Scale-up failed | critical |
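
When reading a raw ratio from Prometheus, it helps to know which alert it would trip. The sketch below encodes the storage and memory thresholds from the table; the function and dictionary names are hypothetical, and the real rules (in the PrometheusRule resources) may add conditions such as a hold duration that this ignores.

```python
from typing import Optional

# Thresholds copied from the Instance Alerts table, as ratios of the limit.
# Each list is ordered from most to least severe.
THRESHOLDS = {
    "storage": [
        (0.95, "critical", "OdooStorageCritical"),
        (0.90, "warning", "OdooStorageHigh"),
    ],
    "memory": [
        (0.85, "warning", "OdooHighMemory"),
    ],
}


def classify(kind: str, ratio: float) -> Optional[tuple]:
    """Return (alert name, severity) for the highest threshold exceeded, else None."""
    for limit, severity, alert in THRESHOLDS[kind]:
        if ratio > limit:
            return alert, severity
    return None


print(classify("storage", 0.97))
print(classify("memory", 0.92))
```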

Database Alerts

| Alert | Condition | Severity |
| --- | --- | --- |
| PostgreSQLHighConnections | >20 active per DB | warning |
| PostgreSQLTotalConnectionsCritical | >150 total | critical |
| PostgreSQLSlowQueries | >3 queries >30s | warning |

Log-Based Alerts (Loki)

| Alert | LogQL Pattern | Severity |
| --- | --- | --- |
| OdooDBConnectionError | DB connection errors | critical |
| OdooHighErrorRate | >50 errors in 5m | warning |

Alertmanager Routing

yaml
# Routing summary (descriptive; not a literal Alertmanager config)
routes:
  critical: platform-odoo  # webhook, 10s group wait
  warning: platform-odoo   # 30s group wait
  "null": silence          # Watchdog, informational

webhook: Bearer token auth to Odoo platform
grouping: By namespace and alertname
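
The routing summary above can be read as a small decision function. The Python sketch below is one interpretation of it, not the actual Alertmanager configuration: it assumes that Watchdog alerts and anything without a critical or warning severity go to the null receiver, and the names `Route`, `route_alert`, and `group_key` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    receiver: str
    group_wait_seconds: Optional[int]  # None when the alert is silenced


def route_alert(labels: dict) -> Route:
    """Pick a receiver from alert labels, following the summary above."""
    if labels.get("alertname") == "Watchdog":
        return Route("null", None)
    severity = labels.get("severity")
    if severity == "critical":
        return Route("platform-odoo", 10)
    if severity == "warning":
        return Route("platform-odoo", 30)
    return Route("null", None)  # assumption: informational alerts are silenced


def group_key(labels: dict) -> tuple:
    """Alerts sharing this key are batched into one notification."""
    return (labels.get("namespace"), labels.get("alertname"))


alert = {"alertname": "OdooOOMKilled", "severity": "critical",
         "namespace": "instance-abc123xy"}
print(route_alert(alert), group_key(alert))
```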

Workflow

Investigating Issues

Step 1: Identify Issue Type

| Symptom | Primary Tool | Secondary |
| --- | --- | --- |
| Slow response | Prometheus | Loki |
| Crashes/restarts | Loki | Prometheus |
| Out of memory | Prometheus | Loki |
| Connection errors | Loki | Prometheus |
| High resource usage | Prometheus | - |

Step 2: Scope to Target

bash
# Find instance namespace
kubectl get ns -l platxa.io/tier=instance | grep {name}

# Identify components
kubectl get pods -n instance-{name}

Step 3: Query Data

Prometheus (via Grafana or API):

bash
# API query (assumes Prometheus is reachable on localhost:9090, e.g. via a port-forward)
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=instance:memory_usage:ratio{namespace="instance-abc123xy"}'
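
The same instant query can be issued from Python with only the standard library. This is a sketch under the assumption that Prometheus is reachable on `localhost:9090`; the function names are hypothetical, and the sample payload is illustrative data in the shape the HTTP API returns for an instant vector, not a real result.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable here, e.g. through a port-forward.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(query: str, base: str = PROMETHEUS_URL) -> str:
    """Build the instant-query URL, URL-encoding the PromQL expression."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": query})


def parse_vector(payload: dict) -> list:
    """Turn an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def instant_query(query: str) -> list:
    """Run the query against the live API (needs network access)."""
    with urllib.request.urlopen(build_query_url(query), timeout=10) as resp:
        return parse_vector(json.load(resp))


# Illustrative payload only; the timestamp and value are made up.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"namespace": "instance-abc123xy"},
             "value": [1700000000, "0.92"]}
        ],
    },
}
print(parse_vector(sample))
```

Sample values arrive as strings in the API response, which is why `parse_vector` converts them with `float`.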

Loki (via Grafana or API):

bash
# API query (log queries need the range endpoint; assumes Loki is reachable on localhost:3100)
curl -G 'http://localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={namespace="instance-abc123xy"} |~ "ERROR"'
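
A Python counterpart for Loki, again using only the standard library. The sketch targets the range endpoint (`/loki/api/v1/query_range`) because a log query returns streams over a time window. It assumes Loki is reachable on `localhost:3100`; the function names are hypothetical and the sample payload is illustrative, shaped like a `streams` response.

```python
import urllib.parse

# Assumption: Loki is reachable here, e.g. through a port-forward.
LOKI_URL = "http://localhost:3100"


def build_query_range_url(query: str, start_ns: int, end_ns: int,
                          limit: int = 100, base: str = LOKI_URL) -> str:
    """Build a range-query URL; start and end are nanosecond Unix timestamps."""
    params = {"query": query, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base}/loki/api/v1/query_range?" + urllib.parse.urlencode(params)


def parse_streams(payload: dict) -> list:
    """Flatten a streams response into (timestamp_ns, line) pairs, oldest first."""
    if payload.get("status") != "success":
        raise RuntimeError("query failed")
    entries = []
    for stream in payload["data"]["result"]:
        for ts_ns, line in stream["values"]:
            entries.append((int(ts_ns), line))
    return sorted(entries)


# Illustrative payload only; timestamps and log lines are made up.
sample_streams = {
    "status": "success",
    "data": {
        "resultType": "streams",
        "result": [
            {"stream": {"namespace": "instance-abc123xy", "container": "odoo"},
             "values": [["1700000060000000000", "ERROR second"],
                        ["1700000000000000000", "ERROR first"]]},
        ],
    },
}
print(parse_streams(sample_streams))
```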

Step 4: Correlate

  • Match metric spikes with log timestamps
  • Check alert history in Alertmanager
  • Review recent events: kubectl get events -n instance-{name}
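
Matching metric spikes with log timestamps can be sketched as a windowed lookup. The Python below assumes both series use Unix seconds and that log entries are sorted by time; the function names, the threshold, and the sample data are all hypothetical.

```python
from bisect import bisect_left, bisect_right


def find_spikes(samples, threshold):
    """Return timestamps of samples whose value exceeds the threshold."""
    return [ts for ts, value in samples if value > threshold]


def logs_near_spikes(spike_times, log_entries, window_seconds=60):
    """Map each spike to the log lines within +/- window_seconds of it.

    log_entries must be (timestamp, line) pairs sorted by timestamp.
    """
    times = [ts for ts, _ in log_entries]
    nearby = {}
    for spike in spike_times:
        lo = bisect_left(times, spike - window_seconds)
        hi = bisect_right(times, spike + window_seconds)
        nearby[spike] = [line for _, line in log_entries[lo:hi]]
    return nearby


# Made-up data: memory ratio samples and log lines, both in Unix seconds.
memory_samples = [(1000, 0.60), (1060, 0.93), (1120, 0.70)]
log_lines = [(990, "INFO request ok"), (1055, "ERROR MemoryError"), (1300, "INFO idle")]

spikes = find_spikes(memory_samples, threshold=0.85)
print(logs_near_spikes(spikes, log_lines, window_seconds=30))
```

Loki returns nanosecond timestamps, so divide by 1e9 before comparing them with Prometheus sample times.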

Step 5: Recommend Actions

  • Use platxa-k8s-ops for operational commands
  • Suggest resource adjustments
  • Link to relevant troubleshooting guides

Examples

Example 1: Memory Investigation

User: "Instance abc123xy is running slow"

Response:

Check memory usage:

promql
instance:memory_usage:ratio{namespace="instance-abc123xy"}
# Result: 0.92 (92% of limit)

Check for OOM events:

logql
{namespace="instance-abc123xy"} |~ "OOM|killed|memory"

Recent restarts:

promql
instance:restarts:1h{namespace="instance-abc123xy"}
# Result: 2 restarts

Finding: Memory pressure causing slowdown. Recommend increasing memory limit or reducing Odoo workers.

Example 2: Error Log Analysis

User: "Users report 500 errors on demo instance"

Response:

Search recent errors:

logql
{namespace="instance-demo", container="odoo"}
  |~ "ERROR|Exception|Traceback"
  | json
  | line_format "{{.message}}"

Count errors by type:

logql
# Assumes JSON-formatted logs with a "level" field; | json extracts it as a label
sum by (level) (
  count_over_time({namespace="instance-demo"} |~ "ERROR|WARNING" | json [1h])
)

Check database connectivity:

logql
{namespace="instance-demo"} |~ "could not connect|connection refused"

Finding: Database connection errors detected. Check PostgreSQL status with platxa-k8s-ops.

Example 3: Alert Investigation

User: "Got alert for OdooStorageCritical"

Response:

Check current storage usage:

promql
instance:storage_usage:ratio{namespace=~"instance-.*"} > 0.90

Identify which PVC is full:

promql
kubelet_volume_stats_used_bytes{namespace="instance-{name}"}
/
kubelet_volume_stats_capacity_bytes{namespace="instance-{name}"}

Results:

  • odoo-{name}-addons: 45% used
  • odoo-{name}-filestore: 97% used (CRITICAL)

Finding: Filestore PVC near capacity. Options:

  1. Clean up old attachments
  2. Expand PVC (if storage class supports)
  3. Upgrade instance tier for larger storage

Example 4: Cold Start Performance

User: "Instances taking too long to wake"

Response:

Check waking service metrics:

promql
# Current instance states
waking_instances_by_state

# Cold start timing, 95th percentile (if exposed)
histogram_quantile(0.95, sum by (le) (rate(waking_cold_start_duration_bucket[5m])))

Check waking service logs:

logql
{namespace="traefik-system", container="waking-service"}
  |~ "cold start|wake|scale"
  | json

Common causes:

  • Large filestore extraction time
  • Resource scheduling delays
  • Init container timeouts

Grafana Dashboards

Pre-built dashboards available:

| Dashboard | Purpose |
| --- | --- |
| Cluster Overview | Node and resource summary |
| Instances Overview | All instances at a glance |
| Instance Detail | Single instance deep dive |
| Postgres System | Database metrics |
| Edge Overview | Traefik and ingress |
| Scale-to-Zero | Wake/sleep patterns |
| Monitoring Health | Stack self-monitoring |
| Instance Status | Embeddable status widget |

Access: Grafana → Dashboards → Browse

Troubleshooting

No Metrics Data

| Symptom | Cause | Fix |
| --- | --- | --- |
| Target down | Pod not running | Check pod status |
| No ServiceMonitor | Missing CRD | Apply ServiceMonitor |
| Wrong labels | Selector mismatch | Check release: prometheus label |

No Logs in Loki

| Symptom | Cause | Fix |
| --- | --- | --- |
| Empty results | Wrong namespace | Verify label values |
| Missing logs | Fluent Bit down | Check DaemonSet |
| Delayed logs | Ingestion backlog | Check Loki metrics |

Alerts Not Firing

| Symptom | Cause | Fix |
| --- | --- | --- |
| No alerts | Rule not loaded | Check PrometheusRule CRD |
| Not routing | Wrong labels | Verify severity label |
| Not received | Webhook error | Check Alertmanager logs |

Output Checklist

After monitoring investigation:

  • Relevant PromQL/LogQL query constructed
  • Time range appropriate for issue
  • Results interpreted correctly
  • Metrics and logs correlated
  • Root cause identified or narrowed
  • Actionable recommendations provided
  • Follow-up monitoring suggested

Related Resources

  • PromQL Queries: See references/promql-queries.md
  • LogQL Queries: See references/logql-queries.md
  • Alert Rules: See references/alert-rules.md
  • K8s Operations: Use platxa-k8s-ops skill