Configure Log Aggregation
Implement centralized log collection, parsing, and querying with Loki/Promtail or ELK stack for operational visibility.
When to Use
- Consolidating logs from multiple services or hosts into a searchable system
- Replacing local log files with centralized, queryable log storage
- Correlating logs with metrics and traces for full observability
- Implementing structured logging with label extraction from unstructured logs
- Setting retention policies for log data based on storage and compliance needs
- Troubleshooting production incidents requiring log analysis across services
Inputs
- Required: Log sources (application logs, system logs, container logs)
- Required: Log format patterns (JSON, plaintext, syslog, etc.)
- Optional: Label extraction rules for structured querying
- Optional: Retention and compression policies
- Optional: Existing log shipper configuration (Fluentd, Filebeat, Promtail)
Procedure
Step 1: Choose Log Aggregation Stack
Select between Loki (Prometheus-style) or ELK (Elasticsearch-based) based on requirements.
Loki advantages:
- Lightweight, designed for Kubernetes and cloud-native environments
- Label-based indexing (like Prometheus) for low storage overhead
- Native integration with Grafana for unified dashboards
- Horizontal scalability with object storage (S3, GCS)
- Lower resource consumption compared to Elasticsearch
ELK advantages:
- Full-text search across all log content (not just labels)
- Rich query DSL and aggregations
- Mature ecosystem with Beats and Logstash plugins
- Better for compliance/audit logs requiring deep historical search
For this guide, we'll focus on Loki + Promtail (recommended for most modern setups).
Decision criteria:
Use Loki if:
- You want label-based queries similar to Prometheus
- Storage costs are a concern (Loki indexes only labels)
- You already use Grafana for metrics
- Kubernetes/container-native deployment
Use ELK if:
- You need full-text search across all log content
- You have complex log parsing and enrichment requirements
- You require advanced analytics and aggregations
- Legacy systems with existing Logstash pipelines
Expected: Clear choice made based on requirements, team downloads appropriate installation artifacts.
On failure:
- Benchmark storage requirements: Loki is often quoted at roughly 10x less storage than Elasticsearch for the same logs, but measure with a sample of your own data
- Evaluate query patterns: full-text search needs vs label filtering
- Consider operational overhead: ELK requires more tuning and resources
Step 2: Deploy Loki
Install and configure Loki with appropriate storage backend.
Docker Compose deployment (docker-compose.yml):
version: '3.8'
services:
  loki:
    image: grafana/loki:2.9.0
    container_name: loki # fixed name so `docker logs loki` works
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped
  promtail:
    image: grafana/promtail:2.9.0
    container_name: promtail # fixed name so `docker exec promtail ...` works
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      # Needed by docker_sd_configs in the Promtail config
      - /var/run/docker.sock:/var/run/docker.sock:ro
    command: -config.file=/etc/promtail/config.yml
    restart: unless-stopped
    depends_on:
      - loki
volumes:
  loki-data:
Loki configuration (loki-config.yml):
auth_enabled: false
server:
  http_listen_port: 3100
  grpc_listen_port: 9096
common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h # 1 week
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_query_length: 721h # 30 days + 1 hour
  max_query_parallelism: 32
  max_streams_per_user: 10000
  max_entries_limit_per_query: 5000
chunk_store_config:
  max_look_back_period: 720h # 30 days
table_manager:
  retention_deletes_enabled: true
  retention_period: 720h # 30 days
compactor:
  working_directory: /loki/compactor
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
ruler:
  alertmanager_url: http://alertmanager:9093
  storage:
    type: local
    local:
      directory: /loki/rules
  rule_path: /loki/rules-temp
  ring:
    kvstore:
      store: inmemory
For production with S3 storage:
# Also set object_store: s3 in the matching schema_config entry
storage_config:
  aws:
    s3: s3://us-east-1/my-loki-bucket
    s3forcepathstyle: true
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
    shared_store: s3
Expected: Loki starts successfully, health check passes at http://localhost:3100/ready, logs stored according to retention policy.
On failure:
- Check Loki logs: docker logs loki
- Verify storage directories exist and are writable
- Test config syntax (mount your own file, otherwise the image's built-in config is checked): docker run --rm -v "$(pwd)/loki-config.yml:/etc/loki/local-config.yaml" grafana/loki:2.9.0 -config.file=/etc/loki/local-config.yaml -verify-config
- Ensure retention settings don't exceed disk capacity
- For S3: verify IAM permissions and bucket access
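The readiness check can also be scripted, for example as a smoke test after deployment. The following is a minimal sketch, assuming Loki is reachable at the URL used in this guide; `http_probe` and `wait_until_ready` are hypothetical helper names, not part of Loki or any library.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status such as 503.
        return False


def wait_until_ready(probe, attempts: int = 30, delay: float = 2.0) -> bool:
    """Call `probe()` until it returns True or `attempts` are used up."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Example call (performs real HTTP requests, so it is left commented out):
# wait_until_ready(lambda: http_probe("http://localhost:3100/ready"))
```

Passing the probe as a callable keeps the retry loop independent of the network, which makes it easy to exercise with a fake probe.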
Step 3: Configure Promtail for Log Shipping
Set up Promtail to scrape logs and forward to Loki with label extraction.
Promtail configuration (promtail-config.yml):
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
    batchwait: 1s
    batchsize: 1048576 # 1MB
    backoff_config:
      min_period: 500ms
      max_period: 5m
      max_retries: 10
    timeout: 10s
scrape_configs:
  # System logs
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          # Expanded only if Promtail runs with -config.expand-env=true
          host: ${HOSTNAME}
          __path__: /var/log/*.log
  # Docker container logs
  - job_name: containers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - source_labels: [__meta_docker_container_name]
        regex: '/(.*)'
        target_label: container
      - source_labels: [__meta_docker_container_log_stream]
        target_label: stream
    pipeline_stages:
      # Parse JSON logs from containers
      - json:
          expressions:
            level: level
            message: message
            timestamp: timestamp
            trace_id: trace_id
      # Extract log level to label
      # (trace_id is high-cardinality; see Common Pitfalls before keeping it as a label)
      - labels:
          level:
          trace_id:
      # Parse timestamp
      - timestamp:
          source: timestamp
          format: RFC3339Nano
      # Replace the stored log line with the message field
      - output:
          source: message
  # Application logs with regex parsing
  - job_name: app-logs
    static_configs:
      - targets:
          - localhost
        labels:
          job: app
          env: production
          __path__: /var/log/app/*.log
    pipeline_stages:
      # Parse log line: "2024-01-15 10:30:45 [ERROR] user_service: Failed to authenticate user 12345"
      - regex:
          expression: '^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<service>\w+): (?P<message>.*)$'
      - labels:
          level:
          service:
      - timestamp:
          source: timestamp
          format: '2006-01-02 15:04:05'
      - output:
          source: message
  # Nginx access logs
  - job_name: nginx
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx
          __path__: /var/log/nginx/access.log
    pipeline_stages:
      - regex:
          expression: '^(?P<remote_addr>[\w\.]+) - (?P<remote_user>[\w-]+) \[(?P<timestamp>.*)\] "(?P<method>\w+) (?P<path>[^\s]+) (?P<protocol>[^"]+)" (?P<status>\d+) (?P<body_bytes_sent>\d+) "(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"$'
      - labels:
          method:
          status:
      - timestamp:
          source: timestamp
          format: '02/Jan/2006:15:04:05 -0700'
  # Kubernetes pod logs
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      - replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_uid
          - __meta_kubernetes_pod_container_name
        target_label: __path__
Key Promtail concepts:
- Scrape configs: Define log sources and how to discover them
- Pipeline stages: Transform and label logs before sending to Loki
- Relabel configs: Dynamic labeling based on metadata
- Positions file: Tracks read offsets to avoid re-processing logs
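Whatever the scrape configs and pipeline stages produce ends up as labelled streams sent to the push endpoint configured under clients. As a rough illustration of that payload shape (not of Promtail's internals), the sketch below builds a JSON body for POST /loki/api/v1/push; `build_push_payload` is a hypothetical helper name and the labels are example values.

```python
import json
import time


def build_push_payload(labels: dict, lines: list, now_ns: int = None) -> str:
    """Build a JSON body for Loki's push endpoint.

    A stream is identified by its label set; every value is a pair of
    [timestamp in nanoseconds as a string, log line].
    """
    if now_ns is None:
        now_ns = time.time_ns()
    # Offset each line by 1 ns so entries keep their original order.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})


body = build_push_payload(
    {"job": "app", "level": "error"},
    ["Failed to authenticate user 12345"],
    now_ns=1705314645000000000,
)
```

Sending such a body by hand (with Content-Type: application/json) can help separate Loki-side problems from Promtail-side problems.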
Expected: Promtail scrapes configured log files, labels applied correctly, logs visible in Loki via LogQL queries.
On failure:
- Check Promtail logs: docker logs promtail
- Verify file paths are accessible: docker exec promtail ls /var/log
- Test regex patterns independently with sample log lines
- Monitor Promtail metrics (requires port 9080 to be published): curl http://localhost:9080/metrics | grep promtail
- Check positions file for progress: docker exec promtail cat /tmp/positions.yaml
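One way to test a regex independently is to run it against the sample line from the app-logs job. The sketch below does this in Python, assuming the expression behaves the same there as in Promtail; Promtail uses Go's RE2 engine, so treat a passing result as a sanity check only. `parse_app_line` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Same expression as the regex stage of the app-logs job.
APP_LOG = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>\w+)\] (?P<service>\w+): (?P<message>.*)$"
)


def parse_app_line(line: str):
    """Return the named groups as a dict, or None if the line does not match."""
    match = APP_LOG.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The Go layout '2006-01-02 15:04:05' corresponds to this strptime format.
    fields["parsed_time"] = datetime.strptime(
        fields["timestamp"], "%Y-%m-%d %H:%M:%S"
    )
    return fields


sample = "2024-01-15 10:30:45 [ERROR] user_service: Failed to authenticate user 12345"
fields = parse_app_line(sample)
```

A line that returns None here would pass through the regex stage without any extracted fields, so the labels and timestamp stages after it would have nothing to work with.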
Step 4: Query Logs with LogQL
Learn LogQL syntax for filtering and aggregating logs.
Basic queries:
# All logs from a job
{job="app"}
# Logs with specific label values
{job="app", level="error"}
# Regex filter on log line content
{job="app"} |~ "authentication failed"
# Case-insensitive regex
{job="app"} |~ "(?i)error"
# Line filter (doesn't parse, just includes/excludes)
{job="app"} |= "user" # Contains "user"
{job="app"} != "debug" # Doesn't contain "debug"
Parsing and filtering:
# JSON parsing
{job="app"} | json | level="error"
# Regex parsing with named groups
{job="app"} | regexp "user_id=(?P<user_id>\\d+)" | user_id="12345"
# Logfmt parsing (key=value format)
{job="app"} | logfmt | level="error", service="auth"
# Pattern parsing
{job="nginx"} | pattern `<ip> - <user> [<timestamp>] "<method> <path> <protocol>" <status> <size>` | status >= 500
Aggregations (metrics from logs):
# Count log lines per level
sum by (level) (count_over_time({job="app"}[5m]))
# Rate of error logs
rate({job="app", level="error"}[5m])
# Bytes processed per service
sum by (service) (bytes_over_time({job="app"}[1h]))
# Average request duration from logs
avg_over_time({job="app"} | json | unwrap duration [5m])
# Top 10 error messages (message must be an extracted label, hence the json parser)
topk(10, sum by (message) (count_over_time({level="error"} | json [1h])))
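To make the semantics of the first aggregation concrete, the sketch below mimics what sum by (level) (count_over_time(...[5m])) computes at a single evaluation time. It only illustrates the idea, with simplified window boundaries and made-up entries; it is not how Loki evaluates queries. `count_over_time_by` is a hypothetical helper name.

```python
from collections import Counter


def count_over_time_by(entries, label, end_ts, range_seconds):
    """Count entries per value of `label` with end_ts - range < ts <= end_ts.

    `entries` is a list of (unix_timestamp_seconds, labels_dict) pairs.
    """
    start_ts = end_ts - range_seconds
    counts = Counter()
    for ts, labels in entries:
        if start_ts < ts <= end_ts:
            counts[labels.get(label, "")] += 1
    return dict(counts)


# Made-up entries: (timestamp in seconds, stream labels)
entries = [
    (10, {"job": "app", "level": "error"}),   # outside the window
    (350, {"job": "app", "level": "info"}),
    (390, {"job": "app", "level": "error"}),
]
per_level = count_over_time_by(entries, "level", end_ts=400, range_seconds=300)
```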
Filtering by extracted fields:
# Find specific trace in logs
{job="app"} | json | trace_id="abc123def456"
# HTTP 5xx errors from nginx
{job="nginx"} | pattern `<_> "<_> <_> <_>" <status> <_>` | status >= 500
# Failed authentication attempts
{job="app"} | json | message=~"authentication failed" | user_id != ""
Create Grafana explore queries or dashboard panels using these patterns.
Expected: Queries return expected log lines, filtering works correctly, aggregations produce metrics from logs.
On failure:
- Use Grafana Explore to debug queries interactively
- Check label names: curl http://localhost:3100/loki/api/v1/labels
- Verify label values: curl http://localhost:3100/loki/api/v1/label/{label_name}/values
- Simplify query: start with basic label selector, add filters incrementally
- Check time range: logs might not exist in selected window
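Queries can also be checked outside Grafana through Loki's HTTP API. The sketch below only builds a URL for the /loki/api/v1/query_range endpoint, so the LogQL expression and time range can be inspected before anything is sent; `build_query_range_url` is a hypothetical helper name and the base URL is the one assumed throughout this guide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_range_url(base_url, logql, start_ns, end_ns, limit=100):
    """Return a query_range URL with the LogQL expression URL-encoded."""
    params = {
        "query": logql,
        "start": str(start_ns),  # nanosecond Unix timestamps
        "end": str(end_ns),
        "limit": str(limit),
        "direction": "backward",  # newest entries first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


url = build_query_range_url(
    "http://localhost:3100",
    '{job="app"} |= "user"',
    start_ns=1705312800000000000,
    end_ns=1705316400000000000,
)
```

Encoding the expression with urlencode avoids the quoting problems that come up when braces and double quotes are pasted straight into a shell command.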
Step 5: Integrate Logs with Metrics and Traces
Correlate logs with Prometheus metrics and distributed traces for unified observability.
Add trace IDs to logs (application instrumentation):
# Python with OpenTelemetry
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def handle_request():
    # Read the active span and render its trace ID as 32 hex characters
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id
    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, "032x")}
    )

// Go with OpenTelemetry
import (
    "context"

    "go.opentelemetry.io/otel/trace"
    "go.uber.org/zap"
)

// logger is assumed to be initialized at startup, e.g. with zap.NewProduction()
var logger *zap.Logger

func handleRequest(ctx context.Context) {
    // Read the active span from the context; String() yields 32 hex characters
    span := trace.SpanFromContext(ctx)
    traceID := span.SpanContext().TraceID().String()
    logger.Info("Processing request",
        zap.String("trace_id", traceID),
    )
}
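In the Python example, trace_id is passed through extra, which the default logging formatter does not print. One possible way to get it into the log line is a small JSON formatter whose field names match the json stage of the Promtail pipeline (level, message, timestamp, trace_id). This is a sketch using only the standard library; `JsonFormatter` and `make_logger` are hypothetical names.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # ISO 8601 with UTC offset, e.g. 2024-01-15T10:30:45.123456+00:00
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Keys passed via `extra=` become attributes on the record.
        trace_id = getattr(record, "trace_id", None)
        if trace_id is not None:
            payload["trace_id"] = trace_id
        return json.dumps(payload)


def make_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger
```

The level is lowercased here so that selectors such as level="error" match; adjust this if your queries expect uppercase values.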
Configure Grafana data links from metrics to logs:
In Prometheus panel field config:
{
  "fieldConfig": {
    "defaults": {
      "links": [
        {
          "title": "View Logs",
          "url": "/explore?left={\"datasource\":\"Loki\",\"queries\":[{\"refId\":\"A\",\"expr\":\"{job=\\\"app\\\",instance=\\\"${__field.labels.instance}\\\"} |= `${__field.labels.trace_id}`\"}],\"range\":{\"from\":\"${__from}\",\"to\":\"${__to}\"}}",
          "targetBlank": false
        }
      ]
    }
  }
}
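Hand-escaping the JSON inside such a URL is error-prone. The sketch below builds the same kind of Explore link programmatically and URL-encodes it. It assumes the left parameter has the shape shown in the data link above, which may differ between Grafana versions; `build_explore_url` is a hypothetical helper name.

```python
import json
from urllib.parse import quote, unquote


def build_explore_url(datasource, expr, time_from, time_to):
    """Return an /explore URL with the query state encoded in `left`."""
    left = {
        "datasource": datasource,
        "queries": [{"refId": "A", "expr": expr}],
        "range": {"from": time_from, "to": time_to},
    }
    # Compact separators keep the URL short; quote() escapes braces,
    # double quotes, backticks, and spaces.
    encoded = quote(json.dumps(left, separators=(",", ":")), safe="")
    return "/explore?left=" + encoded


explore_url = build_explore_url(
    "Loki", '{job="app"} |= `abc123def456`', "now-1h", "now"
)
```

Decoding the parameter again with unquote and json.loads is a quick way to confirm that the expression survived the encoding unchanged.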
Configure Grafana data links from logs to traces:
In Loki datasource config:
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # UID of the Tempo datasource (assumed here to be "tempo")
        - datasourceUid: tempo
          matcherRegex: "trace_id=(\\w+)"
          name: TraceID
          url: "$${__value.raw}"
Correlate logs in Grafana Explore:
1. Query metrics in Prometheus
2. Click on a data point
3. Select "View Logs" from the context menu
4. The Loki query is auto-populated with relevant labels and time range
5. Click the trace ID in the logs
6. The Tempo trace view opens with the full distributed trace
Expected: Clicking metrics opens related logs, trace IDs in logs link to trace viewer, single pane for metrics/logs/traces navigation.
On failure:
- Verify that the trace ID format in the log lines matches the regex in derived fields
- Check that trace_id is extracted by the Promtail pipeline
- Ensure the Tempo datasource is configured in Grafana
- Test URL encoding for complex filter expressions
- Validate data link URLs in an incognito/private browser window
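The first check can be done offline by running the derived-field pattern against sample log lines. The sketch below uses Python's re module as a stand-in, assuming the pattern behaves the same in Grafana's regex engine; the sample lines and the JSON-oriented alternative pattern are made up for illustration.

```python
import re

# Same pattern as matcherRegex in the Loki datasource config.
MATCHER = re.compile(r"trace_id=(\w+)")
# Hypothetical alternative for JSON-formatted lines.
JSON_MATCHER = re.compile(r'"trace_id":\s*"(\w+)"')


def extract_trace_id(line: str, pattern=MATCHER):
    """Return the first captured trace ID, or None if the pattern does not match."""
    match = pattern.search(line)
    return match.group(1) if match else None


logfmt_line = 'level=info trace_id=4bf92f3577b34da6a3ce929d0e0e4736 msg="Processing request"'
json_line = '{"level":"info","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736"}'

from_logfmt = extract_trace_id(logfmt_line)
from_json = extract_trace_id(json_line)                       # key=value pattern misses JSON
from_json_alt = extract_trace_id(json_line, JSON_MATCHER)
```

The key=value pattern only matches logfmt-style lines, so if the stored lines are JSON, the derived field needs a pattern along the lines of the alternative shown.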
Step 6: Set Up Log Retention and Compaction
Configure retention policies and compaction to manage storage costs.
Retention by tenant (in Loki config):
limits_config:
  retention_period: 720h # Global default: 30 days
  # Per-tenant retention (requires multi-tenancy enabled)
  per_tenant_override_config: /etc/loki/overrides.yaml
# overrides.yaml
overrides:
  production:
    retention_period: 2160h # 90 days for production
  staging:
    retention_period: 360h # 15 days for staging
  development:
    retention_period: 168h # 7 days for dev
Retention by stream labels (requires compactor):
compactor:
  working_directory: /loki/compactor
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
limits_config:
  retention_stream:
    - selector: '{job="debug"}'
      priority: 1
      period: 24h # Keep debug logs for 1 day
    - selector: '{level="error"}'
      priority: 2
      period: 2160h # Keep error logs for 90 days
    - selector: '{env="production"}'
      priority: 3
      period: 720h # Keep production logs for 30 days
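Priority determines which rule applies when multiple selectors match the same stream: the rule with the highest priority value wins. With the values above, a production stream with level="error" is therefore kept for 30 days, not 90; swap the priorities if error logs should always be kept longer.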
Caching and storage settings:
chunk_store_config:
  chunk_cache_config:
    enable_fifocache: true
    fifocache:
      max_size_bytes: 1GB
      ttl: 24h
storage_config:
  # How long cached index entries are considered valid
  index_cache_validity: 1h
  # For S3 storage
  aws:
    s3: s3://bucket/prefix
    s3forcepathstyle: true
    # Keep TLS (HTTPS) enabled for traffic to S3
    insecure: false
# Enable query result caching
query_range:
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_bytes: 500MB
        ttl: 1h
Monitor retention:
# Check chunk stats
curl http://localhost:3100/loki/api/v1/status/chunks | jq
# Check compactor metrics
curl http://localhost:3100/metrics | grep loki_compactor
# Verify deleted chunks
curl http://localhost:3100/metrics | grep loki_boltdb_shipper_retention_deleted
Expected: Old logs automatically deleted per retention policy, storage usage stabilizes, compaction reduces index size.
On failure:
- Enable the compactor in the Loki config if retention is not working
- Check compactor logs (Loki logs to stderr, so redirect it before filtering): docker logs loki 2>&1 | grep compactor
- Verify retention_enabled: true and retention_deletes_enabled: true
- Monitor disk usage inside the Loki container: du -sh /loki/
- For S3: check that bucket lifecycle policies don't conflict with Loki retention
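Before settling on a retention period, a rough capacity estimate helps confirm that it fits the available disk or bucket budget. The sketch below multiplies daily ingest by retention and divides by a compression ratio; every figure in the example is hypothetical and should be replaced with values measured on your own system. `estimate_storage_gib` is a hypothetical helper name.

```python
def estimate_storage_gib(daily_ingest_gib, retention_hours, compression_ratio):
    """Estimate steady-state storage for a given retention period.

    daily_ingest_gib:   uncompressed log volume per day
    retention_hours:    retention period as configured in Loki (e.g. 720)
    compression_ratio:  uncompressed size divided by stored size
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    retention_days = retention_hours / 24
    return daily_ingest_gib * retention_days / compression_ratio


# Hypothetical workload: 20 GiB/day, 30 days retention, 5x compression.
needed = estimate_storage_gib(20, 720, 5)
```

The estimate ignores index size and the retention_delete_delay, so leave headroom on top of the result.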
Validation
- Loki API health check returns 200: curl http://localhost:3100/ready
- Promtail successfully scraping logs from all configured sources
- Labels extracted correctly from log lines (visible in Grafana Explore)
- LogQL queries return expected results with proper filtering
- Log retention policy enforced (old logs deleted after retention period)
- Logs accessible from Grafana dashboards and Explore view
- Trace IDs from logs link to Tempo trace viewer
- Metrics panels have data links to relevant logs
- Compaction running and reducing storage overhead
- Storage usage within allocated disk/S3 budget
Common Pitfalls
- High cardinality labels: Using unbounded label values (user IDs, request IDs) causes index explosion. Use fixed labels (level, service, env) and put variables in log lines.
- Missing log parsing: Sending raw logs without label extraction limits query capabilities. Always parse structured logs (JSON, logfmt) or use regex for unstructured logs.
- Incorrect time parsing: Mismatched timestamp formats cause logs to be out of order or rejected. Test timestamp parsing with sample logs.
- Retention not working: Compactor must be enabled for retention to delete old data. Check retention_enabled: true and retention_deletes_enabled: true.
- Ingestion rate limits: The limits set in this guide (10 MB/s, 20 MB burst) may be too low for high-volume systems. Adjust ingestion_rate_mb and ingestion_burst_size_mb.
- Query timeouts: Broad queries over long time ranges can time out. Use more specific label selectors and shorter time windows.
- Log duplication: Multiple Promtail instances scraping the same logs create duplicates. Use unique labels or positions file coordination.
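The high-cardinality pitfall is easy to quantify: in the worst case, the number of streams is the product of the number of distinct values per label. The sketch below compares a bounded label set with one that adds a user ID label; the cardinalities are made up, and the limit is the max_streams_per_user value from the Loki config in Step 2. `estimated_streams` is a hypothetical helper name.

```python
from math import prod

MAX_STREAMS_PER_USER = 10000  # from limits_config in the Loki config


def estimated_streams(label_cardinalities: dict) -> int:
    """Worst-case stream count: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Made-up cardinalities for illustration.
bounded = {"level": 5, "service": 20, "env": 3}
unbounded = {**bounded, "user_id": 50_000}

bounded_streams = estimated_streams(bounded)
unbounded_streams = estimated_streams(unbounded)
over_limit = unbounded_streams > MAX_STREAMS_PER_USER
```

Real systems rarely hit the full product because not every combination occurs, but the order of magnitude shows why identifiers belong in the log line rather than in labels.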
Related Skills
- correlate-observability-signals - Unified debugging across metrics, logs, and traces using trace IDs
- build-grafana-dashboards - Visualize log-derived metrics and create log panels in dashboards
- setup-prometheus-monitoring - Metrics provide context for when to query logs during incidents
- instrument-distributed-tracing - Add trace IDs to logs for correlation with distributed traces