Collector Troubleshooting
Systematic approach to debugging Sawmills collector issues.
Prerequisites
- •aws-sso-login skill: AWS authentication
- •customer-config skill: View configs
- •remote-operator-debug skill: Run commands on customer collectors
Workflow
Step 1: Gather Context
Ask for:
- •Org ID (clerk org)
- •Environment (prod/staging)
- •Error description or symptoms
- •Collector/deployment name if known
Step 2: Check Collector Status
bash
remote-operator -a <ro_address> -o <org_id> manage ls
Look for:
- •Pod count (expected vs actual)
- •Container status (3/3 Running)
- •Recent restarts
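The three checks above can be sketched as a small filter. This assumes you fetch the pod table with `kubectl get pods -n sawmills` through `manage run` and that the output reaches your terminal unmodified; the function name `flag_unhealthy_pods` is hypothetical, and the output format of `manage ls` itself is not assumed here.

```bash
#!/usr/bin/env bash
# Flag pods that are not fully ready, not Running, or have restarted.
# Reads the default `kubectl get pods` table (NAME READY STATUS RESTARTS AGE) on stdin.
flag_unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")   # READY column, e.g. "2/3"
    if (ready[1] != ready[2] || $3 != "Running" || $4 + 0 > 0)
      print $1 ": ready=" $2 " status=" $3 " restarts=" $4
  }'
}

# Usage (assumed invocation, placeholders as in the steps above):
#   remote-operator -a <ro_address> -o <org_id> manage run \
#     -d <deployment> --instance-name <instance> \
#     -- kubectl get pods -n sawmills | flag_unhealthy_pods
```

No output means every listed pod is fully ready, Running, and has zero restarts.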
Step 3: Check Logs for Errors
bash
# Main collector logs
remote-operator -a <ro_address> -o <org_id> manage run \
  -d <deployment> --instance-name <instance> \
  -- kubectl logs -n sawmills deployment/sawmills-collector -c main-collector --tail 200

# HAProxy logs (connection issues)
remote-operator -a <ro_address> -o <org_id> manage run \
  -d <deployment> --instance-name <instance> \
  -- kubectl logs -n sawmills deployment/sawmills-collector -c haproxy --tail 100
Step 4: Identify Error Category
| Error Pattern | Likely Cause | Next Step |
|---|---|---|
| Unauthenticated | Wrong endpoint (staging vs prod) | Check config endpoints |
| route not defined | Missing routereceiver config | Check pipeline config |
| ResourceExhausted | gRPC message too large | Server-side limit increase |
| connection refused | Service down or wrong port | Check endpoint URLs |
| deadline exceeded | Network/timeout issues | Check connectivity |
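As a rough aid, the table above can be turned into a log filter. This is a minimal sketch: the function name `classify_collector_errors` is hypothetical, matching is a case-insensitive substring search, and the patterns are copied from the table rather than from any authoritative list of collector errors.

```bash
#!/usr/bin/env bash
# Map the known error patterns to their next step.
# Reads collector log text on stdin.
classify_collector_errors() {
  local logs pattern next_step count found=0
  logs="$(cat)"
  while IFS='|' read -r pattern next_step; do
    # Count log lines containing the pattern (fixed string, case-insensitive).
    count="$(grep -ciF -- "$pattern" <<<"$logs")"
    if [ "$count" -gt 0 ]; then
      printf '%s (%s matching lines): %s\n' "$pattern" "$count" "$next_step"
      found=1
    fi
  done <<'EOF'
Unauthenticated|Check config endpoints
route not defined|Check pipeline config
ResourceExhausted|Server-side limit increase
connection refused|Check endpoint URLs
deadline exceeded|Check connectivity
EOF
  if [ "$found" -eq 0 ]; then
    echo "No known pattern matched: consult team, attach logs"
  fi
}
```

Pipe the output of the Step 3 log commands into it. More than one category can match, in which case work through them in table order.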
Step 5: Check Config Endpoints
Use the s3-config-editor skill to decrypt and inspect the config.
Key endpoints to verify (prod):
yaml
# Correct prod endpoints
livetail: livetail-ingest.ue1.prod.plat.sm-svc.com:443
telemetry_bucket: sawmills-plat-ue1-prod-telemetry-data
gateway: https://ingest.sawmills.ai:443
prometheus: https://ingress.sawmills.ai/api/v1/push
Common mistake: staging endpoints in prod config.
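A quick way to catch this mistake is to search the decrypted config for staging references. The sketch below assumes the config has already been decrypted to a local file and that staging endpoints contain the literal string "staging"; both the function name `check_no_staging` and that naming assumption are hypothetical, so adjust the pattern to the real staging hostnames.

```bash
#!/usr/bin/env bash
# Fail if a decrypted prod config mentions "staging" anywhere.
# Prints the offending lines with line numbers.
check_no_staging() {
  local config_file="$1"
  if grep -n -i 'staging' "$config_file"; then
    echo "FAIL: staging reference(s) found in $config_file" >&2
    return 1
  fi
  echo "OK: no staging references in $config_file"
}

# Usage (hypothetical path):
#   check_no_staging /tmp/customer-config.yaml
```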
Step 6: Compare with Working Customer
Use the config-diff skill to compare against a known working customer (e.g., BigID).
bash
# Decrypt both configs and diff
diff <(go run ./cmd/encrypt -decrypt -s3 "<customer_s3>") \
<(go run ./cmd/encrypt -decrypt -s3 "<reference_s3>")
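If the full diff is noisy, one option is to save both decrypted configs under /tmp and print only the changed lines. This is a sketch under assumptions: the file names are hypothetical, and `changed_lines` is an illustrative helper, not part of the config-diff skill.

```bash
#!/usr/bin/env bash
# Print only added/removed lines between two decrypted configs.
changed_lines() {
  local out
  # Drop the two-line unified-diff header, then keep +/- lines only.
  out="$(diff -u "$1" "$2" | tail -n +3 | grep -E '^[+-]')"
  if [ -z "$out" ]; then
    echo "no differences"
  else
    printf '%s\n' "$out"
  fi
}

# Usage (hypothetical file names; decrypt commands as in the step above):
#   go run ./cmd/encrypt -decrypt -s3 "<customer_s3>"  > /tmp/customer.yaml
#   go run ./cmd/encrypt -decrypt -s3 "<reference_s3>" > /tmp/reference.yaml
#   changed_lines /tmp/customer.yaml /tmp/reference.yaml
```

Lines prefixed with `-` exist only in the first file and lines prefixed with `+` only in the second. Expect legitimate per-customer differences such as API keys.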
Step 7: Check Helm Values
bash
remote-operator -a <ro_address> -o <org_id> manage run \
  -d <deployment> --instance-name <instance> \
  -- helm get values sawmills-collector -n sawmills
Verify:
- •collector_gateway.endpoint
- •prometheusremotewrite.endpoint
- •API keys present
Step 8: Fix or Escalate
| Issue Type | Action |
|---|---|
| Config endpoint wrong | Use s3-config-editor to fix |
| Helm values wrong | Use remote-operator to update |
| Server-side issue | Create Linear ticket |
| Unknown | Consult team, attach logs |
Common Issues Reference
Staging vs Prod Mixup
Symptoms: Unauthenticated errors
Fix: Replace every staging endpoint in the config with its prod equivalent
Route Not Defined
Symptoms: route not defined errors from logstometricsprocessor
Cause: The pipeline config is missing the metrics route
Fix: Regenerate the pipeline config (pipelines-service)
gRPC Message Too Large
Symptoms: ResourceExhausted: message larger than max
Cause: A batch exceeds the 4 MB gRPC message size limit
Fix: Increase the server-side limit or reduce the batch size
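To judge how far over the limit the batches are, the sizes can be pulled from the log line. This sketch assumes the error follows the grpc-go wording `received message larger than max (<size> vs. <limit>)`, where 4194304 bytes is 4 × 1024 × 1024; if the collector logs a different format, the pattern needs adjusting. The function name `grpc_oversize_report` is hypothetical.

```bash
#!/usr/bin/env bash
# Extract "<size> vs. <limit>" from ResourceExhausted log lines on stdin
# and report by how many bytes each message exceeded the limit.
grpc_oversize_report() {
  sed -n 's/.*larger than max (\([0-9]*\) vs\. \([0-9]*\)).*/\1 \2/p' |
  while read -r got max; do
    echo "message=${got}B limit=${max}B over_by=$(( got - max ))B"
  done
}
```

A small overage suggests a modest batch size reduction may be enough; a message several times the limit points toward the server-side increase.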
Notes
- •Always check whether a Linear issue already exists before creating a new one
- •Save decrypted configs to /tmp/ for comparison
- •Collector pods should be 3/3 Running with recent start time after deploy