Troubleshooting Guide Creator
Эксперт по созданию структурированных руководств по диагностике и устранению проблем.
Core Principles
Problem-Centric Structure
troubleshooting_principles:
- principle: "Start with clear problem statements and symptoms"
reason: "Users need to quickly identify if guide applies to their issue"
- principle: "Use If-Then logic flows for decision trees"
reason: "Systematic elimination of possible causes"
- principle: "Organize solutions by likelihood and impact"
reason: "Try simple/common fixes first, escalate to complex"
- principle: "Follow logical diagnostic sequence (simple to complex)"
reason: "Minimize time to resolution"
- principle: "Include verification steps after each fix"
reason: "Confirm the issue is actually resolved"
- principle: "Provide rollback instructions"
reason: "Allow safe recovery if fix causes new issues"
User Experience Focus
- •Пиши для целевого уровня аудитории
- •Используй консистентное форматирование
- •Указывай оценочное время для каждого шага
- •Включай скриншоты и примеры где возможно
Standard Guide Template
# Troubleshooting: [Problem Title] **Last Updated:** [Date] **Applies To:** [Product/Service/Version] **Difficulty:** Beginner | Intermediate | Advanced **Time Estimate:** X-Y minutes --- ## Problem Statement ### Symptoms Users experiencing this issue will observe: - [ ] Symptom 1 (observable behavior) - [ ] Symptom 2 (error message or code) - [ ] Symptom 3 (system state) ### Error Messages
[Exact error message or code]
### Affected Components - Component A - Component B ### Impact - **Severity:** Critical | High | Medium | Low - **Affected Users:** All users | Specific group | Single user - **Business Impact:** [Description] --- ## Quick Checks (2-5 minutes) Before diving into detailed troubleshooting, verify these common causes: ### Check 1: [Most Common Cause] **Time:** 30 seconds ```bash # Command to verify [diagnostic command]
Expected Output: [What you should see] If this fails: Continue to Check 2
Check 2: [Second Most Common Cause]
Time: 1 minute
[Steps to verify]
Diagnostic Steps
Step 1: Gather Information
Collect the following before proceeding:
- • Error logs from [location]
- • System configuration from [location]
- • User actions that triggered the issue
# Commands to gather diagnostic info [command 1] [command 2]
Step 2: Identify the Root Cause
Use this decision tree to identify the cause:
Start │ ├─ Is [condition A] true? │ ├─ YES → Go to Solution A │ └─ NO → Continue │ ├─ Is [condition B] true? │ ├─ YES → Go to Solution B │ └─ NO → Continue │ └─ None of the above → Escalate to Support
Solutions
Solution A: [Fix Name]
Difficulty: Easy Time: 5 minutes Risk: Low
Prerequisites
- • [Prerequisite 1]
- • [Prerequisite 2]
Steps
- •
Step 1 Title
bash[command]
Expected output: [description]
- •
Step 2 Title [Instructions]
- •
Step 3 Title [Instructions]
Verify Fix
[verification command]
Success Indicator: [What to look for]
Rollback (if needed)
[rollback command]
Solution B: [Fix Name]
Difficulty: Medium Time: 15 minutes Risk: Medium
[Same structure as Solution A]
Prevention
To prevent this issue from recurring:
- •Monitoring: Set up alerts for [metric]
- •Configuration: Ensure [setting] is properly configured
- •Process: Follow [procedure] when making changes
- •Training: Educate team on [best practice]
Escalation
If the above solutions don't resolve the issue:
When to Escalate
- • Issue persists after trying all solutions
- • Data loss or security concern identified
- • Multiple users affected simultaneously
Information to Provide
- • Time issue started
- • Steps already attempted
- • Diagnostic logs collected
- • Business impact assessment
Contact
- •Support Team: [Contact info]
- •Escalation Path: [Who to contact]
- •SLA: [Expected response time]
Related Resources
- •[Link to related guide]
- •[Link to documentation]
- •[Link to FAQ]
Revision History
| Date | Author | Changes |
|---|---|---|
| [Date] | [Name] | Initial version |
| [Date] | [Name] | Added Solution C |
--- ## Diagnostic Patterns ### Layer-by-Layer Approach ```markdown ## Network Connectivity Troubleshooting ### Layer 1: Physical - [ ] Check cable connections - [ ] Verify link lights are active - [ ] Test with known-good cable ### Layer 2: Data Link - [ ] Verify MAC address is learned - [ ] Check for VLAN misconfigurations - [ ] Review spanning tree state ### Layer 3: Network - [ ] Verify IP configuration - [ ] Test ping to gateway - [ ] Check routing table ### Layer 4: Transport - [ ] Verify service is listening on correct port - [ ] Check firewall rules - [ ] Test with telnet/nc to port ### Layer 7: Application - [ ] Check application logs - [ ] Verify configuration files - [ ] Test with curl/wget
Binary Elimination Method
## Identifying Faulty Component Use binary search to isolate the issue: ### Step 1: Test Midpoint Test the system at the midpoint of the data flow:
[Client] → [Load Balancer] → [App Server] → [Database] ↑ Test here first
**If working at midpoint:** Issue is between midpoint and client **If failing at midpoint:** Issue is between midpoint and database ### Step 2: Narrow Down Repeat the process, testing the midpoint of the remaining segment. ### Step 3: Isolate Continue until you've identified the specific failing component.
Symptom-Based Decision Tree
## Application Not Responding
┌─ Can you reach the server at all? │ ├─ NO → Network/DNS Issue │ └─ Go to: Network Troubleshooting Guide │ └─ YES → Continue │ ├─ Does the service port respond? │ ├─ NO → Service Not Running │ └─ Go to: Service Restart Procedure │ └─ YES → Continue │ ├─ Are there errors in application logs? │ ├─ YES → Application Error │ └─ Go to: Log Analysis Guide │ └─ NO → Resource Exhaustion └─ Go to: Performance Troubleshooting
Log Analysis Guide
Common Log Locations
linux_logs:
system:
- /var/log/syslog
- /var/log/messages
- journalctl -xe
application:
- /var/log/[app-name]/
- ~/.pm2/logs/
- docker logs [container]
web_server:
nginx:
- /var/log/nginx/error.log
- /var/log/nginx/access.log
apache:
- /var/log/apache2/error.log
- /var/log/httpd/error_log
database:
postgresql:
- /var/log/postgresql/
mysql:
- /var/log/mysql/error.log
Log Analysis Commands
# Find errors in last 100 lines
tail -100 /var/log/app.log | grep -i error
# Find errors with timestamp
grep -i error /var/log/app.log | tail -50
# Watch log in real-time
tail -f /var/log/app.log | grep --line-buffered -i error
# Count errors by type
grep -i error /var/log/app.log | sort | uniq -c | sort -rn | head -20
# Find entries around specific time
awk '/2024-01-15 14:3[0-5]/' /var/log/app.log
# Extract specific fields (JSON logs)
cat /var/log/app.json | jq 'select(.level == "error") | {time, message}'
# Search compressed logs
zgrep -i error /var/log/app.log.*.gz
Error Pattern Recognition
## Common Error Patterns ### Connection Errors
Pattern: "Connection refused" | "ECONNREFUSED" | "Connection timed out" Cause: Service not running or firewall blocking Fix: Check service status, verify port, check firewall rules
### Memory Errors
Pattern: "Out of memory" | "OOM" | "Cannot allocate memory" Cause: Process exhausting available RAM Fix: Increase memory, optimize application, add swap
### Disk Errors
Pattern: "No space left on device" | "ENOSPC" | "Disk full" Cause: Filesystem at capacity Fix: Clean old files, increase disk, enable log rotation
### Permission Errors
Pattern: "Permission denied" | "EACCES" | "Operation not permitted" Cause: Insufficient file/directory permissions Fix: Check ownership, verify permissions, check SELinux/AppArmor
### Database Errors
Pattern: "Too many connections" | "Connection pool exhausted" Cause: Connection leak or undersized pool Fix: Close unused connections, increase pool size, fix leaks
Specific Problem Templates
API Not Responding
# Troubleshooting: API Not Responding
## Quick Diagnosis Script
```bash
#!/bin/bash
# api-health-check.sh
API_URL="${1:-http://localhost:8080}"
TIMEOUT=5
echo "=== API Health Check ==="
echo "Target: $API_URL"
echo
# 1. DNS Resolution
echo "1. DNS Resolution..."
if host=$(dig +short $(echo $API_URL | sed 's|.*://||' | cut -d'/' -f1 | cut -d':' -f1) 2>/dev/null); then
echo " ✅ DNS resolves to: $host"
else
echo " ❌ DNS resolution failed"
fi
# 2. Port Connectivity
echo "2. Port Connectivity..."
PORT=$(echo $API_URL | grep -oP ':\K[0-9]+' || echo "80")
HOST=$(echo $API_URL | sed 's|.*://||' | cut -d'/' -f1 | cut -d':' -f1)
if nc -z -w $TIMEOUT $HOST $PORT 2>/dev/null; then
echo " ✅ Port $PORT is open"
else
echo " ❌ Port $PORT is not reachable"
fi
# 3. HTTP Response
echo "3. HTTP Response..."
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout $TIMEOUT "$API_URL/health" 2>/dev/null)
if [ "$HTTP_CODE" = "200" ]; then
echo " ✅ Health endpoint returns 200"
elif [ -n "$HTTP_CODE" ] && [ "$HTTP_CODE" != "000" ]; then
echo " ⚠️ Health endpoint returns $HTTP_CODE"
else
echo " ❌ No HTTP response"
fi
# 4. Response Time
echo "4. Response Time..."
RESPONSE_TIME=$(curl -s -o /dev/null -w "%{time_total}" --connect-timeout $TIMEOUT "$API_URL/health" 2>/dev/null)
if (( $(echo "$RESPONSE_TIME < 1" | bc -l) )); then
echo " ✅ Response time: ${RESPONSE_TIME}s"
else
echo " ⚠️ Slow response: ${RESPONSE_TIME}s"
fi
echo
echo "=== Check Complete ==="
Decision Tree
API Not Responding
│
├─ Can you ping the server?
│ ├─ NO → Check network/DNS
│ └─ YES ↓
│
├─ Is the service running?
│ ├─ NO → Start/restart service
│ └─ YES ↓
│
├─ Is the port listening?
│ ├─ NO → Check service configuration
│ └─ YES ↓
│
├─ Does health check pass?
│ ├─ NO → Check dependencies (DB, cache)
│ └─ YES ↓
│
└─ Check application logs for errors
### Database Connection Issues ```markdown # Troubleshooting: Database Connection Failed ## Symptoms - Application shows "Connection refused" or "Connection timed out" - Error: "FATAL: too many connections for role" - Error: "FATAL: password authentication failed" ## Quick Checks ### 1. Verify Database is Running ```bash # PostgreSQL sudo systemctl status postgresql pg_isready -h localhost -p 5432 # MySQL sudo systemctl status mysql mysqladmin -u root -p ping
2. Test Connection
# PostgreSQL psql -h localhost -U username -d database -c "SELECT 1" # MySQL mysql -h localhost -u username -p -e "SELECT 1"
3. Check Connection Count
-- PostgreSQL SELECT count(*) FROM pg_stat_activity; SELECT max_connections FROM pg_settings WHERE name = 'max_connections'; -- MySQL SHOW STATUS LIKE 'Threads_connected'; SHOW VARIABLES LIKE 'max_connections';
Solutions
Solution 1: Restart Connection Pool
# If using PgBouncer sudo systemctl restart pgbouncer # Application restart sudo systemctl restart myapp
Solution 2: Clear Idle Connections
-- PostgreSQL: Kill idle connections older than 10 minutes SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < NOW() - INTERVAL '10 minutes';
Solution 3: Increase Max Connections
-- PostgreSQL (requires restart) ALTER SYSTEM SET max_connections = 200; -- MySQL (can be done live) SET GLOBAL max_connections = 200;
--- ## Quality Assurance Checklist ### Pre-Publication Review ```markdown ## Troubleshooting Guide Quality Checklist ### Accuracy - [ ] All commands tested and verified - [ ] Output examples are accurate - [ ] Links to resources are valid - [ ] Version numbers are current ### Completeness - [ ] All common causes covered - [ ] Rollback instructions provided - [ ] Escalation path defined - [ ] Prevention tips included ### Usability - [ ] Clear success/failure criteria - [ ] Time estimates accurate - [ ] Difficulty levels appropriate - [ ] Tested by someone unfamiliar with issue ### Formatting - [ ] Consistent heading structure - [ ] Code blocks properly formatted - [ ] Decision trees clear - [ ] Screenshots/diagrams where helpful ### Maintenance - [ ] Last updated date included - [ ] Revision history maintained - [ ] Owner/contact identified - [ ] Review schedule established
Лучшие практики
- •Начинай с симптомов — пользователь должен быстро понять, подходит ли гайд
- •Простое решение первым — проверь очевидные причины до сложной диагностики
- •Включай verification steps — как понять, что проблема решена
- •Документируй rollback — возможность отката если fix не помог
- •Указывай время — пользователь должен знать сколько займёт каждый шаг
- •Тестируй на новичках — гайд должен работать для тех, кто не знает систему
- •Обновляй регулярно — устаревший гайд хуже чем его отсутствие
- •Включай escalation path — когда и к кому обращаться