Slack Incident Workflow
Coordinate network incident response through Slack using NetClaw's App Agent capabilities. This skill defines structured workflows for incident detection, triage, investigation, resolution, and post-incident review — all conducted through Slack threads and channels.
Slack OAuth Scopes Used
| Scope | Purpose |
|---|---|
assistant:write | Act as App Agent in incident threads |
chat:write | Post incident updates |
channels:join | Join incident channels |
channels:history | Read channel context for investigation |
groups:history | Access private incident channels |
groups:read | View private channel info |
pins:read | Reference pinned runbooks/procedures |
bookmarks:read | Access saved incident resources |
bookmarks:write | Save incident artifacts |
files:write | Attach logs, configs, diagrams |
reactions:write | Track incident status via reactions |
users:read | Identify on-call engineers |
users.profile:read | Check engineer availability |
dnd:read | Respect Do Not Disturb before paging |
Incident Lifecycle in Slack
Phase 1: Detection & Declaration
When a critical alert triggers (from slack-network-alerts skill or human report):
:rotating_light: *INCIDENT DECLARED — Network Outage* *Severity:* P1 — Service Impacting *Detected:* 2024-02-21 14:32 UTC *Reporter:* NetClaw (automated) / @engineer1 (manual) *Symptoms:* • R1 unreachable (ping 0%) • 47 downstream routes lost • 3 OSPF adjacencies down • BGP peer to ISP: IDLE *Impact:* • Site A has no WAN connectivity • Estimated affected users: ~200 *Incident Commander:* [awaiting claim — react with :raised_hand: to take IC] *ServiceNow:* [CR/INC pending] ━━━ *All investigation updates in this thread* ━━━
Phase 2: Triage & Assignment
When an engineer reacts with :raised_hand::
:busts_in_silhouette: *Incident Team Formed* *IC:* @engineer1 (claimed at 14:35 UTC) *NetClaw:* Automated investigation assistant *Triage Checklist:* :white_check_mark: Alert generated and posted :white_check_mark: Incident declared (P1) :white_large_square: IC assigned → :white_check_mark: @engineer1 :white_large_square: ServiceNow incident created :white_large_square: Upstream device checked :white_large_square: Blast radius confirmed :white_large_square: Customer communication sent _NetClaw beginning automated investigation..._
Phase 3: Automated Investigation
NetClaw runs diagnostics and posts results in the thread:
:mag: *Automated Investigation — Step 1/4*
_Checking upstream device R2 for connectivity to R1..._
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL \
"python3 -u $PYATS_MCP_SCRIPT" pyats_ping_from_network_device \
'{"device_name":"R2","command":"ping 10.1.1.1 repeat 10"}'
Post each step result:
:mag: *Investigation Results — Step 1/4* *Ping from R2 to R1 (10.1.1.1):* 0% success — R1 unreachable from upstream :mag: *Investigation Results — Step 2/4* *R2 interface Gi1 (toward R1):* up/up, 0 CRC errors, last input 4 min ago → Physical layer looks OK from R2 side :mag: *Investigation Results — Step 3/4* *R2 OSPF neighbors:* R1 missing from neighbor table (was FULL) → OSPF adjacency lost, DR election may be in progress :mag: *Investigation Results — Step 4/4* *R2 logs (last 30 min):*
14:31:47: %OSPF-5-ADJCHG: Nbr 1.1.1.1 on Gi1 from FULL to DOWN 14:31:48: %LINEPROTO-5-UPDOWN: Line protocol on Gi1, changed to down 14:32:01: %LINEPROTO-5-UPDOWN: Line protocol on Gi1, changed to up 14:32:15: %OSPF-5-ADJCHG: Nbr 1.1.1.1 on Gi1 from DOWN to INIT
*Analysis:* R2 saw Gi1 flap at 14:31. Line protocol came back up but OSPF hasn't re-converged. Likely physical issue on R1 side causing interface bounce.
Phase 4: Status Updates
Post periodic status updates:
:hourglass_flowing_sand: *Status Update — 14:50 UTC (18 min elapsed)* *Status:* Investigating *Finding:* R1 appears to have reloaded unexpectedly. R2 sees the link recover but R1 is not responding to OSPF hellos yet. Possible crash or power event. *Next Step:* Waiting for R1 to complete boot sequence. Checking console access. *ETA:* Unknown — dependent on R1 recovery _ServiceNow INC0012345 updated_
Phase 5: Resolution
:white_check_mark: *INCIDENT RESOLVED* *Duration:* 34 minutes (14:32 — 15:06 UTC) *Resolution:* R1 experienced a software crash (Traceback in logs). Device auto-reloaded and recovered. All OSPF adjacencies re-established. Full routing restored. *Post-Resolution Verification:* • R1 reachable: :white_check_mark: 100% ping success • OSPF neighbors: :white_check_mark: 3/3 FULL • BGP peer: :white_check_mark: Established • Route count: :white_check_mark: 47 routes (matches baseline) • Connectivity: :white_check_mark: 100% to all targets *Root Cause:* Software crash — Traceback found in logs indicating bug CSCxx12345. TAC case recommended. *ServiceNow:* INC0012345 resolved *GAIT:* Session abc123 closed
Phase 6: Post-Incident Review
:clipboard: *Post-Incident Review — Scheduled* *Incident:* Network Outage — R1 crash *Date:* 2024-02-22 10:00 UTC *Channel:* This thread *Review Artifacts (attached):* 1. :page_facing_up: Timeline of events 2. :page_facing_up: R1 show logging output 3. :page_facing_up: R1 show version (confirms reload reason) 4. :page_facing_up: GAIT audit trail (full session) 5. :page_facing_up: Pre/post health check comparison *Discussion Topics:* • Was detection fast enough? • Was automated investigation helpful? • What monitoring gaps exist? • Should R1 be upgraded to patched version? • Do we need redundant path for this link?
Escalation Matrix
:arrow_up: *Escalation Guide* │ Severity │ Notify │ Escalate After │ Channel │ │ P1 │ IC + Manager + NOC │ 15 min │ #incidents │ │ P2 │ IC + Team │ 30 min │ #netclaw-alerts │ │ P3 │ Assigned engineer │ 4 hours │ #netclaw-alerts │ │ P4 │ Queue only │ Next business │ #netclaw-general │
Before escalating, check DND status:
- •If engineer has DND active, escalate to next person in rotation
- •Never suppress P1 escalation for DND
Reaction-Based Status Tracking
| Reaction | Status | Meaning |
|---|---|---|
| :rotating_light: | Declared | Incident is active |
| :raised_hand: | Claimed | IC has taken ownership |
| :mag: | Investigating | Active investigation |
| :wrench: | Fixing | Fix being applied |
| :hourglass: | Waiting | Waiting on external (vendor, ISP) |
| :white_check_mark: | Resolved | Incident resolved |
| :bookmark: | PIR Scheduled | Post-incident review planned |
ServiceNow Integration
Create ServiceNow incident at Phase 1:
python3 $MCP_CALL "python3 -u $SERVICENOW_MCP_SCRIPT" create_incident \
'{"short_description":"P1 - R1 unreachable, WAN outage Site A","description":"R1 is unreachable. 47 routes lost, 3 OSPF adjacencies down. Impact: ~200 users at Site A without WAN connectivity.","urgency":"1","impact":"1","category":"Network"}'
Update ServiceNow as incident progresses and close on resolution.
GAIT Audit Trail
Record every phase in GAIT:
python3 $MCP_CALL "python3 -u $GAIT_MCP_SCRIPT" gait_record_turn \
'{"input":{"role":"assistant","content":"INCIDENT P1: R1 unreachable. Phase 1 declared, Phase 2 IC assigned @engineer1, Phase 3 automated investigation shows R1 crash, Phase 4 monitoring recovery, Phase 5 resolved after 34 min. INC0012345 closed.","artifacts":[]}}'