Device Health Check
Perform comprehensive health assessments on network devices using pyATS. This skill defines the systematic approach for evaluating device health across all critical dimensions.
When to Use
- Proactive daily/weekly health monitoring
- Pre-change and post-change validation
- Incident response: the first thing you run when alerted
- Capacity planning and trending
- Compliance checks for operational readiness
Health Check Procedure
Always run health checks in this exact order. Each section builds on the previous one.
Step 1: Device Identity & Uptime
Run show version to establish baseline identity.
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show version"}'
Extract and report:
- Hostname, model, serial number
- IOS-XE version and image filename
- Uptime (flag if < 24 hours, which indicates a recent reload)
- Last reload reason (flag if unexpected: crash, power failure)
- Total/available memory
- License status
Thresholds:
- Uptime < 24h → WARNING: Recent reload
- Uptime < 1h → CRITICAL: Very recent reload, check for crash
- Last reload reason contains "crash" or "error" → CRITICAL
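The Step 1 thresholds can be sketched as a small classifier. This is a minimal illustration, not part of the pyATS tooling: the function name `classify_uptime` and its inputs (an uptime already converted to a `timedelta`, and the reload reason as a string) are assumptions made for the example.

```python
from datetime import timedelta


def classify_uptime(uptime: timedelta, reload_reason: str) -> str:
    """Apply the Step 1 thresholds and return the worst matching status."""
    reason = reload_reason.lower()
    # A reload reason mentioning a crash or error is CRITICAL regardless of uptime.
    if "crash" in reason or "error" in reason:
        return "CRITICAL"
    # Under one hour: very recent reload.
    if uptime < timedelta(hours=1):
        return "CRITICAL"
    # Under 24 hours: recent reload.
    if uptime < timedelta(hours=24):
        return "WARNING"
    return "HEALTHY"
```

For example, `classify_uptime(timedelta(hours=5), "reload command")` falls under the 24-hour rule and is reported as WARNING.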
Step 2: CPU Utilization
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show processes cpu sorted"}'
Thresholds (5-second / 1-minute / 5-minute averages):
- < 50% → HEALTHY
- 50-75% → WARNING: Elevated CPU
- 75-90% → HIGH: Investigate top processes
- > 90% → CRITICAL: Immediate investigation required
Top processes to watch:
- IP Input: high traffic volume or routing loops
- BGP Router / BGP I/O: large BGP table or instability
- OSPF-1 Hello: OSPF adjacency issues
- Crypto IKMP / Crypto Engine: IPsec overhead
- SNMP ENGINE: polling storm
- ARP Input: ARP storm or L2 loop
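As a rough sketch, the CPU thresholds above can be applied to the summary line that IOS-XE prints at the top of `show processes cpu sorted`. The helper names (`classify_cpu`, `cpu_status`) are hypothetical, the code assumes raw text output rather than a parsed structure, and the handling of the exact boundary values (50, 75, 90) is an assumption because the ranges in the list share their endpoints.

```python
import re

# Summary line format, e.g.:
# "CPU utilization for five seconds: 82%/3%; one minute: 55%; five minutes: 12%"
CPU_LINE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)


def classify_cpu(pct: int) -> str:
    """Map a CPU percentage to a status (boundary handling is an assumption)."""
    if pct > 90:
        return "CRITICAL"
    if pct >= 75:
        return "HIGH"
    if pct >= 50:
        return "WARNING"
    return "HEALTHY"


def cpu_status(raw_output: str) -> dict:
    """Extract the three averages and classify each one."""
    match = CPU_LINE.search(raw_output)
    if match is None:
        raise ValueError("CPU utilization summary line not found")
    # The second number on the five-second figure is interrupt-level CPU.
    five_sec, _interrupt, one_min, five_min = map(int, match.groups())
    return {
        "five_seconds": (five_sec, classify_cpu(five_sec)),
        "one_minute": (one_min, classify_cpu(one_min)),
        "five_minutes": (five_min, classify_cpu(five_min)),
    }
```

A short spike shows up only in the five-second figure, so comparing the three statuses helps separate a transient burst from sustained load.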
Step 3: Memory Utilization
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show processes memory sorted"}'
Also run:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show platform resources"}'
Thresholds:
- Used < 70% → HEALTHY
- 70-85% → WARNING: Memory pressure
- 85-95% → HIGH: May impact routing table updates
- > 95% → CRITICAL: Risk of process crashes or OOM
Memory consumers to watch:
- BGP Router: large BGP table (full internet table = ~1M routes)
- CEF process: large FIB
- OSPF Router: large OSPF LSDB
- HTTP CORE: web server / RESTCONF overhead
- IOSD iomem: I/O memory for packet buffers
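The memory thresholds work on a used percentage, which usually has to be derived from used and total byte counts. A minimal sketch follows; `memory_status` is a hypothetical helper, and assigning the shared boundary values (70, 85) to the higher severity is an assumption.

```python
def memory_status(used_bytes: int, total_bytes: int) -> tuple:
    """Return (percent used, status) according to the Step 3 thresholds."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    pct = 100.0 * used_bytes / total_bytes
    if pct > 95:
        status = "CRITICAL"
    elif pct >= 85:
        status = "HIGH"
    elif pct >= 70:
        status = "WARNING"
    else:
        status = "HEALTHY"
    return round(pct, 1), status
```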
Step 4: Interface Status
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip interface brief"}'
Then get detailed counters with show interfaces and review the entry for each active interface:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show interfaces"}'
Report for each interface:
- Admin status (up/down) and protocol status (up/down)
- IP address and subnet
- Speed, duplex, MTU
- Input/output rate (bps and pps)
- Error counters: CRC, input errors, output errors, drops, overruns
- Resets counter (flag if incrementing, which indicates flapping)
- Last input/output timestamps
Flags:
- Interface up/down → WARNING: Check physical or protocol
- CRC errors > 0 → WARNING: Physical layer issue (cabling, optics, duplex mismatch)
- Input errors incrementing → WARNING: Packet corruption
- Output drops > 0 → WARNING: Congestion or QoS issue
- Resets incrementing → CRITICAL: Interface flapping
- Line protocol down on configured interface → CRITICAL
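Several of these flags depend on whether a counter is incrementing, which requires two snapshots taken some time apart. The sketch below shows one way to compare them. The function name and the dictionary keys (`crc`, `input_errors`, `output_drops`, `resets`) are hypothetical and would need to be mapped from whatever structure the show command output is parsed into.

```python
def counter_flags(prev: dict, curr: dict) -> list:
    """Compare two counter snapshots for one interface and return (severity, reason) pairs."""
    flags = []
    # Absolute checks: any non-zero value is flagged.
    if curr["crc"] > 0:
        flags.append(("WARNING", "CRC errors present"))
    if curr["output_drops"] > 0:
        flags.append(("WARNING", "output drops present"))
    # Delta checks: flagged only if the counter grew between snapshots.
    if curr["input_errors"] > prev["input_errors"]:
        flags.append(("WARNING", "input errors incrementing"))
    if curr["resets"] > prev["resets"]:
        flags.append(("CRITICAL", "resets incrementing (flapping)"))
    return flags
```

A single snapshot can only answer the absolute checks, so pre-change and post-change runs are a natural source for the two snapshots.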
Step 5: Hardware & Environment
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show inventory"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show platform"}'
Report: Module status (ok/fail), serial numbers, PID, transceiver types and DOM readings.
Step 6: NTP Synchronization
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ntp associations"}'
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show clock"}'
Flags:
- No NTP peer synchronized (no * in associations) → CRITICAL for logging/forensics
- Clock offset > 100ms → WARNING
- Clock offset > 1s → CRITICAL
- No NTP configured at all → CRITICAL
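The NTP flags reduce to a short decision function once the configured state, the sync state, and the offset are known. This is a sketch under assumptions: `ntp_status` and its parameters are hypothetical, and the offset is taken as a signed value in milliseconds, so the absolute value is compared.

```python
def ntp_status(configured: bool, synced: bool, offset_ms: float = 0.0) -> str:
    """Apply the Step 6 flags to an already-extracted NTP state."""
    if not configured:
        return "CRITICAL"  # no NTP configured at all
    if not synced:
        return "CRITICAL"  # no peer marked with * in the associations
    offset = abs(offset_ms)
    if offset > 1000:
        return "CRITICAL"  # more than 1 s off
    if offset > 100:
        return "WARNING"   # more than 100 ms off
    return "HEALTHY"
```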
Step 7: System Logs
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_logging '{"device_name":"R1"}'
Scan for these patterns:
- %SYS-*-RELOAD: reload events
- %LINEPROTO-5-UPDOWN: interface flaps
- %OSPF-*-ADJCHG: OSPF adjacency changes
- %BGP-*-ADJCHANGE: BGP peer state changes
- %DUAL-*-NBRCHANGE: EIGRP neighbor changes
- %SYS-2-MALLOCFAIL: memory allocation failure (CRITICAL)
- %SYS-3-CPUHOG: process monopolizing CPU (HIGH)
- %TRACKING-*: IP SLA or object tracking changes
- %SEC-* / %AUTHMGR-*: security events
- %PLATFORM-*-CRASH: crash events (CRITICAL)
- Traceback: software bug (CRITICAL; open a TAC case)
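The pattern list above can be turned into a simple log scanner. The sketch below is illustrative only: `scan_logs` is a hypothetical helper, the `*` wildcard is interpreted as a single severity digit (an assumption), and patterns for which the list gives no severity are returned with `None` rather than an invented rating.

```python
import re
from collections import Counter

# label -> (compiled regex, severity stated in the list, or None if unstated)
LOG_PATTERNS = {
    "%SYS-*-RELOAD": (re.compile(r"%SYS-\d-RELOAD"), None),
    "%LINEPROTO-5-UPDOWN": (re.compile(r"%LINEPROTO-5-UPDOWN"), None),
    "%OSPF-*-ADJCHG": (re.compile(r"%OSPF-\d-ADJCHG"), None),
    "%BGP-*-ADJCHANGE": (re.compile(r"%BGP-\d-ADJCHANGE"), None),
    "%DUAL-*-NBRCHANGE": (re.compile(r"%DUAL-\d-NBRCHANGE"), None),
    "%SYS-2-MALLOCFAIL": (re.compile(r"%SYS-2-MALLOCFAIL"), "CRITICAL"),
    "%SYS-3-CPUHOG": (re.compile(r"%SYS-3-CPUHOG"), "HIGH"),
    "%TRACKING-*": (re.compile(r"%TRACKING-"), None),
    "%SEC-*/%AUTHMGR-*": (re.compile(r"%(?:SEC|AUTHMGR)-"), None),
    "%PLATFORM-*-CRASH": (re.compile(r"%PLATFORM-\d-CRASH"), "CRITICAL"),
    "Traceback": (re.compile(r"Traceback"), "CRITICAL"),
}


def scan_logs(log_text: str) -> dict:
    """Count matching lines per pattern; returns {label: (count, severity)}."""
    hits = Counter()
    for line in log_text.splitlines():
        for label, (regex, _severity) in LOG_PATTERNS.items():
            if regex.search(line):
                hits[label] += 1
    return {label: (count, LOG_PATTERNS[label][1]) for label, count in hits.items()}
```

The counts feed directly into a report detail such as "3 OSPF adjacency flaps".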
Step 8: Connectivity Validation
Test reachability to critical infrastructure:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_ping_from_network_device '{"device_name":"R1","command":"ping 8.8.8.8 repeat 5"}'
Thresholds:
- 100% success, RTT < 50ms → HEALTHY
- 100% success, RTT > 100ms → WARNING: High latency
- 80-99% success → WARNING: Packet loss
- < 80% success → CRITICAL: Significant packet loss
- 0% success → CRITICAL: No reachability
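A sketch of how the connectivity thresholds might be applied to the IOS-XE ping summary line follows. It assumes raw text output and uses the average RTT for the latency check. The list does not say how to rate 100% success with an RTT between 50 and 100 ms, so this example treats that band as HEALTHY, which is an assumption. `ping_status` is a hypothetical helper name.

```python
import re

# Summary line format, e.g.:
# "Success rate is 100 percent (5/5), round-trip min/avg/max = 20/23/31 ms"
# The round-trip part is absent when every probe fails.
PING_SUMMARY = re.compile(
    r"Success rate is (\d+) percent \((\d+)/(\d+)\)"
    r"(?:, round-trip min/avg/max = (\d+)/(\d+)/(\d+) ms)?"
)


def ping_status(raw_output: str) -> str:
    """Classify a ping result according to the Step 8 thresholds."""
    match = PING_SUMMARY.search(raw_output)
    if match is None:
        raise ValueError("ping summary line not found")
    success = int(match.group(1))
    avg_rtt = int(match.group(5)) if match.group(5) else None
    if success == 0:
        return "CRITICAL: No reachability"
    if success < 80:
        return "CRITICAL: Significant packet loss"
    if success < 100:
        return "WARNING: Packet loss"
    if avg_rtt is not None and avg_rtt > 100:
        return "WARNING: High latency"
    # Assumption: the unspecified 50-100 ms band is reported as HEALTHY.
    return "HEALTHY"
```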
Health Report Format
Always produce a summary table:
Device: R1 (devnetsandboxiosxec8k.cisco.com)
Model: C8000V | IOS-XE: 17.x.x | Uptime: XXd XXh
┌──────────────────┬──────────┬─────────────────────────┐
│ Check            │ Status   │ Details                 │
├──────────────────┼──────────┼─────────────────────────┤
│ CPU (5min avg)   │ HEALTHY  │ 12%                     │
│ Memory           │ HEALTHY  │ 45% used (1.2G/2.6G)    │
│ Interfaces       │ WARNING  │ Gi2 down/down           │
│ Hardware         │ HEALTHY  │ All modules OK          │
│ NTP              │ HEALTHY  │ Synced, offset 2ms      │
│ Logs             │ WARNING  │ 3 OSPF adjacency flaps  │
│ Connectivity     │ HEALTHY  │ 100% to 8.8.8.8, 23ms   │
└──────────────────┴──────────┴─────────────────────────┘
Overall: WARNING (2 items need attention)
Severity order: CRITICAL > HIGH > WARNING > HEALTHY. Overall status = worst individual status.
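The "worst individual status" rule is a maximum over the severity order. A minimal sketch, with `overall_status` as a hypothetical helper name:

```python
# Least to most severe, per the stated order CRITICAL > HIGH > WARNING > HEALTHY.
SEVERITY_ORDER = ["HEALTHY", "WARNING", "HIGH", "CRITICAL"]


def overall_status(statuses: list) -> str:
    """Return the most severe status among the individual checks."""
    return max(statuses, key=SEVERITY_ORDER.index)
```

Applied to the seven rows of the example table (five HEALTHY, two WARNING), this yields WARNING, matching the table's overall line.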
NetBox Cross-Reference (MISSION02 Enhancement)
When NetBox is available ($NETBOX_MCP_SCRIPT is set), cross-reference device state against the source of truth after Steps 1 and 4:
Interface State Validation
Query NetBox for expected interface states:
python3 $MCP_CALL "python3 -u $NETBOX_MCP_SCRIPT" netbox_get_objects '{"object_type":"dcim.interfaces","filters":{"device":"R1"},"brief":true}'
Compare NetBox intent vs device reality:
- NetBox shows interface enabled but device shows down → CRITICAL: Unexpected outage
- NetBox shows interface disabled but device shows up → WARNING: Undocumented activation
- Interface exists on device but not in NetBox → WARNING: Undocumented interface
- Interface in NetBox but not on device → WARNING: NetBox stale data
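The four comparison rules can be sketched as a single pass over the union of interface names. The input shapes are assumptions for illustration: `netbox_enabled` maps interface name to NetBox's enabled flag, and `device_status` maps interface name to "up" or "down" as observed on the device. `compare_interfaces` is a hypothetical helper.

```python
def compare_interfaces(netbox_enabled: dict, device_status: dict) -> list:
    """Return (interface, severity, finding) tuples for intent vs reality mismatches."""
    findings = []
    for name in sorted(set(netbox_enabled) | set(device_status)):
        if name not in netbox_enabled:
            findings.append((name, "WARNING", "Undocumented interface"))
        elif name not in device_status:
            findings.append((name, "WARNING", "NetBox stale data"))
        elif netbox_enabled[name] and device_status[name] == "down":
            findings.append((name, "CRITICAL", "Unexpected outage"))
        elif not netbox_enabled[name] and device_status[name] == "up":
            findings.append((name, "WARNING", "Undocumented activation"))
    return findings
```

Interface names must be normalized to the same form on both sides (for example, Gi2 versus GigabitEthernet2) before the comparison is meaningful.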
IP Address Validation
Query NetBox for expected IP assignments:
python3 $MCP_CALL "python3 -u $NETBOX_MCP_SCRIPT" netbox_get_objects '{"object_type":"ipam.ip-addresses","filters":{"device":"R1"}}'
Compare: Flag any IP_DRIFT where the device IP differs from NetBox.
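One way to detect IP_DRIFT is to normalize both sides with the standard `ipaddress` module and compare host addresses only, since the brief interface view on the device does not include a prefix length. The input shapes (interface name mapped to an address string, or `None` when unassigned) and the helper name `find_ip_drift` are assumptions for this sketch.

```python
import ipaddress


def find_ip_drift(netbox_ips: dict, device_ips: dict) -> list:
    """Return one IP_DRIFT record per interface whose device IP differs from NetBox."""
    drift = []
    for iface, expected in sorted(netbox_ips.items()):
        actual = device_ips.get(iface)
        # ip_interface accepts "10.0.0.1/24" as well as a bare "10.0.0.1".
        expected_ip = ipaddress.ip_interface(expected).ip
        actual_ip = ipaddress.ip_interface(actual).ip if actual else None
        if actual_ip != expected_ip:
            drift.append({
                "interface": iface,
                "netbox": str(expected_ip),
                "device": str(actual_ip) if actual_ip else None,
                "flag": "IP_DRIFT",
            })
    return drift
```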
Fleet-Wide Health (pCall)
To run health checks across ALL devices simultaneously, first list all devices:
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_list_devices
Then run Steps 1-8 on each device concurrently using multiple exec commands. Collect all results and produce a fleet summary:
┌──────────┬──────────┬──────┬────────┬──────────┬─────────────┐
│ Device   │ CPU      │ Mem  │ Intf   │ NTP      │ Overall     │
├──────────┼──────────┼──────┼────────┼──────────┼─────────────┤
│ R1       │ HEALTHY  │ WARN │ HEALTHY│ HEALTHY  │ WARNING     │
│ R2       │ HEALTHY  │ OK   │ CRIT   │ HEALTHY  │ CRITICAL    │
│ SW1      │ HIGH     │ OK   │ HEALTHY│ CRIT     │ CRITICAL    │
└──────────┴──────────┴──────┴────────┴──────────┴─────────────┘
Sort devices by severity (CRITICAL first) for triage prioritization.
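The concurrent run and the severity sort could be sketched with a thread pool. This is an illustration under assumptions rather than the skill's actual mechanism: `check_device` is a hypothetical callable that runs Steps 1-8 for one device and returns a dict with at least `device` and `overall` keys.

```python
from concurrent.futures import ThreadPoolExecutor

# Lower rank sorts first, so CRITICAL devices lead the triage list.
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "WARNING": 2, "HEALTHY": 3}


def fleet_summary(devices: list, check_device, max_workers: int = 8) -> list:
    """Run check_device on every device concurrently and sort results by severity."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check_device, devices))
    # Secondary key keeps devices with equal severity in a stable, readable order.
    return sorted(results, key=lambda r: (SEVERITY_RANK[r["overall"]], r["device"]))
```

Threads suit this workload because each check spends most of its time waiting on device I/O.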
GAIT Audit Trail
After completing a health check, record the session in GAIT:
python3 $MCP_CALL "python3 -u $GAIT_MCP_SCRIPT" gait_record_turn '{"input":{"role":"assistant","content":"Health check completed on R1: CPU HEALTHY (12%), Memory WARNING (78%), Interfaces HEALTHY, NTP HEALTHY. Overall: WARNING.","artifacts":[]}}'
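When the summary text is generated rather than typed by hand, building the JSON argument programmatically avoids quoting mistakes. The sketch below mirrors the payload shape shown in the command above. The helper name `gait_turn_argument` is hypothetical, and nothing is assumed about the GAIT tool beyond that payload.

```python
import json
import shlex


def gait_turn_argument(content: str, artifacts=None) -> str:
    """Build the shell-quoted JSON argument for gait_record_turn."""
    payload = {
        "input": {
            "role": "assistant",
            "content": content,
            "artifacts": artifacts or [],
        }
    }
    # json.dumps escapes double quotes; shlex.quote protects the result in a shell.
    return shlex.quote(json.dumps(payload))
```

The returned string can be appended to the command line in place of the hand-written JSON literal.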