MicroK8s Janitor
This skill manages the lifecycle of a MicroK8s cluster, focusing on safe, rolling node upgrades that keep the cluster highly available throughout the process.
Prerequisites
- SSH access to at least one node in the cluster (the "seed" node).
- Passwordless `sudo` or SSH key-based authentication for the user.
- The `microk8s` snap must be installed on the target nodes.
Core Workflows
1. Cluster Discovery & Environment Setup
Starting from a single seed node provided by the user, the janitor discovers the full cluster state.
- Seed Connection: Connect to the seed node and run `microk8s kubectl get nodes -o json`.
- Node Mapping: Parse the output to identify all nodes, their roles, and current statuses.
- Connectivity Check: Verify SSH and `sudo` access to every node in the cluster.
- Channel Discovery: Run `snap info microk8s` on the seed node to list available tracking channels (e.g., `1.28/stable`, `latest/edge`).
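The node-mapping step can be sketched as a small parser over the JSON returned by `microk8s kubectl get nodes -o json`. This is a minimal sketch, not the janitor's actual implementation: the `NodeInfo` type and `parse_nodes` function are hypothetical names, and deriving the role from the `node.kubernetes.io/microk8s-controlplane` label is an assumption that should be checked against the labels on your own cluster.

```python
import json
from dataclasses import dataclass

# Assumption: control-plane nodes carry this label; adjust if your cluster differs.
CONTROL_PLANE_LABEL = "node.kubernetes.io/microk8s-controlplane"


@dataclass
class NodeInfo:
    name: str
    role: str        # "control-plane" or "worker", derived from labels
    ready: bool      # True when the Ready condition reports status "True"
    cordoned: bool   # True when spec.unschedulable is set


def parse_nodes(kubectl_json: str) -> list[NodeInfo]:
    """Turn `kubectl get nodes -o json` output into a list of NodeInfo."""
    nodes = []
    for item in json.loads(kubectl_json).get("items", []):
        labels = item["metadata"].get("labels", {})
        conditions = item.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        nodes.append(NodeInfo(
            name=item["metadata"]["name"],
            role="control-plane" if CONTROL_PLANE_LABEL in labels else "worker",
            ready=ready,
            cordoned=bool(item.get("spec", {}).get("unschedulable", False)),
        ))
    return nodes
```

The parser only reads fields from the standard Kubernetes Node object (`status.conditions`, `spec.unschedulable`), so it can be exercised against saved JSON without a live cluster.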
2. Pre-flight Checks (The "Dry Run")
Before any disruptive action, the janitor ensures the cluster is healthy:
- All nodes must be in the `Ready` status.
- Check for any "critical" pods (e.g., `Longhorn`, `Calico`, `CoreDNS`) that are currently in a non-running state.
- Verify that `dqlite` (the HA backend) has a healthy quorum.
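The critical-pod check could look roughly like the following, run over the output of `microk8s kubectl get pods -A -o json`. It is a sketch under assumptions: `unhealthy_critical_pods` is a hypothetical helper, critical pods are matched by name or namespace fragment, and pods in phase `Succeeded` (completed jobs) are treated as healthy. It does not cover the dqlite quorum check.

```python
import json

# Name fragments taken from the examples above; extend for your own workloads.
CRITICAL_FRAGMENTS = ("longhorn", "calico", "coredns")
# Assumption: completed Job pods (phase "Succeeded") should not block an upgrade.
HEALTHY_PHASES = {"Running", "Succeeded"}


def unhealthy_critical_pods(pods_json: str) -> list[str]:
    """Return "namespace/name (phase)" for critical pods that are not healthy."""
    problems = []
    for pod in json.loads(pods_json).get("items", []):
        meta = pod["metadata"]
        name = meta["name"].lower()
        namespace = meta.get("namespace", "default")
        phase = pod.get("status", {}).get("phase", "Unknown")
        is_critical = any(
            frag in name or frag in namespace.lower()
            for frag in CRITICAL_FRAGMENTS
        )
        if is_critical and phase not in HEALTHY_PHASES:
            problems.append(f"{namespace}/{meta['name']} ({phase})")
    return problems
```

Note that a pod in phase `Running` can still have containers that are crash-looping, so a stricter check would also inspect `status.containerStatuses`.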
3. Interactive Planning
Present the plan to the user:
- Current version vs. Target channel.
- The order of nodes to be upgraded.
- Ask for confirmation before proceeding.
4. Rolling Upgrade Loop (Sequential Execution)
For each node in the sequence:
- Cordon: `microk8s kubectl cordon <node-name>`
- Drain: `microk8s kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force`
- Upgrade: `sudo snap refresh microk8s --channel=<channel>`
- Wait for Ready: Poll `microk8s status --wait-ready` (max 5 minutes).
- Health Check: Verify local pods are starting and node status is `Ready`.
- Uncordon: `microk8s kubectl uncordon <node-name>`
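One way to structure the loop is to build the per-node command list as data and hand execution to an injected runner, which keeps the sequencing testable without SSH. The sketch below makes several assumptions: `upgrade_steps`, `run_rolling_upgrade`, and the `run(host, argv)` callback are hypothetical names; the 5-minute limit is passed through `--timeout` on `microk8s status`; and the health check is reduced to `kubectl wait` on the node's `Ready` condition, so it does not inspect local pods.

```python
from typing import Callable, Optional

# A step is (label, where, argv). "where" is "seed" or "target" and tells the
# caller which host the command should run on (for example over SSH).
Step = tuple[str, str, list[str]]


def upgrade_steps(node: str, channel: str, wait_seconds: int = 300) -> list[Step]:
    """Build the ordered command list for upgrading one node."""
    kubectl = ["microk8s", "kubectl"]
    return [
        ("cordon", "seed", kubectl + ["cordon", node]),
        ("drain", "seed", kubectl + ["drain", node, "--ignore-daemonsets",
                                     "--delete-emptydir-data", "--force"]),
        ("upgrade", "target", ["sudo", "snap", "refresh", "microk8s",
                               f"--channel={channel}"]),
        ("wait-ready", "target", ["microk8s", "status", "--wait-ready",
                                  "--timeout", str(wait_seconds)]),
        ("health-check", "seed", kubectl + ["wait", "--for=condition=Ready",
                                            f"node/{node}",
                                            f"--timeout={wait_seconds}s"]),
        ("uncordon", "seed", kubectl + ["uncordon", node]),
    ]


def run_rolling_upgrade(
    seed: str,
    nodes: list[str],
    channel: str,
    run: Callable[[str, list[str]], int],
) -> Optional[tuple[str, str]]:
    """Upgrade nodes one at a time and stop at the first failing step.

    `run(host, argv)` executes a command and returns its exit code.
    Returns None on success, or (node, step_label) identifying the failure.
    """
    for node in nodes:
        for label, where, argv in upgrade_steps(node, channel):
            host = node if where == "target" else seed
            if run(host, argv) != 0:
                return (node, label)
    return None
```

Because a failure returns immediately, a node that fails mid-upgrade stays cordoned and no later node is touched, which matches the abort behavior described for recovery.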
5. Resume & Recovery
If a step fails:
- Abort: Stop immediately. Do not move to the next node.
- Log: Capture and display the error from the node.
- State Check: On re-invocation, the janitor detects if any nodes are still cordoned and offers to "Resume" the upgrade from the failed node.
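The state check can be approximated by looking for nodes with `spec.unschedulable` set in the `kubectl get nodes -o json` output and resuming the plan from the first such node. This is a sketch with hypothetical helper names (`cordoned_nodes`, `resume_from`); it assumes the original node order is still known, and it cannot tell a node cordoned by a failed upgrade from one cordoned manually for another reason, so the user should still confirm.

```python
import json
from typing import Optional


def cordoned_nodes(nodes_json: str) -> list[str]:
    """Names of nodes with spec.unschedulable set, in API order."""
    return [
        item["metadata"]["name"]
        for item in json.loads(nodes_json).get("items", [])
        if item.get("spec", {}).get("unschedulable", False)
    ]


def resume_from(plan: list[str], nodes_json: str) -> Optional[list[str]]:
    """Return the rest of the plan starting at the first cordoned node.

    Returns None when no planned node is cordoned, meaning there is
    nothing to resume.
    """
    stuck = set(cordoned_nodes(nodes_json))
    for index, node in enumerate(plan):
        if node in stuck:
            return plan[index:]
    return None
```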
Best Practices
- Quorum First: Never upgrade more than one node at a time in a 3-node HA cluster to avoid losing quorum.
- Drain Timeout: If a drain hangs, report the specific pod causing the delay to the user.
- Snap Rollback: If `snap refresh` fails, attempt `snap revert microk8s` if appropriate.
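For the drain-timeout case, one way to name the offending pod is to list what is still scheduled on the node from `microk8s kubectl get pods -A -o json`. The sketch below is an assumption-laden approximation, not what `kubectl drain` itself reports: `pods_blocking_drain` is a hypothetical helper, it skips DaemonSet-owned pods because the drain uses `--ignore-daemonsets`, and it assumes finished pods are not what is holding the drain up.

```python
import json


def pods_blocking_drain(pods_json: str, node: str) -> list[str]:
    """List "namespace/name" for pods still on `node` that the drain must evict."""
    blocking = []
    for pod in json.loads(pods_json).get("items", []):
        if pod.get("spec", {}).get("nodeName") != node:
            continue
        owners = pod["metadata"].get("ownerReferences", [])
        if any(owner.get("kind") == "DaemonSet" for owner in owners):
            continue  # ignored by the drain because of --ignore-daemonsets
        if pod.get("status", {}).get("phase") in ("Succeeded", "Failed"):
            continue  # assumption: finished pods are not what is hanging
        namespace = pod["metadata"].get("namespace", "default")
        blocking.append(f"{namespace}/{pod['metadata']['name']}")
    return blocking
```

Pods that remain in this list after the drain has been running for a while are the likely candidates to report, for example a pod protected by a PodDisruptionBudget.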