A postmortem on sporadically flapping nodes in my ClusterAPI management cluster.
Issue summary
After pivoting management of the room101-a7d-mc cluster from the bootstrap cluster to the management cluster itself (making it self-managing), nodes began flapping and intermittently dropping offline.
➜ talosctl get nodestatuses
NODE            NAMESPACE   TYPE         ID                                   VERSION   READY   UNSCHEDULABLE
172.25.100.10   k8s         NodeStatus   room101-a7d-mc-cp-e2cc5a-2cwsv       80        true    false
172.25.100.11   k8s         NodeStatus   room101-a7d-mc-cp-e2cc5a-mw5js       47        false   false
172.25.100.12   k8s         NodeStatus   room101-a7d-mc-cp-e2cc5a-rq5cw       52        true    false
172.25.100.13   k8s         NodeStatus   room101-a7d-mc-workers-gvh48-v6hrz   178       true    false
172.25.100.14   k8s         NodeStatus   room101-a7d-mc-workers-gvh48-k6v2r   97        true    false
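For reference, a pivot like this is done by moving the CAPI resources from the bootstrap cluster into the target cluster with clusterctl; a minimal sketch (the kubeconfig paths are illustrative):
➜ clusterctl move --kubeconfig ./bootstrap.kubeconfig --to-kubeconfig ./room101-a7d-mc.kubeconfig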
Timeline
21-04-2025: CAPI MC was promoted to self-managing (pivoted from the bootstrap cluster)
21-04-2025: HelmRelease for kube-prometheus-stack was applied (possibly related)
21-04-2025: nodes started being flagged as NotReady
22-04-2025: kube-apiserver was observed being OOMKilled on a control plane node
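The flapping shows up as a stream of node events from the node controller; one way to watch for it while it is happening:
➜ kubectl get events -A --field-selector reason=NodeNotReady --watch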
Logs
172.25.100.11: kern: err: [2025-04-22T10:37:36.727637155Z]: Out of memory: Killed process 2735 (kube-apiserver) total-vm:3112352kB, anon-rss:1505564kB, file-rss:104kB, shmem-rss:0kB, UID:65534 pgtables:3800kB oom_score_adj:827
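Kernel messages like this can be pulled from a node on demand with talosctl:
➜ talosctl -n 172.25.100.11 dmesg | grep -i "out of memory"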
Root cause
An affected node was seeing OOMKill events on the kube-apiserver process: the nodes did not have enough memory, so under load the kernel killed the API server, and the resulting restarts and memory pressure are what surfaced as nodes flapping in and out of NotReady.
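Memory pressure on a node can be confirmed directly with talosctl before reaching for anything heavier:
➜ talosctl -n 172.25.100.11 memory
➜ talosctl -n 172.25.100.11 processes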
Resolution and recovery
Increased the control plane RAM allocation from 3 GB to 5 GB and the worker node RAM allocation from 2 GB to 4 GB.
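Because CAPI infrastructure machine templates are generally treated as immutable, bumping the allocation means creating a new template with the larger memory value and pointing the control plane and worker MachineDeployment at it, after which CAPI rolls the machines. A rough sketch; the template kind depends on the infrastructure provider, the control plane kind is assumed to be TalosControlPlane, and the resource names are illustrative:
# 1. export the current infra machine template, rename it and bump its memory field
➜ kubectl get <infra-machine-template> room101-a7d-mc-cp -o yaml > cp-template-5gb.yaml
➜ kubectl apply -f cp-template-5gb.yaml
# 2. point the control plane and MachineDeployment at the new templates;
#    CAPI then replaces each machine with one using the new allocation
➜ kubectl edit taloscontrolplane room101-a7d-mc-cp
➜ kubectl edit machinedeployment room101-a7d-mc-workers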
Future preventative measures
Issues like this would be easier to diagnose with centralised log aggregation from the nodes. This could be achieved by setting up a log server and configuring the nodes to ship their logs to it.
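Talos supports shipping service logs off the node via the machine config; a minimal sketch, assuming a hypothetical log receiver at udp://172.25.100.2:5044 (the config schema and the talosctl patch flags should be checked against the Talos version in use):
# logging-patch.yaml -- ship service logs as JSON lines to an external receiver
machine:
  logging:
    destinations:
      - endpoint: "udp://172.25.100.2:5044"
        format: "json_lines"
# kernel messages (like the OOM kill above) are shipped separately via the
# talos.logging.kernel kernel argument
➜ talosctl -n 172.25.100.10,172.25.100.11,172.25.100.12,172.25.100.13,172.25.100.14 patch machineconfig --patch @logging-patch.yaml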