A short postmortem on why a node in my Proxmox cluster went offline twice within a few days.
Issue summary
Whilst on holiday, I received a notification that a Proxmox host was offline. I was unable to access the host via SSH and the Proxmox GUI showed the host and all resident VMs as offline.
Timeline
- 2025-03-26: host-03 goes offline.
- 2025-03-29: host-03 is power-cycled and comes back up.
- 2025-04-03: host-03 goes offline again.
- 2025-04-04: after attaching a monitor it is discovered that bond0 is down. The networking config is corrected and the host is brought back online.
Root cause
The root cause of the issue was a misconfiguration in the network settings of host-03. Upon attaching a monitor after the second outage it was discovered that some spurious output from the ip command (see below) had somehow been appended to /etc/network/interfaces. I do not understand how the host was able to initially bring the networks up after booting, but the working theory as to why the node dropped offline is that an automated networking reload was tripped up by this and failed to bring bond0 up correctly.
auto bond0
iface bond0 inet manual
    ovs_bonds enp1s0 enx60a4b758ba5b
    ovs_type OVSBond
    ovs_bridge vmbr0
    ovs_options vlan_mode=native-untagged lacp=active bond_mode=balance-tcp tag=1

auto vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports bond0 vlan1 vlan1001 vlan1002 vlan1111 vlan1000 vlan1100 vlan1200 vlan1101 vlan1102
27: enx60a4b758ba5b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP group default qlen 1000
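As a rough illustration of how a stray line like the last one above could be caught before a networking reload, here is a minimal Python sketch. It assumes the unwanted text always has the shape of the first line of an ip interface record (an index, an interface name, then flags in angle brackets). The function name find_ip_output and the SAMPLE text are made up for this example and are not part of Proxmox or ifupdown.

```python
import re

# Heuristic (an assumption, not an official grammar): the first line of an
# `ip link` / `ip addr` record looks like "27: enx60a4b758ba5b: <FLAGS,...> ...".
IP_RECORD = re.compile(r"^\s*\d+:\s+\S+:\s+<[A-Z0-9_,-]*>")


def find_ip_output(text):
    """Return (line_number, line) pairs that look like `ip` command output."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if IP_RECORD.match(line)
    ]


# Shortened sample based on the file contents shown in this post.
SAMPLE = """\
auto vmbr0
iface vmbr0 inet manual
ovs_type OVSBridge
27: enx60a4b758ba5b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP group default qlen 1000
"""

for number, line in find_ip_output(SAMPLE):
    print(f"line {number}: {line}")
```

To check a real host, the contents of /etc/network/interfaces could be read and passed to find_ip_output in place of SAMPLE. This only catches this one kind of stray text; it is not a general validator for the file.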
Resolution and recovery
Removal of the spurious output from /etc/network/interfaces and a restart of the networking service brought host-03 back online.
Corrective and preventative measures
Be more careful with terminal redirection.