[L66#2529874] Sessions with RS1 flapped
Incident Report for level66.network
Postmortem

Please find below the reason for outage (RFO) for the RS1 session flaps.

Tue 13:58 - Reloaded network settings on RS1 instead of the new RS2. This reset the interface and caused IPv4 and IPv6 sessions to flap
Tue 14:05 - Received alert, started investigation
Tue 14:20 - All sessions restored
Tue 14:22 - Sent message to mailing list
Tue 17:30 - Discovered IPv4 link-local address on interface
Tue 17:40 - Removed the address. This caused BIRD to update the interface's primary address (a behavior unknown to us at the time) and IPv4 sessions to flap
Tue 17:43 - Noticed flaps, started investigation
Tue 17:58 - Sent message to mailing list
Tue 18:01 - All sessions restored
Wed 15:22 - Reloaded network settings on the new RS2. This started IPv4 link-local address selection on the new RS2, which sent ARP requests for verification
Wed 15:22 - This in turn triggered RS1 to do the same. RS1 then assigned itself an additional IPv4 address, which again caused IPv4 sessions to flap
Wed 15:30 - Noticed flaps, started investigation on RS1
Wed 15:42 - All sessions restored
Wed 15:47 - Sent message to mailing list
Wed 16:57 - No cause found yet. Decided to postpone RS2 maintenance because of the unpredictable behavior and sent another announcement
Wed 17:30 - Started packet trace investigation, which eventually revealed the IPv4 link-local issue

To avoid this in the future, we are planning a number of changes, for which we will send a separate announcement.

  • Make the hostnames more distinct.
  • Revert changes introduced by the system upgrade, namely:
  • Change network config manager from Netplan back to Ifupdown for better control.
  • Disable IPv4 link-local addressing.
  • Hard-code BIRD's primary interface address.
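
As an illustration of how an unexpected IPv4 link-local address like the one in this incident could be detected early, here is a minimal sketch of a check. It is not part of our tooling: the function name find_ipv4_link_local and the sample address 169.254.7.12/16 are hypothetical, and the input is assumed to be a list of address strings already collected from the interface.

```python
import ipaddress


def find_ipv4_link_local(addresses):
    """Return the entries of `addresses` that fall in 169.254.0.0/16."""
    flagged = []
    for text in addresses:
        # ip_interface accepts both "a.b.c.d" and "a.b.c.d/len" forms,
        # for IPv4 as well as IPv6.
        iface = ipaddress.ip_interface(text)
        # IPv6 link-local (fe80::/10) is expected on the interface,
        # so only IPv4 link-local addresses are reported.
        if iface.version == 4 and iface.ip.is_link_local:
            flagged.append(text)
    return flagged


if __name__ == "__main__":
    # The first two addresses are RS1's session addresses from this report;
    # the other two are made-up examples.
    sample = [
        "193.239.116.255",
        "2001:7f8:13::a503:4307:1",
        "169.254.7.12/16",
        "fe80::1/64",
    ]
    print(find_ipv4_link_local(sample))
```

A check along these lines could run from monitoring and alert whenever the returned list is not empty.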

If you have any questions, don't hesitate to contact us.

Posted Feb 23, 2024 - 15:22 UTC

Resolved
Sessions with RS1 (193.239.116.255 and 2001:7f8:13::a503:4307:1) just
flapped. This was due to a mistake while preparing for the maintenance
on RS2 tomorrow. We sincerely apologize. We will investigate how to
avoid this in the future.
Posted Feb 13, 2024 - 01:30 UTC