Unfortunately we had a Valentine's Day outage of around 2 hours.
Incident Timeline: (times in UTC)
04:39 - Our monitoring sees a 50x error.
04:41 - I am alerted via email & phone.
04:48 - I acknowledge the incident and start investigating.
04:50 - I cannot access the VM via SSH. I issue a reboot via our control panel.
04:54 - Our server has a load of 12, and IOWait is at 57%.
05:30 - I issue another reboot but still can't figure out what's wrong.
05:58 - I lodge a ticket with our provider asking them to check the host and to power it off and on again, as we still have huge IOWait values and 100% memory usage.
06:30 - The hosting company hasn't got back to me, so I roll back the latest configuration changes I've made and reboot.
06:35 - Sites are back online.
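For anyone wanting to check an IOWait figure like the one above without extra tooling, here is a minimal sketch that derives it from two samples of /proc/stat on Linux. This is not the tool I used during the incident; the function names (cpu_times, iowait_percent, sample_iowait) are made up for illustration.

```python
import time


def cpu_times(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    # First eight counters: user nice system idle iowait irq softirq steal.
    # The guest counters that follow are already included in user/nice.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


def sample_iowait(interval=1.0):
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open("/proc/stat") as f:
        before = cpu_times(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = cpu_times(f.read())
    return iowait_percent(before, after)
```

Calling sample_iowait() on the affected machine gives a number comparable to what top or vmstat report in their wa column.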
Resolution
The latest change included turning on huge pages with a value of 100MB to let Postgres get some performance gains.
This change was done on Monday morning and I had planned to do a power cycle this week to confirm everything was on the up-and-up. Turns out my host did that for me.
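One way to sanity-check a huge pages change before the next reboot is to look at how much memory the kernel has actually set aside for them. The sketch below parses /proc/meminfo text (the MemTotal, HugePages_Total and Hugepagesize fields) and reports the reservation. The function names are hypothetical, and the sample numbers are invented for illustration, not taken from this server.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo key to its integer value (kB for sized fields)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            info[key.strip()] = int(parts[0])
    return info


def hugepage_reservation(info):
    """Return (reserved kB, percent of MemTotal) held back for huge pages."""
    reserved_kb = info["HugePages_Total"] * info["Hugepagesize"]
    return reserved_kb, 100.0 * reserved_kb / info["MemTotal"]


# Invented sample, shaped like real /proc/meminfo output.
SAMPLE = """\
MemTotal:        2048000 kB
HugePages_Total:      50
HugePages_Free:       50
Hugepagesize:       2048 kB
"""

reserved_kb, percent = hugepage_reservation(parse_meminfo(SAMPLE))
print(f"{reserved_kb} kB reserved for huge pages ({percent:.1f}% of RAM)")
```

On a live box you would feed it open("/proc/meminfo").read() instead of the sample. If the reserved share is much larger than intended, that memory is unavailable to everything that does not use huge pages.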
The outage lasted longer than it should have due to some $job and $life.
Until next time,
Cheers,
Tiff
❤️