What can Notre Dame teach us about site reliability?

Back in 2019, Notre Dame suffered a horrific fire. The New York Times published one of the best interactive articles I have seen, chronicling the events that took place:

https://www.nytimes.com/interactive/2019/07/16/world/europe/notre-dame.html

Around the same time, our team was undergoing a reorganization and we were building a new SRE team. After reviewing the article, I started to notice parallels between what happened at Notre Dame and a typical production outage for us (at that time).

Training

The New York Times reported that “The security employee monitoring the smoke alarm panel at Notre-Dame cathedral was just three days on the job when the red warning light flashed on the evening of April 15: ‘Feu.’ Fire.” Nearly 30 minutes elapsed before anyone realized the mistake. The first hour was defined by that initial, critical error: the failure to identify the location of the fire.

Are we guilty of putting people in critical roles without the proper training? No matter how talented and well-qualified an employee may be, the initial break-in period is crucial. Oh, and don’t overwork your team. The article mentions that the security officer was pulling a second shift because his replacement did not show up.

In the SRE world, this leads to prolonged outages and missed KPIs.

  • Properly onboard and train new team members
  • Ensure they are ready before they take on critical roles by themselves
  • Focus on isolating the challenge, not always solving it

Monitoring & Alerting

When the smoke and fire alarms went off, the alerts were obscure and non-actionable. The New York Times calls the fire warning system at Notre-Dame “so arcane that when it was called upon to do the one thing that mattered — warn ‘fire!’ and say where — it produced instead a nearly indecipherable message.” This wasted more precious minutes as the security guard had to climb steep stairs to the attic. Since the alarm was not clear, he had to check several locations before finding the source of the fire and smoke. Even the fire department was delayed in reaching the fire.

Just like in the SRE world:

  • The alarm/alert should be clear and concise
  • It should identify exactly where the challenge is
  • It should state what action needs to be taken to resolve the situation as soon as possible
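
As a rough illustration of the three points above, here is a minimal Python sketch of an alert that refuses to go out unless it says what is wrong, where, and what to do. The `Alert` class, its field names, and both helper functions are hypothetical and only meant to show the idea. They are not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert payload; the field names are illustrative only."""
    summary: str   # what is wrong, in plain language
    location: str  # exactly where the problem is
    action: str    # the first step the responder should take


def missing_fields(alert: Alert) -> list[str]:
    """Return the names of blank fields so vague alerts can be rejected."""
    return [
        name
        for name in ("summary", "location", "action")
        if not getattr(alert, name).strip()
    ]


def render(alert: Alert) -> str:
    """Format the alert as one short line, or fail if it is not actionable."""
    gaps = missing_fields(alert)
    if gaps:
        raise ValueError(f"alert is not actionable, missing: {', '.join(gaps)}")
    return f"{alert.summary} | where: {alert.location} | do: {alert.action}"
```

An alert that only says "Feu" would fail this check because it has no location and no action, which is roughly the situation the guard was left in.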

Sense of Urgency

Responding quickly is key. If the alert had been clear, the guard could have isolated the fire much faster. It was also unclear what he was supposed to do in the event of a fire. He phoned his supervisor, who did not answer immediately. They spent precious time troubleshooting the challenge instead of calling the fire department immediately. Heck, the fire department should have been called automatically. But there were questions about the reliability and accuracy of the fire detection system, and this led to noisy alerting, similar to what SRE teams face on a daily basis. Alert fatigue is real when you are getting inundated with alerts. How do you know which ones are accurate and require immediate attention? Finally, know when to abandon a dead work stream. Focus on what you can save.

  • Prioritize the alert. Critical alerts should be treated differently than others
  • Provide clear action steps for the responder to take when the alert arrives
  • Every team member should know the standard operating procedures for each situation
  • Automate all the things. Can the alert automatically take an action?
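
To make the prioritization and automation points concrete, here is a small Python sketch, assuming a simple three-level severity scale. The `Severity` levels, the `route` function, and the two callbacks (`page_on_call`, `auto_remediate`) are all hypothetical names chosen for illustration; a real setup would wire these to whatever paging and automation tools the team already uses.

```python
from enum import IntEnum
from typing import Callable


class Severity(IntEnum):
    """Hypothetical severity scale; real systems define their own levels."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


def route(
    severity: Severity,
    page_on_call: Callable[[], None],
    auto_remediate: Callable[[], None],
) -> list[str]:
    """Handle an alert by severity and return the steps taken, in order."""
    steps: list[str] = []
    if severity is Severity.CRITICAL:
        # Escalate right away instead of waiting for a human to triage.
        page_on_call()
        steps.append("paged on-call")
        # Then run the pre-agreed automated action for this alert.
        auto_remediate()
        steps.append("ran automated action")
    elif severity is Severity.WARNING:
        steps.append("opened ticket")
    else:
        steps.append("logged only")
    return steps
```

The design choice here is that only critical alerts interrupt a person or trigger automation, so the noisy lower-severity alerts cannot drown out the one that matters.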

In the midst of the national, even global, tragedy of the Notre-Dame fire, it should always be remembered that no lives were lost. No firefighters perished, but the danger they faced could have been avoided. The damage done to such a historic treasure could also have been prevented or reduced.

Give the article a few minutes of your team’s time and see what lessons you can learn for your SRE or operations team.

Originally published at https://www.linkedin.com.

Principal Site Reliability Engineer. Cyber Security Professional. Technologist. Leader.