IT Operations Crisis Management Lessons from Notre Dame
By Charles Araujo, Principal Analyst, Intellyx
I was on a conference call when the news flashed across my screen: “Notre Dame Cathedral is on fire.”
The news was shocking — and heartbreaking. As the tragedy unfolded, I remember thinking, “How did this happen?”
We now have some answers.
The New York Times recently published an in-depth analysis of the events of that fateful day. Within its comprehensive deconstruction, there are valuable lessons — both good and bad — for IT leaders during an operational crisis.
PLAN WITH INTENT — AND WITHOUT ASSUMPTIONS
“All the sensitive technology at the heart of the system had been undone by a cascade of oversights and erroneous assumptions built into the overall design.”
Notre Dame was a national treasure. Its overseers knew full well the risk of fire to the cathedral — particularly in the latticework of ancient timbers in the attic. They, therefore, implemented an incredibly sensitive monitoring system that could detect even the faintest amount of smoke.
That monitoring system worked — but the overall management plan still failed. Why?
The system alerted a security employee, who then relayed a message to a church guard. But the overall plan broke down because the alert used complex, difficult-to-understand jargon, and because the plan made a series of assumptions about both how a fire might unfold and how teams would communicate during the critical early stages of an incident.
Those communication breakdowns and assumptions were the real flaw in the system.
When the fire was detected, the system reported, “Attic Nave Sacristy ZDA-110-3-15-1 aspirating framework.”
The management plan assumed that whoever received this message would know what it meant — and that the on-site team would know what to do. Unfortunately, the employee who received it had been on the job for three days — and was probably ill-prepared to decipher this bewildering code. As a result, he and the guard on duty were unable to communicate and sort out what was happening quickly enough to respond while it still mattered.
I wonder how many IT operations teams have invested hundreds of thousands of dollars in sophisticated instrumentation and monitoring solutions, but then use similar jargon and make comparable assumptions in their management plans?
It’s easy to buy the tool and create the pretty dashboards, but when the crisis hits, the clarity of communications and the ability to respond instantly will determine whether it becomes a catastrophe.
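One way to picture the fix is a thin translation layer that sits between the monitoring tool and the person on watch. The sketch below is a minimal illustration in Python, not a description of the cathedral's actual system: the `ZONE_DIRECTORY` table, the `PlainAlert` type, the `translate_alert` function, and the location and action wording are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical lookup table. The location and action text are invented for
# illustration and do not claim to be the real meaning of the code.
ZONE_DIRECTORY = {
    "ZDA-110-3-15-1": {
        "location": "main attic",
        "action": "Call the fire department now, then send a guard to the main attic.",
    },
}


@dataclass
class PlainAlert:
    headline: str  # what is happening and where, in plain words
    action: str    # the first thing the recipient should do
    raw: str       # original message, kept for engineers who need it


def translate_alert(raw_message: str) -> PlainAlert:
    """Turn a raw detector message into something a new hire can act on."""
    for code, entry in ZONE_DIRECTORY.items():
        if code in raw_message:
            return PlainAlert(
                headline=f"SMOKE DETECTED: {entry['location']}",
                action=entry["action"],
                raw=raw_message,
            )
    # Unknown code: never fail silently, and default to the safest action.
    return PlainAlert(
        headline="SMOKE DETECTED: location unknown",
        action="Call the fire department now and check every zone.",
        raw=raw_message,
    )
```

The design choice worth noting is the fallback: a message the table cannot decode still produces a clear instruction, so an unrecognized code never leaves the person on watch guessing.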
RESPOND FIRST, DO ANALYSIS LATER
While many are placing the blame at the feet of the security company and guards, the real culprit lies in the management plan itself.
The procedure, in the event of an alarm, was for the security employee monitoring the system to radio the guard to check it out. Because of the arcane message, the miscommunication between the employee and the guard, and the employee’s inability to reach his boss, that process took thirty minutes.
By the time they had sorted it out, the fire was already burning out of control.
But while everyone is busy pointing fingers and casting blame, they should instead be asking why they had that procedure in the first place.
In a place as susceptible to fire as Notre Dame, every moment counted. Why wasn’t the procedure to immediately call the fire department and have every available resource check everything?
Unfortunately, many IT operations teams follow a similarly slow analysis-driven procedure when they have priority one incidents: they begin a triage process rather than ringing the bells and calling everyone to their stations.
When I think of how IT operations teams should respond to a crisis, I like to imagine Captain Kirk speaking into his communicator, “Red Alert,” and watching everyone run to their stations ready for battle. When a crisis hits, the response is the first order of business, and it should be all hands on deck.
If you’re thinking to yourself, “We can’t do that; we have way too many priority ones,” then that’s the problem you need to fix first. Otherwise, ring the bell, blow the horn, yell “Red Alert,” and put everyone to work checking everything. This is no time for analysis.
If it turns out to be a false alarm, apologize and consider it a good training drill. That will always be better than the alternative.
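For teams that want to encode the “Red Alert” rule rather than leave it to judgment in the moment, the routing logic can be very small. What follows is a rough sketch under stated assumptions: the `Incident` type, the `ALL_HANDS` roster, the `DUTY_ENGINEER` name, and the `respond` function are hypothetical, and the `page` callback stands in for whatever paging tool you actually use.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    summary: str
    priority: int  # 1 is the most severe
    paged: list = field(default_factory=list)


# Hypothetical roster; in practice this would come from an on-call system.
ALL_HANDS = ["network", "database", "application", "security", "service-desk"]
DUTY_ENGINEER = "duty-engineer"


def respond(incident: Incident, page=print) -> Incident:
    """Page first on a priority one; analysis waits until everyone is moving."""
    if incident.priority == 1:
        targets = ALL_HANDS        # ring the bell: every team, immediately
    else:
        targets = [DUTY_ENGINEER]  # lower priorities go through normal triage
    for team in targets:
        page(f"[P{incident.priority}] {incident.summary} -> {team}")
        incident.paged.append(team)
    return incident
```

Calling `respond(Incident("Checkout is down", 1))` pages every team on the roster before anyone has diagnosed anything, which is the point: there is no triage step between the alarm and the all-hands call.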
THERE CAN BE ONLY ONE OFFICER OF THE DECK
Imagine the scene.
It was a bit after 8:30 pm, just over two hours after the fire began, when General Jean-Claude Gallet, head of the Paris fire brigade, walked into a room across the plaza from Notre Dame.
In that room sat the President of France, the Prime Minister, the Mayor of Paris, the Monsignor of Notre Dame, and about twenty other top government and church officials. He had come to tell them that he had decided to let the roof go and focus instead on the tower.
I can only imagine the weight of that decision. But the most remarkable part of this episode is what happened next: nothing.
According to the Times, the assembled officials — all of whom are used to being in charge and being the ones making decisions — did nothing. They did not shout that it was unacceptable; they did not question; they did not demand to make the final decision.
Instead, they did the right thing, the brave thing, and the hard thing: they let General Gallet do his job.
That act of letting the General do his job, and of trusting his decision, is most likely what saved Notre Dame.
During a crisis, there can only be one Officer of the Deck — a naval term for the officer on the bridge and in charge of the ship at any given point in time.
In the event of a priority one event, do you know who is in charge? Do you have a procedure for changing the Officer of the Deck, should you require it? Have you practiced having one person — and only one person — in charge so that everyone knows what that feels like and how they should respond?
In many IT crises, the most significant barriers to resolution are the lack of a defined mechanism for determining who is responsible and for transitioning responsibility when required, and the meddling of higher-ups who feel they must be involved but who do not want to take command.
As the tragedy of Notre Dame demonstrated, however, get this right and you’ll have a fighting chance even if everything else goes wrong.
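The Officer of the Deck idea comes down to three rules: exactly one person is in command, only that person can hand command to someone else, and everyone else advises rather than decides. Here is one possible way to express those rules in code, offered as an illustration only. The `IncidentCommand` class and its methods are hypothetical names, not part of any incident management product.

```python
from datetime import datetime, timezone


class IncidentCommand:
    """Tracks the single person in command and every handoff between them."""

    def __init__(self, commander: str):
        self.commander = commander
        # Each log entry is (timestamp, outgoing commander, incoming commander).
        self.log = [(self._now(), None, commander)]

    @staticmethod
    def _now() -> datetime:
        return datetime.now(timezone.utc)

    def hand_off(self, current: str, successor: str) -> None:
        """Only the person in command can pass command on."""
        if current != self.commander:
            raise PermissionError(f"{current} is not in command")
        self.log.append((self._now(), current, successor))
        self.commander = successor

    def decide(self, who: str, decision: str) -> str:
        """Everyone may advise; only the commander's decision stands."""
        if who != self.commander:
            raise PermissionError(
                f"{who} may advise, but {self.commander} decides"
            )
        return f"{who} decided: {decision}"
```

The handoff log matters as much as the permission checks: after the incident, it shows who was in command at every moment, which is the question most post-incident reviews struggle to answer.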
HEROIC ACTIONS WITHOUT A HERO CULTURE
The story of how the firefighters on the scene saved Notre Dame is a real-life story of heroism. In the face of impossible odds and an almost no-win situation, they pushed through a point-of-no-return and placed their lives on the line to save their beloved cathedral.
While the firefighters were undoubtedly heroes and performed heroic deeds, what they did not exhibit was a hero culture.
They did not charge recklessly into dangerous situations. They did not take actions without direction because they thought it was the right thing to do.
Instead, they were methodical, they were deliberate, and they were in constant communication with their leaders at command, relaying valuable on-the-ground information that those leaders needed to make informed decisions.
In many IT organizations, however, this is not the case during a crisis.
When a priority event occurs, many IT operations teams take it as an excuse to throw out procedures, ignore documentation, and just do “whatever it takes” to solve the problem.
But embracing or accepting this type of hero culture will not help you respond to a crisis more rapidly or effectively — it will have the opposite effect. Worse, it will allow the hero culture to spread like a cancer and undermine the entire operational management model.
When the time comes to ring that bell, you want heroic actions taken within a controlled management model, rather than a hero culture. Confuse them at your peril.
THE INTELLYX TAKE: BE WILLING TO MAKE THE TOUGH DECISIONS
While it may seem a stretch to compare the situation at Notre Dame with the circumstances IT operations teams face, the reality is that the relative fragility and importance of the technology stack present a very similar set of challenges in the modern enterprise.
At least within the context of an organization’s technology-driven competitive posture, many enterprises are now reliant on their own “forest,” as Notre Dame’s attic of ancient timbers was called.
And many IT organizations are just plain not prepared for the day it catches fire.
The lessons from the tragedy of Notre Dame — both good and bad — will help you change that, but only if you’re ready to do so.
There is, however, one final lesson that may be the most important of all: be willing to make the tough decisions.
When General Gallet decided to let the roof go, it had to have been one of the toughest decisions of his life. However, he was able to make it, and to have the courage to face the country’s, city’s, and church’s leadership with it, because he remained steadfastly focused on his only goal: to save Notre Dame.
As an IT leader, your only goal must be to protect the competitive value of the organization. This singular focus should always be the case, but it can be easy to forget or lose sight of during a crisis — which is when it matters the most.
To survive your next crisis — and it will come — you must remain singularly focused on this goal, be willing to step away from the fight long enough to see the big picture, and then have the courage to make the decisions and take the actions you need to take to save your Notre Dame.
Copyright ©2019 Intellyx LLC. Intellyx advises enterprises on their digital transformation initiatives, and publishes the weekly Cortex and BrainCandy newsletters. Sharing or reprint of this work, edited for length with attribution, is authorized under a Creative Commons 4.0 International License. As of the time of writing, none of the organizations mentioned in this article are Intellyx customers. Image credit: manhhai.