Monumental screw-ups! Part One
Dave walked into the makeshift server room mid-afternoon in May 1992.
An electric kettle half filled with water in hand.
As he searched for a power outlet, his analytical mind snapped into action.
That server has a UPS (uninterruptible power supply) attached for any power outage, so I can unplug that server, boil the water and prepare my tea.
Down at the front counter, only a few clients were being served; the busy period had long passed and the 4pm close was less than an hour away.
Back office and support staff were plodding along with their work as per normal.
As the kettle started to rumble with that early sound of water coming to the boil, Dave wondered whether the UPS had actually cut in. It looked ominously dark and, worse still, the server which had given way for a cup of Earl Grey was as dead as a doornail.
The UPS had failed and the production server had just crashed.
Front counter staff started getting a number of error messages, none of which they had seen before, and called out to Bill, the manager.
The clients and staff took a pause as Bill, in a panic, got on the phone to the IT Manager, Jerry, and then ran up the stairs to see what was going on.
Meanwhile, Dave stepped out of the server room and into Jerry’s office to explain what had happened. Jerry quickly beckoned his trusted advisor and senior IT consultant, Bob.
In a flash, Jerry, Bill, Bob and Dave (without his tea) had convened next to the server room. Dave retreated to a quiet, safe place and a Camomile tea, as Jerry, Bill and Bob hightailed it for Mary.
Mary was a key member of the small IT team: an experienced analyst and developer who had taken on a number of responsibilities and learning opportunities in her role. Bleeding-edge client-server technologies, Unix, database, network and system administration were just some of her many hats.
She had a solid grasp of all aspects of the system and the business needs. Mary was the right person for the job and the right person to ensure a full recovery of the production environment.
Around 10pm that night, Mary was home, tired, emotional and upset. Having a full set of manuals at home, she re-read the chapter on database recovery. As she closed the manual, her frustration and disappointment peaked: just as she had suspected, having read this before, the wrong action had been taken some six and a half hours earlier.
What a monumental screw-up!
Mary was crammed in, with Jerry, Bill and Bob hovering over her in close proximity. She pushed her chair as close as possible to her desk, elbows squeezed into her sides as she manoeuvred the mouse and tapped at her keyboard, shoulders up, eyes fixed on the screen ahead. Occasionally a hand would come across, pointing at the monitor, accompanied by suggestions, questions and statements. There was a lot of noise, just enough panic, and proof that too many cooks in the kitchen will spoil the broth!
The urgency was to recover the system. The server had rebooted, but the database was in recovery. A cool, calm, collected mind would have landed on…doing nothing. Always an option and sometimes the best one. In this case, doing nothing was exactly the right call, because this wonderful relational database system was engineered so that, in the event of such a failure, it would auto-magically restore itself and retain the integrity of the data.
The time to recover would depend on the volume of transactions and specific settings in your environment. With the time of day, the prior traffic and these settings, the database would have recovered within 3–4 hours. As it was mid to late afternoon, this should have been a most acceptable scenario: do nothing with the database and initiate some manual processing and alternative arrangements.
Mary couldn’t think. She had no time and no space, and doing nothing was not an option. Intervention came swiftly. It was a mistake, and that database would not be the same: it was not allowed to recover. The panic subsided, the system was back up, the application was running, front counter windows opened for the last fifteen minutes, and Bill, Bob and Jerry retreated to their stations, somewhat content at their achievement.
The real pain was just about to begin. To cut a long story short, Mary and a small crew spent a month of Sundays restoring the integrity of that database; some might argue it was tainted for life…don’t mention month-end reporting! Once everyone had retreated, Mary could see what had unravelled and spent until late that evening putting out spot fires. At home, she laid the database administrator manual down on the table, poured a glass of wine and imagined how she could have acted differently.
The panic that surrounded Mary was understandable. This was a brand new system and the first time something like this had occurred. Only months earlier there had not been a computer or any technology within this organisation beyond a Casio calculator and a conveyor belt, which moved documents along a production line for processing by people with physical stamps in their hands, dead set!
So the first technology failure (Lipton anyone?) was accompanied by a series of screw-ups too valuable not to learn from.
What a monumental learning!
Here are some I wanted to share…what else did you observe, or what springs to mind from anything similar you have experienced?
(1) Assumptions make an ass out of you and me…test them now!
A simple and perhaps logical assumption started this big hullabaloo. Had it not been for the crazy idea that it was in any way OK to boil a kettle in the ‘server’ room, the assumption that the UPS would work in the event of a power failure would still, eventually, have proved false.
A simple controlled test of something so critical is imperative. If you make assumptions, then test them quickly and often.
(2) Be a boy scout…be prepared!
Oh boy, oh boy, oh boy! Some very practical steps can always be taken to be prepared for what can and will happen: have a plan B and be ready to execute.
This was a different time, and a level of maturity had not yet been developed in this environment. Delivering at speed and at a remarkably low cost was brilliant; the lesson here quickly led to business continuity plans and a proper server room being built.
(3) Man your battle stations…calm leadership with proper judgement
Quite a lot to unpack here. As the expert knowledge worker, Mary needed space and support from her colleagues, especially the leaders. Bill would have been best placed back downstairs with his staff and customers, shifting to an early close or manual processing. Jerry should have been providing communication to stakeholders and acting as the point of contact, allowing Mary to lead and own the issue at hand with support from Bob as/when/if she asked for it. Dave was already boiling another kettle, so he did real good.
Fast forward a couple of decades and Mary (as a leader) is pleased when someone provides feedback, in part describing her with the duck analogy: “smooth on top, paddling hard below”. She learnt back in 1992, as she sipped that glass of Cabernet, that her best action that day would have been to stand up, turn to Jerry, Bill and Bob, and ask them politely and sternly to leave the room. Once she had a moment of peace, time and space, she would update them and direct what she needed them to do next.
It’s easier to stand up to a duck than a grizzly bear (nothing against bears…substitute with any animal you are fearful of!).
Why Part One? Well, Mary, Jerry, Bill, Bob and Dave have plenty of friends with stories to share. We would love to hear from your friends also.
Shout out this week to my first boss’s boss, Terry, who on day one welcomed me and said ‘you will get paid to fix your own mistakes’. That was 1985. He was ahead of his time, giving me, a graduate, permission to learn, adapt, take ownership and have no fear…what a duck of a guy!