Solution) Crisis/Incident
Postmortem: A post-mortem is a process that enables an
incident response team to learn from past downtime, outages and
other incidents. During a post-mortem, the incident response team
determines what happened during an incident, identifies what was
done right and what can be corrected, learns from its mistakes and
proceeds accordingly.
A Production Outage
Post-mortem
Production outages can be stressful, but they can also yield
valuable lessons. Here are some tips on conducting a post-mortem
to prevent repeat occurrences.
Working in IT has many benefits: plenty of employment opportunities, interesting and challenging work, and the ability to get involved with a lot of cool technology.
The flip side can be long nights, maddening problems, and - probably dreaded most of all by every IT pro - a production outage, where critical systems or services are rendered unavailable, either through human action or technical failure.
There's no greater stress in IT than being the one responsible for getting the lights back on, especially when the source of the problem is unclear. Additional worries about one's ongoing employment don't help matters, either.
Resolving the problem is often cause
for celebration -- and rightfully so -- but it's important not to
just blithely move on to the next issue. A production outage is a
serious condition that merits significant introspection to help
safeguard the company and one's career against a recurrence of
the problem, or against being impacted by a similar one.
1) What went wrong?
Answer) A common cause for production outages is that safeguards which were put in place to prevent such incidents either didn't work or were ignored.
For example, an Exchange server's log volume fills up, forcing the server to shut down. Emails had been sent to staff for some time alerting them that disk space was low, but these were being filtered to another folder and went unnoticed. Or perhaps the alerts were configured to be sent to one individual rather than the group, and that individual is the former email administrator, no longer with the company. It could be that staff weren't notified via email that a system was down because the notifications relied on that very same system and it was a standalone server.
The cause can be human error, a memory
leak, a failed hardware component, bad firmware, a faulty patch or
some other element. If possible, engage the vendor, since they can
usually zero in on the cause of such problems much more rapidly
than IT staff who juggle multiple responsibilities and
talents.
If the outage was deliberately caused by malicious intent
(something certainly infrequent and likely difficult to establish)
then some measure of discipline should be applied, depending on
managerial and HR standards.
2) Identify any systemic problems.
Answer)
3) Document any lessons
learned.
Answer) Consider
implementing additional monitoring and alerting, such as leveraging
text messaging capabilities to contact IT staff immediately when
potential problems are detected. Perhaps redundancy can be
introduced or improved so that a single server runs in a cluster or
an active/passive setup so a server failure won't cause service
downtime. Using multiple ISPs with multiple internet gateways can
help network traffic keep flowing if there is an ISP outage or an
upstream router fails. Even conducting daily physical walk-throughs
of a data center can come in handy to spot warning lights or
discover alarm bells on a system experiencing problems.
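The disk-space monitoring and group alerting described above can be sketched in a few lines of Python. This is a minimal illustration, not a production monitoring tool; the `notify` callback is a hypothetical stand-in for whatever paging or text-messaging hook the team actually uses.

```python
import shutil

# Alert once a volume crosses this fraction of capacity
# (the same 75% threshold suggested later for testing).
ALERT_THRESHOLD = 0.75

def disk_usage_fraction(path):
    """Return the fraction of the volume containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def check_volume(path, notify, threshold=ALERT_THRESHOLD):
    """Check one volume; call `notify` (a paging/SMS hook) if it is too full.

    Returns True if an alert was sent.
    """
    fraction = disk_usage_fraction(path)
    if fraction >= threshold:
        # Notify the whole on-call group, not a single individual,
        # so the alert survives staff turnover.
        notify(f"Volume {path} is {fraction:.0%} full")
        return True
    return False
```

Run on a schedule (cron, a scheduled task, or an existing monitoring platform), a check like this catches a filling log volume before it forces a shutdown.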
To improve human processes:
The technology part is only half of the improvement plan. Better human practices often go hand-in-hand with preventing future outages, especially if this one was caused by human error or misconduct.
Consider whether a "peer approval" system - whereby one person
types a command and the other person verifies this is correct
before the enter key is pressed - might come in handy.
Implement and test the improvements:
Put your proposed changes in place, document the improvements and notify staff of the details and how to administer them (if applicable) so these will become the new standards going forward.
But don't just blindly trust that this will work and there's no
need for further concern. Test the changes during an arranged
maintenance window. For instance, with the example of the Exchange
server with the full log volume, copy a set of large files to the
drive to bring it up to a level which should trigger an alert (75%
full, for instance) and confirm the appropriate personnel were
contacted accordingly.
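The fill-the-drive test described above can be scripted so the maintenance window is repeatable. This is a rough sketch under the assumption of a Unix-like host; the file name and target volume are placeholders, and after running it you would still manually confirm the alert reached the on-call group, then delete the throwaway file.

```python
import shutil

def bytes_to_reach(path, target_fraction):
    """How many bytes must be written on `path`'s volume to reach
    `target_fraction` of capacity used."""
    usage = shutil.disk_usage(path)
    target_used = int(usage.total * target_fraction)
    return max(0, target_used - usage.used)

def fill_to_threshold(path, filename, target_fraction=0.75):
    """Write a throwaway file sized to push the volume to `target_fraction`
    full, e.g. to verify a 75%-full alert fires during a maintenance window.

    Writes real zero blocks (not a sparse file) so the space is actually
    consumed. Returns the number of bytes written.
    """
    size = bytes_to_reach(path, target_fraction)
    chunk = 1024 * 1024
    with open(filename, "wb") as f:
        remaining = size
        while remaining:
            n = min(chunk, remaining)
            f.write(b"\0" * n)
            remaining -= n
    return size
```

After the alert is confirmed (or found not to fire), remove the file and record the result as part of the post-mortem documentation.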
Decide who to notify:
This can be one of the toughest steps listed here. Now that the
incident is being properly wrapped up and laid to rest, notifying
users or customers of a production outage may still be a necessary
step even after it's been resolved so that they understand what
happened and what's being done about it.
Move on and adjust as needed:
A production outage can be costly, time-consuming, frustrating and even embarrassing. Many an IT professional has taken a hit to their ego and reputation (or the perception thereof) and found it difficult to let go of such episodes and move on.
It's important to do so for the sake of one's morale and career, however -- not to mention not letting such matters eat away at your attention span and thereby causing further technological problems.
Adjust the improvements put in place here as needed and keep in mind some outages may be inevitable, as every ISP or telephone company can attest, so the question should not be, "Did something bad happen?" but "What did we do to solve the problem?".