Question

In: Computer Science

Explain 1, 2, and 3, using a computer?

Conduct a crisis postmortem.
1- Evaluate what was wrong.
2- Identify any systematic problems.
3- Document any lessons learned.

Just explain the points.

Solutions

Expert Solution

Solution) Crisis/Incident Postmortem: A post-mortem is a process that enables an incident response team to learn from past downtime, outages and other incidents. During the post-mortem, the team determines what happened during the incident, identifies what was done right and what needs to be corrected, and applies those lessons going forward.

A Production Outage Post-mortem

Production outages can be stressful, but they can also result in valuable lessons. Here are some tips on conducting a post-mortem to prevent repeat occurrences.

Working in IT has many benefits: plenty of employment opportunities, interesting and challenging work, and the chance to get involved with a lot of cool technology.

The flip side can be long nights, maddening problems, and - probably dreaded most of all by every IT pro - a production outage, where critical systems or services are rendered unavailable, either through human action or technical failure.

There's no greater stress in IT than being the one responsible for getting the lights back on, especially when the source of the problem is unclear. Additional worries about one's ongoing employment don't help matters, either.

Resolving the problem is often cause for celebration -- and rightfully so -- but it's important not to just blithely move on to the next issue. A production outage is a serious event that merits careful review, to help safeguard the company and one's career against a recurrence of the problem or the impact of a similar one.



1) What went wrong?

Answer) A common cause of production outages is that the safeguards put in place to prevent such incidents either didn't work or were ignored.

For example, an Exchange server's log volume fills up, forcing the server to shut down. Emails had been sent to staff for some time alerting them that the disk space was low, but these were being filtered to another folder and went unnoticed. Or perhaps the alerts were configured to be sent to one individual rather than the group, and that individual is the former email administrator, who is no longer with the company. It could also be that staff weren't notified via email that a system was down because the notifications relied on that very same system, and it was a standalone server.
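The alert-routing failures described above can be checked for mechanically. The following is a minimal sketch, not a real monitoring tool's API: the `AlertRule` type, the `audit_alert_rules` function, and the example addresses are all hypothetical names chosen for illustration. It assumes you can export each alert's recipient list and a set of addresses that are still actively monitored.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical record: one alert and the addresses it is sent to."""
    name: str
    recipients: Tuple[str, ...]


def audit_alert_rules(rules: Iterable[AlertRule],
                      active_addresses: Set[str]) -> Dict[str, List[str]]:
    """Return {rule name: [problems]} for rules whose alerts may go unseen."""
    problems: Dict[str, List[str]] = {}
    for rule in rules:
        found: List[str] = []
        # Recipients who left the company or whose mailbox is no longer read.
        inactive = [r for r in rule.recipients if r not in active_addresses]
        if inactive:
            found.append("inactive recipients: " + ", ".join(inactive))
        # If nobody on the list is active, the alert reaches no one at all.
        if len(inactive) == len(rule.recipients):
            found.append("no active recipient")
        if found:
            problems[rule.name] = found
    return problems


if __name__ == "__main__":
    rules = [
        AlertRule("exchange-log-disk", ("former.admin@example.com",)),
        AlertRule("file-server-down", ("it-team@example.com",)),
    ]
    print(audit_alert_rules(rules, {"it-team@example.com"}))
```

A check like this only covers who receives an alert. It does not detect mail rules that file alerts into an unread folder, or alerts that depend on the very system that is down. Those still need a manual review.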

The root cause can be human error, a memory leak, a failed hardware component, bad firmware, a faulty patch or some other element. If possible, engage the vendor, since they can usually zero in on the cause of such problems much more rapidly than in-house IT staff who juggle multiple responsibilities.

If the outage was deliberately caused by malicious intent (something certainly infrequent and likely difficult to establish) then some measure of discipline should be applied, depending on managerial and HR standards.

2) Identify any systematic problems.

Answer)

  • Establish which systems or services were affected, for example multiple file servers being unavailable.
  • Look for capacity problems: an Exchange server's log volume filled up, forcing the server to shut down.
  • Look for alerting problems: emails had been sent to staff for some time alerting them that the disk space was low, but these were being filtered to another folder and went unnoticed.

3) Document any lessons learned.

Answer) Consider implementing additional monitoring and alerting, such as using text messaging to contact IT staff immediately when potential problems are detected. Perhaps redundancy can be introduced or improved, so that a service hosted on a single server runs in a cluster or an active/passive setup and a server failure won't cause downtime. Using multiple ISPs with multiple internet gateways can help network traffic keep flowing if there is an ISP outage or an upstream router fails. Even conducting daily physical walk-throughs of a data center can come in handy to spot warning lights or hear alarms on a system experiencing problems.
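As an illustration of the monitoring idea, here is a minimal sketch of a disk-space check using Python's standard `shutil.disk_usage`. The function names and the `notify` callback are assumptions made for this example. In practice `notify` would wrap whatever paging or text-messaging service the team uses, which is not specified in the answer above.

```python
import shutil
from typing import Callable


def percent_used(path: str) -> float:
    """Percentage of the volume containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_volume(path: str, threshold_percent: float,
                 notify: Callable[[str], None]) -> bool:
    """Call `notify` and return True if usage is at or above the threshold."""
    used = percent_used(path)
    if used >= threshold_percent:
        notify(f"{path} is {used:.1f}% full (threshold {threshold_percent}%)")
        return True
    return False


if __name__ == "__main__":
    # `print` stands in for a real notification channel (hypothetical).
    check_volume(".", 75, print)
```

Passing the notifier in as a parameter keeps the check independent of the delivery channel, so the alert path can be routed through something other than the system being monitored.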

To improve human processes:

The technology part is only half of the improvement plan. Better human practices often go hand-in-hand with preventing future outages, especially if this one was caused by human error or misconduct.

Consider whether a "peer approval" system - whereby one person types a command and the other person verifies this is correct before the enter key is pressed - might come in handy.
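One way to picture the "peer approval" idea in software is a command object that refuses to run until someone other than its author has approved it. This is only a sketch under stated assumptions: `PendingCommand`, `PeerApprovalError`, and the `executor` callback are hypothetical names, and a real deployment would also need authentication and an audit log.

```python
from typing import Callable, Optional


class PeerApprovalError(Exception):
    """Raised when the two-person rule is violated."""


class PendingCommand:
    """A command that must be approved by a second person before it runs."""

    def __init__(self, command: str, author: str) -> None:
        self.command = command
        self.author = author
        self.approver: Optional[str] = None

    def approve(self, reviewer: str) -> None:
        # The author cannot act as their own reviewer.
        if reviewer == self.author:
            raise PeerApprovalError("author cannot approve their own command")
        self.approver = reviewer

    def run(self, executor: Callable[[str], object]) -> object:
        # Refuse to execute anything that has not been reviewed.
        if self.approver is None:
            raise PeerApprovalError("command has not been approved by a peer")
        return executor(self.command)


if __name__ == "__main__":
    pending = PendingCommand("restart mail service", author="alice")
    pending.approve("bob")
    # `print` stands in for whatever actually executes the command.
    pending.run(print)
```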

Implement and test the improvements:

Put your proposed changes in place, document the improvements and notify staff of the details and how to administer them (if applicable) so these will become the new standards going forward.

But don't just blindly trust that this will work and there's no need for further concern. Test the changes during an arranged maintenance window. For instance, with the example of the Exchange server with the full log volume, copy a set of large files to the drive to bring it up to a level which should trigger an alert (75% full, for instance) and confirm the appropriate personnel were contacted accordingly.
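When planning such a test, it helps to work out in advance how much data must be copied to cross the alert threshold. The helper below is a small sketch of that arithmetic. The function name is made up for this example, and it assumes the threshold is a whole-number percentage.

```python
def bytes_needed_to_reach(total_bytes: int, used_bytes: int,
                          target_percent: int) -> int:
    """Bytes to add so that usage reaches at least target_percent of the volume."""
    # Integer ceiling of total_bytes * target_percent / 100.
    target_used = (total_bytes * target_percent + 99) // 100
    # Already at or past the threshold: nothing needs to be copied.
    return max(0, target_used - used_bytes)


if __name__ == "__main__":
    gib = 1024 ** 3
    # A 100 GiB volume with 40 GiB used needs 35 GiB more to reach 75%.
    print(bytes_needed_to_reach(100 * gib, 40 * gib, 75) // gib)
```

After copying that much test data, confirm that the alert actually reached the intended people, then remove the test files.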

Decide who to notify:

This can be one of the toughest steps listed here. Even after the incident has been resolved and properly wrapped up, notifying users or customers of the production outage may still be necessary so that they understand what happened and what's being done about it.

Move on and adjust as needed:

A production outage can be costly, time-consuming, frustrating and even embarrassing. Many an IT professional has taken a hit to their ego and reputation (or the perception thereof) and found it difficult to let go of such episodes and move on.

It's important to do so, however, for the sake of one's morale and career -- and so that such matters don't eat away at your attention and cause further technical problems.

Adjust the improvements put in place here as needed, and keep in mind that some outages may be inevitable, as every ISP or telephone company can attest. The question should not be "Did something bad happen?" but "What did we do to solve the problem?"

