There are clear problems with the way organizations deal with software incidents today. According to The State of Monitoring 2016, 83% of the study’s respondents identified “quickly remediating service disruptions” as one of the top 5 monitoring challenges for their organizations. From the same report, 76% of respondents also placed “quickly identifying service disruptions” among their top 5 monitoring challenges. Despite how important these challenges are, only “12% are very satisfied with their approach” — leaving plenty of room for improvement.
Why are organizations unable to quickly identify and resolve incidents? Before we speculate, let’s take a deeper look at the typical software incident process and highlight some of its pitfalls.
It all starts with the software incident. For the sake of discussion, let’s focus on software faults, though many organizations use this process for software incidents in general. Software faults are initially reported to front-line response, which consists of either an SRE/Operations team or a set of “on-call” engineers. In order of priority, front-line response must then 1) get the system back in working order, 2) assess the impact of the fault, and 3) gather data for later stages of investigation.
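The third priority — gathering data for later investigation — is often the first thing dropped under pressure, but much of it can be automated ahead of time. As an illustrative sketch (using Python’s standard `faulthandler` module, not any tooling specific to the process described here), a service can be configured to record a backtrace for every thread the moment a fatal fault occurs:

```python
import faulthandler
import sys

# Register a handler that dumps the backtrace of every thread to a
# log file if the process hits a fatal fault (SIGSEGV, SIGFPE,
# SIGABRT, ...), so front-line response has data to hand off even
# when no one was watching at the time of the crash.
crash_log = open("crash_backtraces.log", "w")
faulthandler.enable(file=crash_log, all_threads=True)

# The same machinery can also be invoked on demand mid-incident,
# e.g. to snapshot what a wedged process is currently doing:
faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```

The file name `crash_backtraces.log` is a placeholder; in practice the output would go wherever the rest of the team’s incident tooling expects to find it.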
The nature of data gathering and analysis at this stage leads to missteps with serious consequences. Even with [core dumps](https://en.wikipedia.org/wiki/Core_dump), front-line response will often miss security attacks or software regressions that only become more apparent as incidents compound. Important signals are missed, impact is underestimated, and the fault is poorly prioritized and triaged as the process continues.
Developers are actively involved in the process as they help out with response and begin a deeper investigation of the fault. The State of Monitoring 2016 concurs: 60% of respondents “agree that developers are actively involved in supporting applications.”
It’s at this stage of the process where crucial clues are missed yet again. Whether you have a backtrace, a set of logs, or a core dump, it’s easy to overlook important information that could resolve the hairiest of bugs. This information loss may force teams to wait for multiple instances of the same fault, or to pull in domain experts, before a root cause is found.
After the incident is resolved and the dust has begun to settle, the team and management must go through the time-consuming process of compiling and summarizing data to provide context to impacted parties. Supporting data for this report is often manually gathered from disparate systems. This stage could take days, if not weeks, before the post-mortem report is finalized and the incident can be brought to closure.
From the initial notification to incident closure, the process of dealing with software incidents is full of inefficiencies. The involvement of multiple teams, conflicting priorities, and the nature of this process all contribute to the incredibly high cost of software failures to an organization. These costs ultimately delay software releases, lower team morale, and hurt company reputation.
At Backtrace, we’ve set out to improve this process with a holistic solution that tackles the pain points discussed above and much more. We’ll follow up with a post discussing our platform, but in the meantime, sign up for a free trial or reach out to us.