How Fast Can Your Organization Identify and Resolve IT Outages?

Another week, another IT outage—more lessons we can’t ignore


It has happened again. It was only October 20 when an IT disruption in a major cloud provider’s environment impacted the applications and websites of hundreds of enterprise businesses and government agencies around the world. Just a week later, on October 29, another public cloud provider suffered a major disruption that affected hundreds of other web-based services, customer-facing applications, and online games.

The most recent outage reportedly started around 11:40 a.m. ET. The failure triggered widespread connectivity issues for millions of users and was a reminder of how dependent modern communications are on a small number of major networking providers.

Problem Identification and Resolution

The issue was determined to have been an inadvertent configuration change error; others labeled it a “problematic configuration change.” To resolve it, the provider deployed the last known good configuration and, by 5:30 p.m. that same day, believed the issue was on its way to being resolved. Full recovery, however, required additional reloading of configurations and rebalancing of traffic across a large number of nodes. The provider ultimately restored normal operations at scale, and shortly after 8:00 p.m. that evening the vast majority of services were back online.
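
The provider has not published details of its tooling, but the “last known good configuration” pattern itself is worth internalizing. The sketch below is a hypothetical Python illustration, with invented file paths and validation keys, of how a deployment step might validate a candidate configuration and automatically fall back to the most recent known-good snapshot when validation fails.

    import json
    import shutil
    from pathlib import Path

    # Hypothetical locations for the active config and the last validated snapshot.
    ACTIVE = Path("/etc/myservice/config.json")
    KNOWN_GOOD = Path("/etc/myservice/config.known-good.json")

    def validate_config(path: Path) -> bool:
        """Minimal sanity check: the file must parse and contain the keys we expect."""
        try:
            cfg = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            return False
        return all(key in cfg for key in ("listeners", "upstreams", "timeouts"))

    def deploy(candidate: Path) -> None:
        """Promote a candidate config, or roll back to the last known good one."""
        if validate_config(candidate):
            shutil.copy(candidate, ACTIVE)
            shutil.copy(candidate, KNOWN_GOOD)  # candidate becomes the new known good
            print("deployed new configuration")
        else:
            shutil.copy(KNOWN_GOOD, ACTIVE)     # automatic rollback
            print("candidate rejected; restored last known good configuration")

Real change management adds staged rollouts, health checks, and traffic rebalancing on top of this, but the core idea is the same: never promote a change you cannot validate, and always keep a configuration you can roll back to.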

According to the Uptime Institute, configuration and change management failures are among the top causes of IT outages, accounting for:

  • 45 percent of network outages
  • 64 percent of system/software outages

It bears repeating: IT disruptions are going to happen. If catastrophic outages can occur in the environments of the world’s largest and most advanced technology leaders, then they can happen in any enterprise.

What Organizations Can Learn from These IT Outages

Companies large and small have had to deal with outages at their valued cloud providers over the last couple of weeks, just as they did during last summer’s major outage. Consider what could happen if such an outage occurred in your own corporate environment. We posed the question last week: Are you prepared for an outage in your environment? Let’s cut through the noise and look at the cause and outage duration of each incident.

  • Last summer, a faulty software upgrade resulted in hours to days of disruptions and remediation for affected enterprises
  • Last week, a Domain Name System (DNS) issue took approximately 15 hours from detection to remediation
  • This week, a problematic configuration change caused an eight-hour outage

These are all very common, frequently occurring problems that can easily happen in any enterprise network. Would it mean an 8-hour or 15-hour outage for your business? Would it be shorter? Could it be longer?

How fast can your IT organization go from detection to remediation if one of these common issues occurs in your environment?

The Need for Rapid Response to IT Outages in Your Network

In the wake of two major internet outages just a week apart, there is no doubt that IT and executive leadership at corporations and government agencies around the world are having conversations. Some of those conversations will examine their networks’ resilience, disaster recovery, and redundancy. Others will lead to new policies and processes for handling outages.

In our last blog, we offered a four-step process to follow if your organization is experiencing a network disruption:

  • Implement true observability—not just monitoring
  • Establish incident readiness processes
  • Understand what you control and don’t control
  • Build collaboration across teams and vendors

These principles have been reinforced by recent events, and the guidance in our previous blog is worth revisiting, because each step plays a meaningful role in improving how your organization responds to disruptions.

How Observability Can Impact MTTR

What did you think of your answer to the question “How fast can your organization identify and recover from an outage?” If the answer was unacceptable, now might be an excellent time to evaluate how observability can reduce mean time to restore (MTTR) services following a disruption.

You’re not alone. In Enterprise Management Associates’ April 2025 study “Enterprise Strategies for Hybrid, Multi-Cloud Networks,” the IT research and consulting firm reported that only 29 percent of survey respondents were fully satisfied with their monitoring solution. Legacy, reactive troubleshooting tools; individual vendor point tools; and gaps in visibility have made many monitoring products obsolete.

Implementing the Best in Observability and DPI at Scale

Leveraging the right observability solution, supported by deep packet inspection (DPI) at scale, can significantly reduce MTTR when issues arise in your network. Much has changed in how networks are architected; many have evolved into more distributed, hybrid environments. Some critical business services might still run in private cloud environments, while many everyday applications now live in public cloud or colocation facilities, or are delivered via software-as-a-service (SaaS) and unified-communications-as-a-service (UCaaS) providers. Employees may connect via virtual private network (VPN) or virtual desktop infrastructure (VDI) hosted in colocation sites, and internet and wide-area network (WAN) services are delivered by multiple carriers around the world. The path from user to application has become far more complex, and IT organizations no longer have full visibility or control across all potential points of failure.

An observability strategy that overcomes visibility gaps enterprisewide, from remote locations through to hybrid/multicloud environments, and is built on DPI at scale has the potential to dramatically reduce MTTR. DPI-based observability reveals the actual traffic flows across the infrastructure, showing the interactions between applications, services, and networks in real time. For instance, when DNS fails, a software update breaks a dependency, or a configuration change impacts service delivery, DPI can help pinpoint where in the ecosystem the problem exists and which community of users it is impacting. DPI reduces the mean time to knowledge (MTTK) about why the problem exists and lowers the overall MTTR for services in the environment (see Figure 1).
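
To make that idea concrete, here is a minimal sketch, using only the Python standard library and hypothetical host names, of the kind of segmentation that packet-level visibility provides: timing DNS resolution separately from the TCP connection tells you immediately whether a failure like last week’s lives in name resolution or in the network path to the application.

    import socket
    import time

    def check_service(hostname: str, port: int = 443, timeout: float = 3.0) -> None:
        """Separate DNS resolution from TCP connectivity to localize a failure."""
        # Stage 1: DNS resolution
        start = time.monotonic()
        try:
            addr = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)[0][4][0]
        except socket.gaierror as exc:
            print(f"{hostname}: DNS failure after {time.monotonic() - start:.2f}s ({exc})")
            return
        dns_ms = (time.monotonic() - start) * 1000

        # Stage 2: TCP connection to the resolved address
        start = time.monotonic()
        try:
            with socket.create_connection((addr, port), timeout=timeout):
                tcp_ms = (time.monotonic() - start) * 1000
                print(f"{hostname}: DNS {dns_ms:.0f} ms, TCP connect {tcp_ms:.0f} ms via {addr}")
        except OSError as exc:
            print(f"{hostname}: DNS OK ({dns_ms:.0f} ms), but TCP connect failed ({exc})")

    # Hypothetical targets; substitute the services your users actually depend on.
    for host in ("app.example.com", "api.example.com"):
        check_service(host)

A DPI-based observability platform does this passively and continuously for every conversation on the wire rather than for a handful of test probes, but even this simple active check shows why separating the stages shortens the "where is it broken?" phase of triage.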

Figure 1: The MTTR problem management lifecycle comprises four stages: identification (MTTI), knowledge (MTTK), fix, and verify, each taking a varying share of the overall time. By reducing the time spent in the MTTI and MTTK stages, an observability strategy leveraging DPI at scale can provide ecosystemwide analysis that dramatically lowers overall MTTR and helps resolve problems such as the ones causing recent internet outages.
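
As a back-of-the-envelope illustration of Figure 1 (the stage durations below are hypothetical, not taken from the incidents above), note that the fix and verify work often stays the same; the savings come from knowing sooner where the problem is and why it exists.

    # Hypothetical stage durations, in minutes, for a single incident.
    baseline = {"identify (MTTI)": 120, "knowledge (MTTK)": 180, "fix": 60, "verify": 30}
    with_dpi = {"identify (MTTI)": 15,  "knowledge (MTTK)": 30,  "fix": 60, "verify": 30}

    print(f"baseline MTTR: {sum(baseline.values())} minutes")                 # 390 minutes (6.5 hours)
    print(f"with packet-level visibility: {sum(with_dpi.values())} minutes")  # 135 minutes (2.25 hours)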

Are You Ready to Reduce MTTR in Your Environment?

No one wants to be the latest headline in a news cycle. If resolving incidents in your environment takes longer than it should, delaying action only increases risk. Modern networks are more distributed, complex, and dependent on third-party services than ever, which makes identifying issues and restoring services difficult without the right visibility. NETSCOUT can help you build an observability strategy that restores control, accelerates resolution, and strengthens resilience in your digital environment.

Learn more about NETSCOUT’s observability solutions and how you can use DPI for Smart Data to put control at your fingertips.