True zero downtime can never be fully achieved, but it’s still the target that Department of Veterans Affairs (VA) engineers strive for to ensure Veterans have consistent access to essential care and benefits. “Zero downtime” means that an application never goes down for updates, issues, or errors. As a result, application users, whether they’re VA employees or Veterans, have reliable, 24×7 access to the tools and systems they need to do their jobs or take advantage of VA benefits and services.

Application performance monitoring (APM) is the foundational technology that VA has widely implemented to get as close as possible to zero downtime and make sure applications always stay running for Veterans and employees.  

APM alone, however, is not sufficient to reach a zero-downtime goal. Zero or near-zero downtime can be achieved by using APM in conjunction with incident automation, alerting, and the analysis and insights gained from applying machine learning to VA’s performance data. Expanding the use of AIOps, built from VA’s lessons learned in fixing impaired systems, furthers our automation goal of zero downtime.

Application Performance Monitoring

APM tools are essential for maintaining and improving business processes over time, and they are central to VA’s goal of achieving zero downtime across our applications. Recently developed APM tools are pushing us closer to that goal.

In 2021, VA saw a significant decrease in mean time to repair (MTTR)[1]: across 225 critical systems and applications, 77 of which were COVID-critical, MTTR fell from 4.91 days to 2.45 days. This proven reduction meant increased access to these essential healthcare and service applications for Veterans and VA employees during the height of the pandemic.

Understanding Application Performance Monitoring at VA

VA’s Office of Information Technology (OIT) professionals support approximately 1,000 unique applications for Veteran healthcare and benefits-related business services, and we often serve over 20,000 Veterans daily within a single application. All VA applications must perform the business services they were designed for while providing the quality user experience (UX) that Veterans deserve and meeting industry standards.

In working towards a zero-downtime goal, VA has deployed APM on most of its systems. While zero downtime may seem audacious, recent innovations by our Operations Triage Group (OTG), Enterprise Command Center (ECC), and others are bringing that goal closer to reality.

Application performance can be measured across a series of categories specific to the application being developed or improved, from cloud infrastructure to application dependencies. Bandwidth, memory load, and CPU utilization are a few examples of the metrics APM tools can measure. APM tools are also ideally suited for monitoring an application’s business transactions, which provides a more user-centric view of the system. The intent is to collect actionable system intelligence that enables IT professionals to quickly identify and solve system problems and inefficiencies, resulting in improved performance.
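
For illustration only – not a depiction of VA’s actual tooling – here is a minimal sketch of that kind of resource-level measurement in Python, assuming the open-source psutil library and hypothetical threshold values:

```python
# A minimal, illustrative sketch of resource-level measurement;
# the threshold values here are hypothetical, not VA standards.
import psutil

THRESHOLDS = {"cpu_percent": 85.0, "memory_percent": 90.0}

def sample_metrics() -> dict:
    """Collect a single snapshot of CPU and memory utilization."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
    }

def check_thresholds(metrics: dict) -> list[str]:
    """Return the names of any metrics that exceed their thresholds."""
    return [name for name, value in metrics.items()
            if value > THRESHOLDS.get(name, float("inf"))]

if __name__ == "__main__":
    snapshot = sample_metrics()
    breaches = check_thresholds(snapshot)
    print(snapshot, "breaches:", breaches or "none")
```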

Enterprise Command Center makes it easy for our customers to request monitoring through a simple online application portal; a brief illustrative sketch of what such a request might contain follows the list below. The portal enables system owners to take APM action steps to:

  • Onboard new customers,
  • Define required application performance thresholds,
  • Define desired actions and alerts when thresholds are exceeded,
  • Determine when automation is appropriate; and
  • Define required dashboards and associated views.
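
As a hypothetical illustration of those steps – not the ECC portal’s actual schema – a monitoring request defining thresholds, alerts, automation, and dashboards might look something like this:

```python
# Hypothetical monitoring request; field names and values are
# illustrative only and do not reflect the ECC portal's real schema.
monitoring_request = {
    "application": "example-benefits-app",        # assumed application name
    "thresholds": {
        "response_time_ms": 2000,                  # alert if exceeded
        "error_rate_percent": 1.0,
        "cpu_percent": 85,
    },
    "alerts": {
        "notify": ["oncall-team@example.gov"],     # placeholder address
        "severity_when_exceeded": "high",
    },
    "automation": {
        "restart_service_on_repeat_breach": True,  # example automated action
    },
    "dashboards": ["latency", "errors", "infrastructure"],
}
```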

VA Case Study: Poisson Prediction Model

OTG developed a machine learning tool using a Poisson distribution analysis that enables faster detection and resolution of system anomalies than was previously possible. The OTG-developed tool reduces detection of performance degradation from 60 minutes down to 10. This efficiency allows VA to identify and address short-lived, negative application trends more quickly. That matters because many VA systems have only short windows in which to measure usage patterns, such as Veterans connecting to telemedicine appointments.
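
The OTG tool’s internals are not detailed here, but the underlying statistical idea can be sketched: compare an observed count of events in a window against the rate a Poisson model would expect, and flag counts that are highly improbable. The example below is a simplified assumption of that approach using scipy, not the OTG implementation:

```python
# Simplified illustration of Poisson-based anomaly detection;
# this is not the OTG tool, just the general statistical idea.
from scipy.stats import poisson

def is_anomalous(observed_count: int, expected_rate: float,
                 alpha: float = 0.001) -> bool:
    """Flag a count whose probability under Poisson(expected_rate)
    is too extreme in either tail."""
    low_tail = poisson.cdf(observed_count, expected_rate)      # P(X <= observed)
    high_tail = poisson.sf(observed_count - 1, expected_rate)  # P(X >= observed)
    return min(low_tail, high_tail) < alpha

# Example: a service that normally sees ~500 successful connections per
# 10-minute window suddenly sees only 310.
print(is_anomalous(310, expected_rate=500.0))  # True: unusually low
print(is_anomalous(495, expected_rate=500.0))  # False: within normal variation
```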

Poisson Prediction Model Impact

The Poisson distribution analysis tool enables VA to detect abnormal performance behavior up to six times faster than our commercial monitoring tools, allowing for a significant reduction in system impairment time. This approach enabled VA to accommodate the increase in telemedicine appointments from 1,400 to 30,000 per day during the pandemic.

VA Case Study: Service Level Objective Modeling

One potential APM measurement is the use of well-defined service-level objectives (SLOs) in service-level agreements (SLAs). Service-level agreements help us fulfill our promise to Veterans that they receive the highest level of care possible. To measure service-level objectives, our Operations Triage Group’s IT specialists developed a “What If” tool that significantly simplifies selecting more meaningful service-level indicators (SLIs) – specific measurements of service-level objectives – to establish appropriate performance targets for measuring the health of our systems.

The “What If” service-level objective tool is a “no code” solution. Users do not need to understand software development; they only need to understand the variables related to their system’s performance. The “What If” tool allows users to drag and drop key performance indicators (KPIs) from Splunk logs and backtest the KPI data against the proposed service-level indicators. The user is then alerted to the impact of the service-level indicators on the system’s error budget.
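
The “What If” tool itself is internal to VA, but the general idea of backtesting a proposed service-level indicator against historical data can be sketched as follows. The latency data, threshold, and target below are made-up values for illustration:

```python
# Illustrative backtest of a proposed SLI against historical KPI data.
# The data, threshold, and target are hypothetical, not VA values.
def backtest_slo(latencies_ms: list[float], sli_threshold_ms: float,
                 slo_target: float) -> dict:
    """Treat each request under the latency threshold as 'good' and
    report how much of the error budget the history would have used."""
    good = sum(1 for ms in latencies_ms if ms <= sli_threshold_ms)
    achieved = good / len(latencies_ms)
    error_budget = 1.0 - slo_target        # allowed fraction of bad events
    budget_used = (1.0 - achieved) / error_budget if error_budget else float("inf")
    return {"achieved": achieved, "budget_used_fraction": budget_used}

# Example: propose "95% of requests complete within 800 ms" and test it
# against a small sample of historical latencies.
history = [420, 510, 650, 790, 820, 950, 480, 530, 610, 700]
# A budget_used_fraction above 1.0 means the proposed target would have been missed.
print(backtest_slo(history, sli_threshold_ms=800, slo_target=0.95))
```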

How Does APM Contribute to the Zero Downtime Goal?

APM tools collect data on anything that can impact an application’s performance. This data can efficiently show where an issue has occurred in the business processes described by one or more service-level objectives. When a monitored service-level indicator is exceeded and an alert fires, IT can begin containing and remediating the application impairment. The fix required to return the application to adequate business functionality should also be documented for future use in automated repairs.
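
As a hedged sketch of that last point, a documented fix can be captured in machine-readable form so a future automation layer can look it up and replay it when a matching alert fires. The alert names and actions below are hypothetical:

```python
# Hypothetical runbook registry: once a fix is documented, an automation
# layer could look it up and replay it when the matching alert fires.
RUNBOOKS = {
    "queue_backlog_exceeded": ["scale_out_workers", "notify_oncall"],
    "connection_pool_exhausted": ["recycle_pool", "capture_diagnostics"],
}

def remediation_steps(alert_name: str) -> list[str]:
    """Return documented steps for a known alert, or escalate to a human."""
    return RUNBOOKS.get(alert_name, ["escalate_to_engineer"])

print(remediation_steps("queue_backlog_exceeded"))
print(remediation_steps("unknown_alert"))  # falls back to human escalation
```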

Even when applications operate within the boundaries of acceptable performance, achieving business processing functionality as defined in service-level objectives, the performance data monitored by APM tools can be used to focus on specific performance weaknesses or bottlenecks. By monitoring and acting upon these performance bottlenecks, operations and development teams can preemptively work to minimize the computing time and processing energy required for applications to complete their tasks.
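
For instance, even when every endpoint meets its objective, tail-latency data collected by APM tools can be ranked to show where optimization effort would pay off. A minimal sketch, with invented endpoint names and timings:

```python
# Minimal sketch: rank endpoints by approximate 95th-percentile latency
# to see where optimization effort would help most. Data is invented.
import statistics

latency_samples_ms = {
    "/appointments/search": [220, 260, 310, 540, 760, 890],
    "/claims/status":       [120, 130, 150, 170, 210, 240],
    "/documents/upload":    [400, 450, 470, 520, 610, 930],
}

def p95(samples: list[float]) -> float:
    """Approximate 95th percentile (may extrapolate for small samples)."""
    return statistics.quantiles(samples, n=20)[-1]

# Highest p95 first: the top entry is the most promising bottleneck to tackle.
for endpoint, samples in sorted(latency_samples_ms.items(),
                                key=lambda kv: p95(kv[1]), reverse=True):
    print(f"{endpoint}: p95 ≈ {p95(samples):.0f} ms")
```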

By adopting site reliability engineering (SRE) principles, teams can substantially improve application efficiency and resilience, and with them the entire user experience. The result is that enough bottlenecks are removed to prevent future application impairments, minimizing the degree to which heavy CPU loads or a higher-than-expected number of concurrent users can degrade system performance.

Automation is the friend of modern APM delivery, but it can also mask performance issues. Automated load balancing, for example, can obscure which servers are working harder than others.
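
A hedged illustration of that caveat: fleet-average utilization behind a balancer can look healthy while an individual server runs hot, so per-server figures are worth examining. The numbers below are invented:

```python
# Invented example: the fleet average looks fine, but one server behind
# the load balancer is running much hotter than the rest.
import statistics

cpu_by_server = {"app-01": 34.0, "app-02": 38.0, "app-03": 91.0, "app-04": 36.0}

fleet_average = statistics.mean(cpu_by_server.values())
running_hot = {name: cpu for name, cpu in cpu_by_server.items()
               if cpu > 1.5 * fleet_average}

print(f"fleet average: {fleet_average:.1f}% CPU")      # ~49.8%, looks acceptable
print("servers running hot:", running_hot or "none")   # flags app-03 at 91%
```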

VA’s Five Elements of End-to-End Monitoring Strategy

VA’s Enterprise Command Center and others have devised a sophisticated monitoring strategy to help combat system impairments and their impact on our Veterans. From April through August 2022, APM tool usage rose from 51 percent to over 70 percent, illustrating the site reliability engineering principle that the Operations Triage Group and others are promoting – continuous improvement.

VA focuses on five elements of end-to-end application monitoring. These five elements allow for successful cross-organization cooperation to monitor VA systems.

  • Application and device monitoring involves a full-stack monitoring tool that measures application and business performance, monitors the end-user experience, and provides infrastructure visibility.
  • Network and operating system monitoring measures how network latency impacts the end-user experience within applications.
  • Cloud monitoring includes a comprehensive cloud monitoring solution for collecting, analyzing, and providing telemetry from our cloud environments.
  • Security and dashboard monitoring uses a tool to collect data from multiple sources and build dashboards with various views that developers and stakeholders can efficiently consult to make assessments.
  • Event management anticipates issues through predictive analytics that use machine learning or other technologies to correlate events and produce actionable alerts (a simplified sketch follows).
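
To illustrate the event-management idea in the last bullet, related events can be grouped by service and time window so responders see one actionable alert instead of many raw events. The sketch below is a deliberately simplified, rule-based stand-in for the machine learning approaches mentioned above, with hypothetical event records:

```python
# Simplified event correlation: group raw events by service and
# five-minute window so each group surfaces as one actionable alert.
from collections import defaultdict
from datetime import datetime

events = [  # hypothetical raw events
    {"service": "scheduling", "time": "2022-08-01T10:01:10", "msg": "timeout"},
    {"service": "scheduling", "time": "2022-08-01T10:03:45", "msg": "timeout"},
    {"service": "pharmacy",   "time": "2022-08-01T10:02:30", "msg": "5xx spike"},
]

def correlate(events, window_minutes=5):
    """Bucket events by (service, window) and return one alert per bucket."""
    buckets = defaultdict(list)
    for event in events:
        ts = datetime.fromisoformat(event["time"])
        window = ts.replace(minute=ts.minute - ts.minute % window_minutes,
                            second=0, microsecond=0)
        buckets[(event["service"], window)].append(event["msg"])
    return [{"service": svc, "window": win.isoformat(), "events": msgs}
            for (svc, win), msgs in buckets.items()]

for alert in correlate(events):
    print(alert)
```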

Industry-Wide Components of Application Performance Monitoring

VA’s five-element monitoring strategy can be implemented via various APM components that work together to provide comprehensive system monitoring.

  • Analytics and reporting at VA are achieved by gathering data from various processes and modeling that data into actionable information that can:
    • Compare performance changes and infrastructure changes,
    • Use historical and baseline data,
    • Set an expectation for typical app performance; and
    • Predict future problems to solve them before they occur.
  • Components monitoring at VA requires IT specialists to track all the components of our IT infrastructure, including:
    • Servers,
    • Middleware,
    • Application components,
    • Network components; and
    • Operating systems.
  • “User” monitoring at VA tracks user experience in two ways:
    • Agentless monitoring tracks actual user traffic data, including browser type, location, and operating system.
    • Synthetic monitoring uses bots that imitate user behaviors to simulate how users will interact with an application and what errors they may encounter before an application launch occurs (see the sketch after this list).
  • Business transactions, not to be confused with commerce, define how Veterans interact with VA applications. Testing applications under specific conditions creates valuable data to improve the Veteran experience. This enables IT specialists to identify key actions or events that can help or hinder Veterans as they navigate through an application.
  • Runtime application and architecture analysis is the backbone of application performance monitoring and involves analyzing both hardware and software components and how effectively they communicate. Valuable data patterns come from measuring initial application performance and can be used as a benchmark for detecting performance deviations.
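
To make the synthetic-monitoring idea concrete, a scripted probe can exercise an endpoint the way a user would and record status and latency. This sketch uses the open-source requests library; the URL and latency budget are placeholders:

```python
# Illustrative synthetic check: probe a page the way a user's browser
# would and record status and latency. URL and budget are placeholders.
import requests

def synthetic_check(url: str, latency_budget_s: float = 2.0) -> dict:
    """Fetch the URL once and report whether it met the latency budget."""
    response = requests.get(url, timeout=10)
    elapsed = response.elapsed.total_seconds()
    return {
        "url": url,
        "status_code": response.status_code,
        "latency_s": round(elapsed, 3),
        "healthy": response.ok and elapsed <= latency_budget_s,
    }

if __name__ == "__main__":
    print(synthetic_check("https://example.com/"))  # placeholder endpoint
```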

Critical Application Metrics of Application Performance Monitoring

  • Application availability monitoring measures an application’s availability after updates and is used to determine compliance with industry standards.
  • CPU usage monitoring tracks the server effort required to run applications and how much memory is needed at a given time.
  • Web performance monitoring tracks the time it takes a user to navigate an application to see how application performance affects the user experience.
  • Request rate and error rate monitoring enables IT specialists to see the traffic within an application from start to finish, and to note when and why software fails during a user’s journey (a brief sketch follows this list).
  • Number of instances monitoring is vital for cloud-based application performance tracking because it tracks how many instances of an application are running simultaneously.
  • Customer satisfaction monitoring combines what users say about an application with the other performance metrics, providing a clearer sense of where the application is succeeding or struggling.
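
To illustrate the request-rate and error-rate metrics above, here is a minimal sketch that derives both from a handful of hypothetical log records:

```python
# Minimal sketch: derive request rate and error rate from log records.
# The records and the one-minute window are hypothetical.
log_records = [
    {"path": "/claims/status", "status": 200},
    {"path": "/claims/status", "status": 200},
    {"path": "/claims/submit", "status": 500},
    {"path": "/appointments",  "status": 200},
]
window_seconds = 60  # assume these records span one minute

request_rate = len(log_records) / window_seconds             # requests per second
errors = sum(1 for r in log_records if r["status"] >= 500)
error_rate = errors / len(log_records)                        # fraction of requests

print(f"request rate: {request_rate:.2f} req/s, error rate: {error_rate:.0%}")
```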

Building a Roadmap to Success

Clear steps are essential for a successful APM deployment. Our Product Line Management team has developed four maturity levels to measure progress. Each maturity level contains specific and incremental steps that must be followed before proceeding. We are currently implementing aspects of level three, which is a testament to VA’s hard work and prolific progress.

Improved system resiliency brings us closer to zero downtime. As operations professionals and product owners, it is essential that we develop better-defined service-level agreements, service-level objectives, and service-level indicators. We must strive for deeper monitoring aligned directly with the business performance goals outlined in the service-level objectives. Continued IT automation will reduce the toil of repetitive or error-prone work, and clearly documented remediation steps will position us to address system impairments with an eye toward AIOps. Plan your future path with these tools and approaches in mind. Our Veterans are counting on us to deliver.


[1] Mean time to repair (MTTR) is the mean of the time to repair (TTR) measurement. TTR is defined as the time from when the system is first impaired until the system is repaired well enough to be considered functional. Note that this is different from both time to restore and time to resolve: time to restore measures the time from impairment to total restoration of service to perfect condition, whereas time to resolve is the duration of the service ticket from open to closure.
