Network engineers and network managers tend to think of application performance only in terms of speed: their goal is to make the application as fast as possible or, at the very least, as fast as it needs to be. Stability and predictability, however, are two additional factors that contribute to performance. A stable application delivers a consistent user experience, meaning that it performs equally well during peak and off-peak hours. Predictability matters because compute demands rise as the user base grows, sometimes suddenly (e.g., after a viral marketing campaign). A high-performance application should therefore be designed so that you know in advance how an increased load under various conditions will affect the infrastructure and can take the appropriate actions before performance degrades.

While load testing is an essential component of application performance management (APM), there’s much more to APM than that. Load testing alone does not solve problems; the internal workings of the application must be analyzed to identify trouble spots. Furthermore, load testing by itself does not directly help developers avoid creating performance problems in their current or future code, nor does it alert operations staff to current or future performance issues in production. To ensure that APM’s goals of speed, stability, and predictability are met, the performance aspects of the following domains should be incorporated into your APM strategy.

It’s not uncommon for modern applications to have multiple web servers, application servers, and database servers. Consequently, data requests from the client may initiate a chain of convoluted processes that adversely affect the response time of the application, as well as introduce a variation in response times among requests. Making use of data caching where appropriate can alleviate these problems.
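As a rough illustration of the data caching mentioned above, here is a minimal sketch of a time-based (TTL) cache placed in front of a slow backend call. The `TTLCache` class, the `fetch_customer_from_backend` function, and the 30-second lifetime are all hypothetical and chosen only for the example; a real application would more likely use an existing caching layer.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny time-based cache: an entry expires ttl_seconds after it is stored."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[Hashable, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling loader() only on a miss."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = loader()
        self._store[key] = (now + self._ttl, value)
        return value


def fetch_customer_from_backend(customer_id: int) -> dict:
    """Hypothetical stand-in for a chain of web, app, and database calls."""
    return {"id": customer_id, "name": f"customer-{customer_id}"}


cache = TTLCache(ttl_seconds=30.0)
first = cache.get_or_load(("customer", 42), lambda: fetch_customer_from_backend(42))
second = cache.get_or_load(("customer", 42), lambda: fetch_customer_from_backend(42))
print(f"hits={cache.hits} misses={cache.misses}")
```

The second lookup is served from memory, so it skips the backend chain entirely. That shortens the response time and also narrows the variation between requests, at the cost of possibly serving data that is up to 30 seconds old.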

Unfortunately, a disproportionate number of application performance issues originate from poor software engineering, as opposed to hardware limitations. Excessive computation and excessive use of memory, backend interactions, and bandwidth are the usual suspects. Other performance issues come from inefficient use of caching and from memory leaks, in which a growing number of objects remain referenced and therefore cannot be garbage collected.

The connection between configuration management and performance management is simple: if you don’t know what you have, then you can’t monitor it, fix it, tune it, simulate it, calculate it, prevent it, redesign it, etc. Application chain discovery software can automate both the discovery of your application chain and the updating of your configuration information in the database. This information can then feed your incident management and capacity management processes in order to enhance application performance. If you decide to do any configuration management at all, pay careful attention to versioning, which will help when troubleshooting the application in a production environment. Adding pooling information (e.g., max pool sizes and time-outs) to the configuration database might be a worthwhile endeavor, too.

Capacity management is about right-sizing your architecture and infrastructure to make sure that the application can handle traffic peaks, trends among users (e.g., mobile phones vs. desktops), planned events like marketing campaigns, and unplanned events such as outages. The last thing you want is for people to sit there pressing F5 while you’re trying to restore the service.

When right-sizing the application, revisit the application’s history and see if there have been any unplanned events that you can learn from. Going forward, arriving at a working capacity management process involves recording the right data via logs and storing it in a data warehouse for at least one year so that you can look for patterns. Analyzing the data on a regular basis (say, every couple of weeks) will go a long way toward being able to predict the resource requirements of the application over time. Armed with an abundance of data, you can then venture into the area of capacity planning and consider a bunch of “what if” scenarios. For example, what if the number of users tripled? How would that affect the application’s response time? What would happen if you removed a CPU/core?
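The “what if” questions above can be explored with even a very crude model before any hardware is touched. The sketch below is one possible approach, not something prescribed by this article: it assumes that traffic is spread evenly over the cores and that each core behaves like a simple M/M/1 queue, so the mean response time is the service time divided by (1 - utilization). The traffic figures (40 requests per second, 50 ms of CPU per request, 4 cores) are made up for the example.

```python
def estimate_response_time(arrival_rate: float, service_time: float,
                           cores: int) -> float:
    """Approximate mean response time in seconds.

    Assumption: load is split evenly and each core is an independent
    M/M/1 queue, so R = S / (1 - U) with U = arrival_rate * S / cores.
    """
    utilization = arrival_rate * service_time / cores
    if utilization >= 1.0:
        return float("inf")  # saturated: the queue grows without bound
    return service_time / (1.0 - utilization)


# Hypothetical baseline: 40 requests/s, 50 ms of CPU per request, 4 cores.
baseline = estimate_response_time(arrival_rate=40, service_time=0.05, cores=4)
tripled = estimate_response_time(arrival_rate=120, service_time=0.05, cores=4)
one_core_less = estimate_response_time(arrival_rate=40, service_time=0.05, cores=3)

print(f"baseline:       {baseline * 1000:.0f} ms")
print(f"users tripled:  {tripled}")
print(f"one core fewer: {one_core_less * 1000:.0f} ms")
```

Under these assumptions, removing a core raises the estimated response time from about 100 ms to about 150 ms, while tripling the traffic pushes utilization past 100% and the model reports saturation. Real systems rarely match such a model exactly, which is why the estimates should be checked against the historical data in your data warehouse.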

Although it could be said that end-to-end monitoring falls under capacity management, it’s significant enough to place it in a separate domain. Since the Apache or IIS logs that you collect as part of your capacity management efforts may indicate application response times that are inconsistent with what your users are experiencing, you need to look at other factors that influence how long it takes for requests to be fully served, such as loading images and stylesheets, as well as the parsing of JavaScript. One way to assess the correlation between the response times shown in the logs and what the users actually experience is to perform a video analysis of the screens of your users as they navigate your application.

Application profiling is the process of learning what your application is doing from both the “outside in” (network profiling) and the “inside out” (language or framework profiling). In the former case, you would look at the network traffic entering and exiting your server and see if there are any inefficiencies. For example, if a user requests a single postal address, you wouldn’t want to see that the entire set of postal addresses in your database is being returned. Network profiling isn’t always enough to tune the application, though. Sometimes you can’t see where the CPU or memory issues are coming from. In that situation, you can turn to the profiling tools that exist for several languages and frameworks. They show you where heavy CPU and memory usage is happening, especially when you run them in conjunction with a load test.
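As an example of “inside out” profiling, the sketch below uses `cProfile` and `pstats` from the Python standard library to find where CPU time goes. The `handle_request` and `find_duplicates_slow` functions are hypothetical and deliberately inefficient; other languages and frameworks have comparable tools.

```python
import cProfile
import io
import pstats


def find_duplicates_slow(items):
    """Deliberately quadratic: items.count rescans the list for every element."""
    return sorted({x for x in items if items.count(x) > 1})


def handle_request():
    """Hypothetical request handler that hides the expensive call."""
    data = [i % 500 for i in range(3000)]
    return find_duplicates_slow(data)


profiler = cProfile.Profile()
profiler.enable()
result = handle_request()
profiler.disable()

# Render the five most expensive entries, ordered by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
report = buffer.getvalue()
print(report)
```

The report lists `find_duplicates_slow` and the `count` calls inside it near the top, which points directly at the code worth rewriting. Running the same profiler while a load test is in progress gives a more realistic picture than profiling a single request.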

As mentioned previously, there’s more to APM than load testing. Load testing itself is also more than a single test; a battery of load tests should be conducted during the process:

Peak load test – simulates what the infrastructure experiences during a traffic peak

Duration test – simulates the amount of traffic that your application would normally get for a month or longer and analyzes the results by looking for things like memory leaks, connection leaks, abnormally high log file growth rates, etc.

Break point test – increases the load of the application until something breaks; also an important part of capacity management

Session exchange test – determines whether the application is vulnerable to session mix-ups (one user’s session data ends up in another user’s session because both users clicked on the same thing at the same time)

Fault tolerance test – switches off part of the infrastructure to see how the application responds
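Of the tests listed above, the break point test is the easiest to sketch in a few lines: raise the load step by step until the system fails or becomes unacceptably slow. In the sketch below, `simulated_service` is a hypothetical stand-in for a real load generator run, and its capacity of 400 users and its response-time curve are invented for the example.

```python
from typing import Callable, Optional


def simulated_service(concurrent_users: int) -> float:
    """Hypothetical load run: returns the mean response time in seconds."""
    capacity = 400  # assumed saturation point of the pretend system
    if concurrent_users >= capacity:
        raise RuntimeError("connection pool exhausted")
    return 0.05 / (1.0 - concurrent_users / capacity)


def find_break_point(run_load: Callable[[int], float], start: int, step: int,
                     max_users: int, max_response_time: float) -> Optional[int]:
    """Ramp the load up in steps.

    Returns the first load level that raises an error or exceeds
    max_response_time, or None if nothing breaks up to max_users.
    """
    for users in range(start, max_users + 1, step):
        try:
            response_time = run_load(users)
        except Exception as error:
            print(f"{users} users: failed ({error})")
            return users
        print(f"{users} users: {response_time * 1000:.0f} ms")
        if response_time > max_response_time:
            return users
    return None


break_point = find_break_point(simulated_service, start=50, step=50,
                               max_users=1000, max_response_time=0.5)
print(f"break point: {break_point} concurrent users")
```

With a real system, `run_load` would drive your load testing tool at the given level and report what it measured. The load level at which things break then becomes an input for capacity management.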

Performance Troubleshooting

When a bottleneck or other problem has surfaced, that’s not the time to get to know your application. The ideal approach to performance troubleshooting is to apply the knowledge that you have gained from all of the above domains to quickly solve performance problems in production. Essentially, it’s better to be proactive than reactive.

Response time, or time-behavior, is an example of a performance-related non-functional requirement (NFR) that can end up in your Service Level Agreement (SLA). Many other NFRs can influence performance, too, such as resource utilization, fault tolerance, capacity, and efficiency. For further reference, ISO/IEC 25010 is a useful model for assessing software quality by identifying and evaluating the NFRs that pertain to performance, performance testing, or performance analysis, all of which should factor into your SLA management efforts.
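A response-time requirement in an SLA is often phrased as a percentile, such as “90% of requests complete within 250 ms”, although the exact wording varies from one agreement to the next. The sketch below checks a set of measured response times against such a target using the nearest-rank percentile. The sample values and the 250 ms limit are invented for the example.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def meets_sla(samples: Sequence[float], pct: float, limit_seconds: float) -> bool:
    """True if the pct-th percentile response time is within the limit."""
    return percentile(samples, pct) <= limit_seconds


# Hypothetical response times in seconds, e.g. taken from access logs.
response_times = [0.12, 0.15, 0.11, 0.30, 0.14, 0.13, 0.95, 0.16, 0.12, 0.18]

median = percentile(response_times, 50)
p90 = percentile(response_times, 90)
sla_ok = meets_sla(response_times, pct=90, limit_seconds=0.25)

print(f"median={median:.2f}s p90={p90:.2f}s meets SLA: {sla_ok}")
```

In this made-up sample the median looks healthy, but the slowest requests push the 90th percentile over the limit. This is one reason why an SLA based only on average response time can hide the experience of a sizeable group of users.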

Today’s application environments are highly distributed and in constant flux as they change and expand. Transactions move through multiple servers, and calls may branch out into a handful of threads. When a performance issue arises, any part of this large ecosystem of hardware, code, network, and back-ends could be the culprit. A comprehensive application performance management strategy will extend beyond load testing and allow you to pinpoint areas of concern within your application or its supporting infrastructure so that you can optimize for speed, stability, and predictability. These factors affect user satisfaction, which in turn affects the application’s profitability.