Performance 101: Measurement

There is an old joke in the programming world regarding optimization:

The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.

Optimizing software is tricky business. One thing I've learned over the years is that the problem is almost never where you think it is.

Math

To get started, we need to have a quick math refresher. Say that you have an application that reads a REST request, makes some backend calls, and returns a JSON response.

The amount you gain from an optimization depends on how much the optimized chunk of code contributes to the overall runtime. If one backend call is responsible for 8% of your runtime, and you are able to optimize it 100% (reduce its time to zero), the overall runtime only improves by 8%.

In our 8% example, a more realistic optimization is somewhere around 20%. Measured against the total, that saves 8% of 20%, or 1.6% of the runtime. Now imagine a different function that is responsible for 20% of your runtime, but that you can only optimize by 10%. This actually works out better, since it speeds up the application by 2% overall.
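
As a sanity check, here is the arithmetic from the two examples above as a tiny sketch (the class and method names are ours, not from any library):

    // overall savings = (fraction of total runtime) x (fraction of that chunk you eliminate)
    public final class SpeedupMath {
        static double overallSavings(double fractionOfRuntime, double localImprovement) {
            return fractionOfRuntime * localImprovement;
        }

        public static void main(String[] args) {
            System.out.println(overallSavings(0.08, 0.20)); // 0.016 -> 1.6% of total runtime
            System.out.println(overallSavings(0.20, 0.10)); // 0.020 -> 2.0% of total runtime
        }
    }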

What you really want to do is identify the biggest pain points, and tackle those first. But, how do you find those?

Average, Average, and Average

We all know the term "average," but we need to be more specific when it comes to analysis. Let's say we have a collection of 20 numbers. Each individual number is called a sample.

If you remember your math, there are three things that could mean "average": mean, median, and mode. Mode is not generally useful in this scenario; mean and median are. Mean is what most people are referring to when they say average: add all the samples up and divide by the number of samples. The median is the sample that sits right in the middle once you sort the numbers. If your distribution of samples is symmetric (like a bell curve), the mean and median should be quite close. When they are not, you have a skewed distribution.
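
A minimal sketch of the two, with a deliberately skewed set of samples (the names and numbers are ours):

    import java.util.Arrays;

    public final class Averages {
        static double mean(double[] samples) {
            double sum = 0;
            for (double s : samples) sum += s;
            return sum / samples.length;
        }

        static double median(double[] samples) {
            double[] sorted = samples.clone();
            Arrays.sort(sorted);
            int mid = sorted.length / 2;
            // Even-length sets take the midpoint of the two middle samples.
            return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
        }

        public static void main(String[] args) {
            double[] responseTimesMs = {80, 90, 95, 100, 110, 120, 2400}; // one slow outlier
            System.out.println(mean(responseTimesMs));   // ~427.9 -> dragged up by the outlier
            System.out.println(median(responseTimesMs)); // 100.0  -> unaffected by the outlier
        }
    }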

Percentiles

Averages give you a decent idea of what is happening, but they don't tell the whole story. Imagine I told you that your mean webpage response time was 500ms. Seems ok, right? What if I then said that 40% of the people were under 100ms, 40% were under 500ms, and 20% were over 2 seconds? The mean can still work out to 500ms, yet 20% of your users have an absolutely terrible experience. This is what percentiles capture: the Nth percentile is the value that N% of the samples, when sorted, fall at or below. In much the same way we measure uptimes, there are a few key percentiles we will care about: the 95th, 99th, and 99.9th.
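
A rough sketch of reading a percentile off a set of samples, using the simple "nearest rank" method (real tools often interpolate; the numbers below are made up):

    import java.util.Arrays;

    public final class Percentiles {
        static double percentile(double[] samples, double p) {
            double[] sorted = samples.clone();
            Arrays.sort(sorted);
            // Nearest rank: the smallest sample such that p percent of the set is at or below it.
            int rank = (int) Math.ceil(p / 100.0 * sorted.length);
            return sorted[Math.max(0, rank - 1)];
        }

        public static void main(String[] args) {
            double[] responseTimesMs = {80, 90, 95, 100, 110, 500, 520, 540, 2100, 2300};
            System.out.println(percentile(responseTimesMs, 50)); // 110.0
            System.out.println(percentile(responseTimesMs, 95)); // 2300.0
        }
    }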

Standard Deviation

This should be the last bit of math. When you take the mean of a set of numbers, you can also calculate a standard deviation: roughly, how far the samples typically sit from the mean (formally, the square root of the average squared distance from the mean). The smaller the number, the more concentrated your samples. The larger the number, the more variability you have.
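
A quick sketch of the calculation (population form), again with made-up samples:

    public final class Spread {
        static double standardDeviation(double[] samples) {
            double sum = 0;
            for (double s : samples) sum += s;
            double mean = sum / samples.length;
            double squaredDistance = 0;
            for (double s : samples) squaredDistance += (s - mean) * (s - mean);
            // Root of the average squared distance from the mean.
            return Math.sqrt(squaredDistance / samples.length);
        }

        public static void main(String[] args) {
            System.out.println(standardDeviation(new double[]{100, 105, 95, 110, 90}));   // ~7.1 -> concentrated
            System.out.println(standardDeviation(new double[]{100, 400, 20, 900, 1500})); // ~552 -> highly variable
        }
    }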

Tools of the Trade

Performance analysis and optimization are a blend of art and science. The art comes from optimizing the code; the science is how we find the trouble spots. In the traditional fashion of science, we want to create a repeatable test. We will run this test frequently, making small tweaks to the code under test, and see how they affect performance. The key principles of a good test suite are isolation and consistency: given the same code base, you expect performance results to be comparable between runs.

Now that we know this, and we have our math refreshed, let's talk about some tools we use when evaluating performance.

Gatling

Gatling is one of our favorite tools for performance analysis. It is a high-performance load testing tool with very pretty (read: management friendly) data visualizations, and it is especially good for testing REST-based services. A typical report summarizes request counts and response-time statistics per request, with additional graphs and more fine-grained reporting available.

Gatling can also be set up to run as part of your CI server! It has a simple command-line interface, and the project can be Maven-ized with its Maven plugin. A continuous performance test of your application in a CI environment is fantastic for Continuous Delivery.
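
If you haven't seen a simulation before, here is a minimal sketch using Gatling's Java DSL (newer Gatling versions ship one; the base URL, endpoint, and load shape below are placeholders of ours):

    import static io.gatling.javaapi.core.CoreDsl.*;
    import static io.gatling.javaapi.http.HttpDsl.*;

    import io.gatling.javaapi.core.ScenarioBuilder;
    import io.gatling.javaapi.core.Simulation;
    import io.gatling.javaapi.http.HttpProtocolBuilder;

    public class OrdersSimulation extends Simulation {

        HttpProtocolBuilder httpProtocol = http.baseUrl("http://localhost:8080");

        // One scenario: hit a single REST endpoint and assert a 200 response.
        ScenarioBuilder scn = scenario("Get orders")
                .exec(http("get orders").get("/api/orders").check(status().is(200)));

        {
            // Inject 100 users, spread evenly over 30 seconds.
            setUp(scn.injectOpen(rampUsers(100).during(30))).protocols(httpProtocol);
        }
    }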

We use this largely as a replacement for JMeter. JMeter is significantly more difficult to get started with, and it's awkward to use without the GUI. Gatling is much better suited to headless operation.

Yammer Metrics

Ok, you have measured with Gatling and you've found something you don't like. How do we take it down to the next level? Yammer Metrics is our best friend for JVM-based applications. This is an AOP-based package that exposes several annotations. Our favorite annotation is @Timed, which combines a meter and a histogram. From the documentation on meters:

Meters measure the rate of the events in a few different ways. The mean rate is the average rate of events. It’s generally useful for trivia, but as it represents the total rate for your application’s entire lifetime (e.g., the total number of requests handled, divided by the number of seconds the process has been running), it doesn’t offer a sense of recency. Luckily, meters also record three different exponentially-weighted moving average rates: the 1-, 5-, and 15-minute moving averages.

From the documentation on histograms:

Histogram metrics allow you to measure not just easy things like the min, mean, max, and standard deviation of values, but also quantiles like the median or 95th percentile.

Traditionally, the way the median (or any other quantile) is calculated is to take the entire data set, sort it, and take the value in the middle (or 1% from the end, for the 99th percentile). This works for small data sets, or batch processing systems, but not for high-throughput, low-latency services.

The solution for this is to sample the data as it goes through. By maintaining a small, manageable reservoir which is statistically representative of the data stream as a whole, we can quickly and easily calculate quantiles which are valid approximations of the actual quantiles. This technique is called reservoir sampling.
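
To make that concrete, here is a rough sketch of the classic "Algorithm R" flavor of reservoir sampling; Yammer/Dropwizard Metrics itself uses more elaborate reservoirs (e.g., exponentially decaying ones), so treat this only as an illustration of the idea:

    import java.util.Random;

    // Keep a fixed-size, statistically representative sample of an unbounded stream of measurements.
    public final class Reservoir {
        private final long[] values;
        private long count = 0;
        private final Random random = new Random();

        Reservoir(int size) {
            this.values = new long[size];
        }

        void update(long value) {
            count++;
            if (count <= values.length) {
                values[(int) count - 1] = value;                    // fill the reservoir first
            } else {
                long index = (long) (random.nextDouble() * count);  // 0 .. count-1
                if (index < values.length) {
                    values[(int) index] = value;                    // replace with decreasing probability
                }
            }
        }
        // Quantiles are then computed over the small values array instead of the full stream.
    }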

The best part about these annotations is that they can be in your app at all times. There are multiple ways to get the data out: the normal favorites (JMX, Servlet) are there, and it can also stream data to Ganglia or Graphite to fit into an existing performance monitoring package.
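
As a sketch of what the instrumentation looks like in code (the class and method are ours; the AOP wiring, such as metrics-guice or metrics-spring, is omitted, and later releases moved the annotations to the com.codahale.metrics packages):

    import com.yammer.metrics.annotation.Timed;

    public class OrderResource {

        // @Timed records both a rate (calls per second) and a latency histogram for this method.
        @Timed
        public String getOrder(String id) {
            return "order-" + id; // stand-in for the real lookup
        }
    }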

YourKit

*Note: YourKit is paid, commercial software. While we generally try to recommend free/libre software, this tool is so much better than the OSS alternatives that we believe it is worth the money.

YourKit is a Java profiler. It runs a small agent in your VM, to which a GUI tool connects. The profiler incurs no performance penalty when it is not actively profiling the application, so it is safe to leave the agent in place in production.

The profiler has two ways of measuring: sampling and tracing. With sampling, the tool periodically grabs a snapshot of every thread's stack trace and infers where time is being spent. This is a very low-overhead way of gathering data.

Tracing is significantly more expensive. It will slow your VM by an order of magnitude, but in exchange it records exact invocation counts and timings for every method. This is where you want to go for a deep dive on a method.

So what do you get for all of this? There's actually too much to cover in this post. You get memory telemetry: how quickly objects are created and destroyed, what types of objects they are, and where they're created from. You get CPU telemetry, including charts of wall time, your process's CPU time, and time spent in GC, all graphed in real time. This also includes call stack telemetry: it follows the path from the beginning of an action all the way down to each individual method, counting invocations and tracking the rate and latency of each. This is a fantastic way to identify which calls are taking the longest. In fact, let's look at a screenshot of this method telemetry:

This is clearly a Spring/Hibernate app interacting with a database. You can see the invocation count, but you can also see the percentages, which represent the proportion of total time that each method has taken. Luckily, YourKit also abstracts this into another pane, the Hotspot Identifier. This is basically a no-brainer way of determining which methods are taking the most time, which, as we discussed earlier in the article, gives us the best bang for our optimizing buck.

The Observer Effect

This is the principle in physics that says you cannot measure a system without changing it. It is also true of all three tools above.

Gatling can consume a lot of resources. It's a bad idea to have both Gatling and the application under test on the same machine. It's also a bad idea to have Gatling on a machine that experiences variable load, as it can throw off your numbers. Ideally, you have a dedicated load testing machine to drive load.

Yammer Metrics' overhead has been quite minimal in our experience, but it does incur some. We happen to think the small overhead is worth the insight it gives us, but that's ultimately a decision for your engineering team.

YourKit hooks right into the VM. It incurs the overhead of measuring and monitoring all of that data and reporting it back to the tool. While it is mostly useful for tracking down a known performance issue, it can be fun to point it at your application to see if there are any easy performance wins.

Tying it All Together

We now have three tools, and they work together for us. Gatling gives us a repeatable test runner: it can generate load on demand at any time and gives us a high-level picture of the end-user's experience. Yammer Metrics keeps real-time metrics on all instrumented methods and exposes them through a consumable interface; you can even hook up your favorite monitoring system to alert if performance drops. Finally, YourKit gives you the deep insight needed to identify the trouble spots in your code base. Once you make a change that looks better in YourKit, you can test it with Gatling to confirm that it improves the end-user experience, and use Yammer Metrics to confirm the ongoing performance.

Now that we have a platform for measuring speed issues, the next step is identifying and fixing the issues, which will be the subject of the next post in this series.