We are going to embark on a multi-part series to discuss how to measure, analyze, diagnose, and ultimately solve performance problems. First, however, we must define what we mean by a "performance problem." There are three major items we talk about when we discuss performance. However, as we all know, that's only part of the picture.
Latency and Throughput
When we think about performance problems, latency and throughput are the two measures that are immediately apparent.
Latency is the time it takes to complete a single operation. Simply put, it's the time taken for some action to complete. Gamers often see this represented as their "ping time." When a webpage is slow to load, that's high latency to the user. A new phone feels faster because the UI is more responsive -- the latency of operations is lower.
Throughput is the number of work items that can be completed in a unit of time. A work item can be anything -- production of a motorcycle, completion of a REST call, or the download of a packet of data. In the world we are talking about, the most common measure we see is requests per second. Essentially, "how many users can my site support at once?"
While the measures above are separate in their own way, they are often interwined quite tightly. Imagine a doctor's office, where each visit takes 15 minutes. If there is one doctor, and nobody in the waiting room, you will be seen immediately. If there are 4 people in front of you, you will be seen in an hour. Your perceieved latency is 1 hour. The throughput is 4 per hour. If the office adds another doctor, your perceieved latency is 30 minutes, and the office's throughput is 8 per hour. By increasing the throughput, we have decreased the latency.
We just learned that by increasing throughput, we can decrease latency. However, that's not quite enough to describe performance problems. The latency above was a fixed number -- 15 minutes. What if the latency was more variable? Some patients may take 5 minutes, other patients may take 2 hours. Additionally, imagine this office has 40 people in it. Your ability to see a doctor is delayed because of the number of people in front of you. This is refered to as congestion (or, if you want to be more computer science-y, queueing delay).
All Together Now
Let's bring this back to what most of us do: serve webpages. Let's start very simple, with a static HTML page and a server that can handle one request at a time. A single-threaded server handles a request, and returns a page. With little traffic, the latency is quite predictable. When we get more traffic, we start to experience congestion. The average latency to the user increases, and they become unhappy.
To combat that increase in traffic, we add another thread. We can now handle two requests at a time. With low enough traffic, requests are served immediately. However, the throughput increase is not exactly doubled. There is some overhead to managing threads, dispatching requests, context swaps by the CPU, number of cores, and more. For each increase in parallelization, we get a diminished return in throughput and latency. In fact, it's possible to create so many threads that the system is overwhelmed managing them all, and both throughput and latency are increased!
This can be further complicated, of course. Imagine a shared resource, like a database connection. In order for the software to work correctly, only one thread at a time is allowed to query and receive data from the database. This introduces yet another possible source of congestion, even if there is no congestion connecting to the web server. These shared resources must be carefully balanced with the rest of the system to optimize performance.
Tuning a system for maximum performance is the art of optimizing all three of these parameters with regards to your system. Maybe you don't care about congestion, just that requests are served a quickly as possible once processing starts. Maybe you don't care about latency, and just want raw throughput. Maybe you want no congestion whatsoever.
Future articles in this series will talk about how we measure, identify, analyze, and ultimately solve situations that cause poor performance.