Performance Toolbox: Measuring API latency

In this post, I'll explore what it is to measure, how to measure, and some things I keep in mind when measuring API (application programming interface) latency, specifically in the domain of APIs served over the HTTP protocol.

Disclosing OpenAI's ChatGPT

It's December 2022 and the internet is awash with the potential of OpenAI's ChatGPT. Coincidentally, I'm also writing one of my more technical posts for the first time in a while. As a part of this post, I'd like to join the conversation of what it means to create content, ethically, in what is seemingly the dawn of a new digital age. I believe prompts could serve as citations; however, since ChatGPT will change, there's no guarantee that it will respond the same way given the same prompt. On the other hand, the internet as a whole isn't static, so maybe future citations will look something like Prompt: Help me understand XYZ. [Prompted: Dec. 6th, 2022], similar to how IEEE citations use [Accessed: Dec. 6th, 2022].

In places where I've used OpenAI's ChatGPT, I'll disclose it. Throughout the post, that will look like this:

Nick: Hey ChatGPT, I'm going to write a blog post where I credit you when your responses are used. Say "Thanks" in a minimal fashion

ChatGPT: Thanks.

If ChatGPT gets too wordy, I'll use [REDACTED] where necessary.

Context, why, and errata

This post and its follow-ups are an example of the experience that comes along with human trial and error, which over time has become a sort of guide for how I tackle performance problems. I don't claim this to be exhaustive or absolute.

The motivations for writing this and the subsequent posts are:

  1. I'm writing for my future self who will refer back here.

  2. There's a lack of literature on the topic.

  3. I still have more to learn.

  4. I'd like the internet to be faster. If I can contribute to that in any way, that's meaningful to me.

At the moment, I don't have plans to open source my blog, so if you find errors or issues, please create issues using this GitHub repository: https://github.com/olingern/blog-issues

Measuring

There's an old question1 dating back to 1883 that gets at the heart of what it means to observe an event:

"If a tree falls in the forest, does it make a sound?"

Scientific American goes on to explain that there would be vibrations but no sound. It's a fascinating question because vibrations exist with or without human observation; it's our interpretation of those vibrations — as sound pressure level or frequency (the number of times per second that a sound pressure wave repeats itself)2 — that we measure in decibels and hertz.

There is a clear and important distinction between observing and measuring, though. To measure something, we need a mechanism — such as our ears or a ruler — to establish the measurement, whereas observation, in my mind, is employing the methods of measurement. Much has already been said about observability and I'm not sure what I can offer in this uncanny valley that exists between Site Reliability Engineering and Software Development — but what I can offer is a simple metaphor for how I think about it.

Observability, to me, is akin to periodically checking your speedometer as you travel down the highway. Speed on the highway is important to us because it's a function of safety for ourselves and others. Using a speedometer, you're able to check whether you're within the legal bounds of the speed limit. From time to time, you find yourself below or above the legal limit, but you bring yourself back within that acceptable range with the help of your speedometer. Now, imagine a world without speedometers. You no longer have a way to measure the miles or kilometers per hour you're traveling, thus the observability of the machine you're running is significantly blunted. Sadly, this is the state of many systems on the internet.

From here on out, I'll be focusing strictly on measuring API latency but measurement and observability go hand in hand.

When approaching performance problems, measurement is really the crux of tackling them. We could refer back to our previous question about a tree falling in the woods and ask, If a website makes an API call, is it slow? Obtuse, I know. Similar to how we can measure sound in decibels, we can (and typically do) measure latency in milliseconds, abbreviated ms, when talking about API performance. Every so often, you might see µs which stands for microseconds and the unfortunate might see seconds. It's unlikely you'll see an API respond in microseconds but you might see DNS lookups3 happen in that unit of time.

Just to clarify the relationship between microseconds, milliseconds, and seconds, here are a couple of conversions between the three:

1 second = 1,000 milliseconds = 1,000,000 (1e+6) microseconds
1 millisecond = 1,000 microseconds
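If it helps, here's that same relationship as a couple of throwaway JavaScript constants and helpers (the names are mine, purely illustrative):

// Unit relationships: 1 s = 1,000 ms, 1 ms = 1,000 µs
const MS_PER_SECOND = 1000;
const US_PER_MS = 1000;

const secondsToMs = (s) => s * MS_PER_SECOND;
const msToMicroseconds = (ms) => ms * US_PER_MS;

console.log(secondsToMs(1));         // 1000
console.log(msToMicroseconds(1000)); // 1000000 (1e+6)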

But, what is latency?

Nick: Hey ChatGPT, what is latency?

ChatGPT: Latency is the amount of time it takes for a data packet to travel from one point to another. In other words, it is the time delay between when a request is made and when a response is received. [REDACTED]. Latency is typically measured in milliseconds (ms) or microseconds (µs).

Maybe you're like me and played Starcraft in your teenage years. You could often blame your losses on latency:

SuperPwn: gg

Nick: gg, lag

In our case, though, we aren't concerned with the latency of things outside of our control. Sometimes, someone will experience high latency due to their hardware, such as someone calling your landline while you're connected to the internet via a 56K modem or — for the younger generation — traveling into an area where 4G and its successors are not supported. We're only concerned with how fast our application can respond, not with how fast the clients can receive.

Baselines

This is your starting point. You could also think of this as beginning to train for running a mile. You establish that you can run a mile in 9:00 minutes on an indoor treadmill (a controlled environment) and you would like to improve to 8:00 minutes, which would be an ~11.11% improvement: (9 × 60 − 8 × 60) / (9 × 60) × 100 ≈ 11.11%.
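Here's that same arithmetic as a tiny JavaScript sketch, just to make the baseline-versus-target relationship explicit:

// Baseline and target mile times, in seconds
const baselineSec = 9 * 60; // 540
const targetSec = 8 * 60;   // 480

// Percent improvement, relative to the baseline
const improvement = ((baselineSec - targetSec) / baselineSec) * 100;
console.log(improvement.toFixed(2)); // "11.11"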

If you're just embarking on the journey of learning about how to measure APIs, it's tempting to open Chrome's network tab, hit refresh ten times on a page that calls the API you want to measure, take that average, and then say, "Great, done." That doesn't take into account all the different conditions that can impact users. For you, the endpoint responds in less than 50ms, but for your friend, the endpoint responds in 1-2 seconds. This is because their cat, Fluffy, lay on their keyboard, held down "A" for 30 seconds, and somehow clicked "save" before being shooed away — so the record your friend fetches now carries a much larger JSON payload, and that payload size causes your friend to experience higher latency than you do.

So, we need to find our starting point but we also need to flesh out how we talk about our starting point before defining our baselines.

Percentiles

Let's say we had access to all of our response times in milliseconds and they looked something like the below array. In real-world applications, we'll need much more volume for proper measurement, but for demonstration / educational purposes a small set is much easier to reason about and work with.

const responseTimesInMs = [45, 20, 15, 19, 2300, 20, 42, 33, 26, 25];

Before we continue on, make a mental note about what is or is not interesting about these response times. What is the user experience like? Is it consistent?

With very little knowledge, we could generalize that these response times are mostly consistent with one exception: 2300ms.

We also could calculate the average for this set of response times:

const avg =
  responseTimesInMs.reduce((prev, curr) => prev + curr, 0) /
  responseTimesInMs.length;

console.log(avg); // 254.5

Folks who remember their statistics will quickly spot that 2300ms is an outlier and should be removed to get an average of 27.2ms. This is not wrong, but bear with me for a bit and I promise that I'll make my case.
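For completeness, here's a small sketch of that outlier-free average (the cutoff is chosen by eye here, purely for illustration):

// Same sample response times from above
const responseTimesInMs = [45, 20, 15, 19, 2300, 20, 42, 33, 26, 25];

// Drop the obvious outlier and recompute the average
const withoutOutlier = responseTimesInMs.filter((ms) => ms < 2300);
const trimmedAvg =
  withoutOutlier.reduce((prev, curr) => prev + curr, 0) / withoutOutlier.length;

console.log(trimmedAvg.toFixed(1)); // "27.2"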

Another way to look at this data is in percentiles. Let's ask ChatGPT about percentiles.

Nick: Hey ChatGPT, what is a percentile in a dataset?

ChatGPT: A percentile in a dataset is a value that separates a certain percentage of the values in the dataset from the remaining values. For example, the 50th percentile (also known as the median) is the value that separates the lower half of the dataset from the upper half of the dataset. The 75th percentile is the value that separates the lower 75% of the dataset from the upper 25% of the dataset.

Nick: How do I calculate a percentile?

ChatGPT: To calculate the percentile of a dataset, you can first rank the data in ascending order. Then, you can use the following formula to calculate the percentile:

To summarize a percentile calculation:

  1. Sort the values from least to greatest.
  2. Calculate the index for the percentile you're interested in.
  3. Find the value in the sorted list at the calculated index.

There are varying ways to calculate the percentile index floating around the internet, and depending on how you ask ChatGPT, it will give you different formulas. For our purposes, we'll use this formula and pseudocode to calculate the index:

- idx is the desired percentile, divided by 100, multiplied by the array length
- if the result is a float, round to the nearest integer
- subtract 1 to convert to a zero-based array index

Let's put it into action.

// Step 1: sort the list from least to greatest
const responseTimesInMs = [45, 20, 15, 19, 2300, 20, 42, 33, 26, 25];
// Note: Array.prototype.sort sorts in place, so responseTimesInMs is now sorted too
const sorted = responseTimesInMs.sort((a, b) => a - b);
console.log(sorted); // [15, 19, 20, 20, 25, 26, 33, 42, 45, 2300]

// Step 2: Index calculation.
// Where p is the percentile. Shortened to keep within the codeblock.
const getIdx = (p) => Math.round((p / 100) * responseTimesInMs.length) - 1;

// Step 3: Define the percentiles we're interested in
const percentiles = [20, 40, 50, 60, 80, 90, 99];

// Loop over each percentile, calculate the index, and then print the value out
for (const p of percentiles) {
  const idx = getIdx(p);
  console.log(`Percentile: ${p} | responseTimesInMs[idx]: ${responseTimesInMs[idx]}`);
}

Our final output for our for loop:

Percentile: 20 | responseTimesInMs[idx]: 19
Percentile: 40 | responseTimesInMs[idx]: 20
Percentile: 50 | responseTimesInMs[idx]: 25
Percentile: 60 | responseTimesInMs[idx]: 26
Percentile: 80 | responseTimesInMs[idx]: 42
Percentile: 90 | responseTimesInMs[idx]: 45
Percentile: 99 | responseTimesInMs[idx]: 2300

Now, we have some data that we can talk about! What we can infer from this data is that 50% of our users will experience a response time of 25ms or less, 90% will experience 45ms or less, and 99% will experience 2300ms or less.

So, as the percentile increases, we're including more of our data, and since our response times are sorted, there's a monotonic (non-decreasing) relationship between an increase in our percentile and our response times.

Nomenclature Update

From here on out, I'll refer to percentiles using the shorthand pN, where p is pronounced as the letter itself and N is the percentile as an integer. So, the 50th percentile would be p50, pronounced "p fifty."

Meaningful percentiles

In the real world, p95s get thrown around the most, both for single API endpoints and for aggregates. By working with the p95, we're implicitly filtering out outliers in large data sets, like our earlier instance of 2300ms. You could imagine a large set of response times where ~1-2% of users experience extreme latencies. Optimizing for that slowest 5% brings diminishing returns, in both system and business impact, depending on your scale. If you're a tech giant, that slice can be worth hundreds of millions of dollars, but if you're a startup with 1,000 daily active users (DAUs) who on average generate ~8 API calls per session, 5% is only 400 of your 8,000 daily requests and likely not a significant source of user experience improvement or revenue optimization.
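To make that back-of-the-envelope arithmetic concrete (the numbers are the same hypothetical ones from above):

// Hypothetical startup from the example above
const dailyActiveUsers = 1000;
const apiCallsPerUser = 8;

const dailyRequests = dailyActiveUsers * apiCallsPerUser; // 8000
const slowestFivePercent = dailyRequests * 0.05;          // 400

console.log(dailyRequests, slowestFivePercent); // 8000 400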

p50s are also meaningful as they are synonymous with the median of our data, so p50s and p95s allow us to generalize in terms such as: About half of our users have an experience of X ms and nearly all of our users have an experience of Y ms.
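Here's a quick sketch of that framing, reusing the tiny sample and getIdx helper from earlier. Note how, with only ten samples, the p95 lands squarely on the 2300ms outlier — a hint at why the volume discussed later matters:

// Same sample, pre-sorted, and the same index helper as before
const responseTimesInMs = [45, 20, 15, 19, 2300, 20, 42, 33, 26, 25].sort((a, b) => a - b);
const getIdx = (p) => Math.round((p / 100) * responseTimesInMs.length) - 1;

const p50 = responseTimesInMs[getIdx(50)]; // 25
const p95 = responseTimesInMs[getIdx(95)]; // 2300 — tiny samples let outliers dominate high percentiles

console.log(`About half of our users have an experience of ${p50}ms or less.`);
console.log(`Nearly all of our users have an experience of ${p95}ms or less.`);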

Hands on

At this point, we have enough to put percentiles into practice.

Establishing a mock API

The mock API will use plain old JavaScript and Node 14.x, but the current LTS should always work. If you don't have Node installed, nvm is a great piece of software for installing and managing Node.js versions.

I'll be using OpenAI's ChatGPT to generate a Node.js mock server with the prompt:

Write a Node.js application with one endpoint /measure that produces a JSON 200 but with a random delay between 0 and 3 seconds

Set up the project

mkdir measure && cd measure
npm init -y && npm install express

Let's create a file, index.js, and add the following code:

const express = require("express");
const app = express();

const getRandomFloat = (min, max) => Math.random() * (max - min) + min;

app.get("/measure", (req, res) => {
  // artificially delay the response with a random delay between 0 and 10 seconds
  setTimeout(() => {
    res.status(200).send("OK");
  }, getRandomFloat(0, 10) * 1000);
});

app.listen(8080, () => {
  console.log("Server listening on port 8080");
});

Tools for measuring

In a follow-up post, I'll cover a handful of tools that I've used over the years, but for now we'll use a simple CLI tool written in Rust, oha おはよう, because it requires little setup and gives us access to latency percentiles. More complex scenarios that require capturing a value and then using it in subsequent requests will require different tooling.

Oha installation instructions.

Booting up our mock API

$ node index.js
Server listening on port 8080

Sending 100 requests to our mock API

$ oha -n 100 http://localhost:8080/measure

Output

Oha's report is quite nice and provides a summary, a histogram, a latency distribution, and even some DNS lookup metrics. Pretty awesome. For our purposes, we'll only look at the latency distribution.

In our example, we artificially delayed our responses using a random float between 0 and 10 to get a spread of latencies between 0 and 10 seconds. After running Oha against our /measure endpoint, we see that randomness summarized into seven latency percentiles, ranging from p10 to p99.

Latency distribution:
  10% in 1.0933 secs
  25% in 2.4909 secs
  50% in 5.2222 secs
  75% in 8.0483 secs
  90% in 8.9538 secs
  95% in 9.3190 secs
  99% in 9.9771 secs

Methodology

Maybe the most important aspect of measuring is consistency. This includes controlling for time, origin, and environment (not environment variables, but dev, staging, etc.). We also need volume to get a zoomed-out view of our performance.

Time

Because web traffic isn't constant and many different scenarios can play out over a 24-hour and week-long span, it's important to be consistent. For example, because of the minimal traffic between 2 and 5 a.m., you hit frequent cold starts (when a serverless function doesn't have an instance ready and must start up to serve the request). When measuring hour over hour, you see no problems, yet they still exist over a 24-hour span. If you make a change and see a significant drop in average response times and p95s, those can be indicators that the change is meaningful, but collecting the same data over 24 hours and one week would significantly raise confidence in the change.

Origin

In most cases, collecting data from a cloud provider's metrics or a monitoring service like Datadog will be the method most folks use to establish and compare their metrics. But it must be said that you shouldn't compare local latencies — like those we obtained earlier with Oha — to latencies collected in the cloud.

Environment

Lastly, similar to where we measure from, we also want to be consistent in what we're measuring. Our local environment will perform differently than a production environment, and production traffic can also make our application behave differently. It's always a good idea to push your change to a pre-production environment, measure it there, and then release and remeasure it in production.

Volume

Let's ask ChatGPT about this to see if it can be of help:

Nick: Talk about why volume in collecting API latency data is important for get a zoomed out view succinctly

ChatGPT: Collecting a large volume of API latency data is important because it allows for a more complete and accurate view of API performance, enables identification of trends and patterns, and helps identify outliers and anomalies. This helps make informed decisions about optimizing performance and addressing potential problems.

Have you ever been on an airplane and had an epiphany about Earth's topography or a city's skyline? I know that I've marveled more than once at mountain ranges that I've taken for granted on foot, or even New York City's skyscrapers that I set my sights on daily for years. Collecting data is much like the topography of Earth in that as certain clusters or anomalies begin to amass, they become more and more evident, but when viewed in isolation, they're easy to write off.

How much volume is enough? Well, it depends. If you're a startup, and you only have a handful of users but those users are complaining of performance, you might have to synthetically generate more load with a tool like k6 to get a better view into performance issues. This is not unlike finding a puncture in a bicycle's inner tube: you add pressure until the leak reveals itself.
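As a rough illustration, here's a minimal k6 script for our mock /measure endpoint — a sketch, assuming the mock server from earlier is running locally on port 8080; the virtual-user count and duration are arbitrary and should be tuned to your situation:

import http from "k6/http";

// 50 virtual users hitting the endpoint for one minute
export const options = {
  vus: 50,
  duration: "1m",
};

export default function () {
  http.get("http://localhost:8080/measure");
}

Running it with k6 run script.js prints an end-of-test summary that includes response-time percentiles (e.g. p(90), p(95)) for http_req_duration, similar in spirit to Oha's latency distribution.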

Wrapping up

This post focused on what it is to measure API latency, some of the nomenclature, and ideologies around measuring. Now that we have the backbone for measurement in place, we can explore making changes in measurable ways in upcoming posts.

Thanks for reading!

Footnotes

  1. If a tree falls in a forest. Retrieved from https://en.wikipedia.org/wiki/If_a_tree_falls_in_a_forest

  2. NPS. National Parks Service: Understanding Sound. Retrieved from https://www.nps.gov/subjects/sound/understandingsound.htm.

  3. DNS Lookup. Retrieved from https://www.techopedia.com/definition/29029/dns-lookup

© Nick Olinger 2023