Better Metrics for Build Performance Measurement

During a recent architecture refactor I was making frequent, large-scale code changes, and the Android build had become unbearably slow. Back when I was on an Intel MacBook Pro, a full build took about 40 minutes. After a deep dive, I found the real culprit wasn't the project itself -- it was the security software, which made a fully specced MacBook Pro perform like a MacBook Air. Then the Apple M1 came along and build speed improved by an order of magnitude. Lately, though, builds have felt noticeably slow again. I was puzzled -- am I really the only one who thinks so?

User Research

I'd previously hit a Gradle cache issue that doubled build times -- clearing the cache fixed it. But this time, clearing the cache changed nothing. A full build still took around 20 minutes. With only 8 hours in a workday, that's enough for just a handful of full builds. I'm not one to slack off, so I surveyed a few colleagues. Everyone agreed it was slow -- but tolerable. Why? Because it used to be 40 minutes, and 20 is already twice as fast! No comparison, no pain -- your outlook depends entirely on your frame of reference.

Initial Investigation

The slow builds weren't isolated to me, but I needed real data. Using git commit history, I estimated that each engineer spent roughly 1 hour per day on builds. The estimation method:

Total build time = full build time + incremental build time
Full build time = number of full builds * time per full build
Incremental build time = number of incremental builds * time per incremental build

Key data points:

Full build frequency

We can't directly count full builds, but we can infer the number. Full builds are triggered when:
- First build of the day: Gradle's dependency resolution cache defaults to a 24-hour cycle, so there's at least 1 full build per day
- Modifying shared modules: This forces nearly all modules to recompile. Git log easily reveals the frequency of shared code changes -- roughly 0.2 times per person per day

Incremental build frequency

- Assuming at least one build before each commit, git log also gives us the incremental build count -- roughly 10 per person per day
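One way to pull these frequencies out of git history is sketched below. It is only an illustration: the function names, the `author|date` line format, and the `workdays` parameter are assumptions of this sketch, not the exact procedure used for the figures above. The rate is total commits divided by (distinct authors x workdays), so `workdays` has to be supplied by whoever runs it.

```python
import subprocess

def commits_per_person_per_day(log_lines, workdays):
    """Average commits per author per workday.

    log_lines: iterable of 'author|YYYY-MM-DD' strings, one per commit.
    workdays:  number of working days covered by the log (an input assumption).
    """
    authors = set()
    commits = 0
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last '|' so author names containing '|' still parse.
        author, _day = line.rsplit("|", 1)
        authors.add(author)
        commits += 1
    if not authors or workdays <= 0:
        return 0.0
    return commits / (len(authors) * workdays)

def read_git_log(repo_path, pathspec=None):
    """Return one 'author|date' line per commit, optionally limited to a path."""
    cmd = ["git", "-C", repo_path, "log", "--pretty=format:%an|%ad", "--date=short"]
    if pathspec:
        cmd += ["--", pathspec]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

# Hypothetical usage (paths are placeholders):
#   all_commits = read_git_log("/path/to/repo")
#   shared_only = read_git_log("/path/to/repo", pathspec="shared/")
#   commits_per_person_per_day(all_commits, workdays=10)
```

Passing a pathspec for the shared modules gives a rough count of changes that trigger full builds, and the unrestricted log approximates the incremental build count.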

Plugging in figures from my own experience (1 first build of the day + 0.2 shared-module changes = 1.2 full builds per day):

Average full build time per person = 1.2 builds/day * 20 min = 24 min/day
Average incremental build time per person = 10 builds/day * 3 min = 30 min/day
Average total build time per person = 54 min/day
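The same arithmetic as a small script, in case you want to plug in your own numbers. The function name is made up for this sketch. The inputs are the estimates quoted above.

```python
def estimate_daily_build_minutes(full_builds_per_day, full_build_min,
                                 incremental_builds_per_day, incremental_build_min):
    """Return (full, incremental, total) build minutes per person per day."""
    full = full_builds_per_day * full_build_min
    incremental = incremental_builds_per_day * incremental_build_min
    return full, incremental, full + incremental

# 1 first build of the day + 0.2 shared-module changes = 1.2 full builds/day
full, incremental, total = estimate_daily_build_minutes(1 + 0.2, 20, 10, 3)
print(f"full={full:.0f} min, incremental={incremental:.0f} min, total={total:.0f} min")
```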

That looked pretty serious. I asked a colleague to collect actual build performance data from development environments. After about two weeks, the conclusion was:

Average build time is about 3.5 minutes, and average daily time spent on builds is about 35 minutes per person. Doesn't seem too bad.

What?! Why was this so different from my estimate?

Better Metrics

Based on the git log data, incremental builds are far more frequent than full builds. If you take the arithmetic mean, the extreme full-build values get completely averaged out by the incremental builds. So how do we find the real problem?
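To see the averaging effect with the estimated figures from earlier, here is a small sketch. It assumes a five-day week with 1.2 full builds (20 min each) and 10 incremental builds (3 min each) per day. The durations are the rough estimates from the text, not measured data.

```python
from statistics import mean, median

FULL_MIN, INCREMENTAL_MIN = 20, 3

# Five workdays: 5 * 1.2 = 6 full builds, 5 * 10 = 50 incremental builds.
week = [FULL_MIN] * 6 + [INCREMENTAL_MIN] * 50

per_build_mean = mean(week)
per_build_median = median(week)
slowest = max(week)

print(f"mean per build:   {per_build_mean:.1f} min")
print(f"median per build: {per_build_median:.1f} min")
print(f"slowest build:    {slowest} min")
```

The per-build mean lands close to the incremental build time, even though every full build in the sample takes 20 minutes.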

Forget the average!

What we should care about is "how much time each person spends on builds per day," not "how long a single build takes." So what's wrong with the 35-minute-per-day average? It's an arithmetic mean across everyone, and it hides the differences between people and between machines. For engineers with good hardware, builds genuinely aren't a problem. But machine specs vary widely -- some people are still on Intel MacBook Pros because of when they joined, while others have M1s. Even M1s come in different core counts -- 10-core, 12-core, etc. How do we surface these differences in the data?

Histogram

From the raw build performance data, group by username, sum per day, and you get each engineer's daily build time. Then take the P90 of each engineer's daily build time and create a histogram with 30-minute buckets:
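The steps above can be sketched roughly as follows. The record layout `(username, day, minutes)` and the function names are assumptions of this sketch. The percentile uses the nearest-rank definition, which is only one of several common conventions and may differ from what a given tool computes.

```python
import math
from collections import Counter, defaultdict

def daily_totals(records):
    """Group by (user, day) and sum minutes. Returns {user: [daily totals]}."""
    totals = defaultdict(float)
    for user, day, minutes in records:
        totals[(user, day)] += minutes
    per_user = defaultdict(list)
    for (user, _day), minutes in totals.items():
        per_user[user].append(minutes)
    return per_user

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def p90_histogram(records, bucket_minutes=30):
    """Count engineers per bucket, keyed by the bucket's lower bound in minutes."""
    hist = Counter()
    for _user, days in daily_totals(records).items():
        p90 = percentile(days, 90)
        bucket = int(p90 // bucket_minutes) * bucket_minutes
        hist[bucket] += 1
    return dict(sorted(hist.items()))

# Hypothetical sample records: (username, day, build minutes)
sample = [
    ("alice", "2024-01-01", 10), ("alice", "2024-01-01", 25),
    ("alice", "2024-01-02", 50),
    ("bob", "2024-01-01", 130),
]
print(p90_histogram(sample))
```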

[Figure: Histogram of Build Performance Per Person]

The chart shows that nearly half of all engineers spend over 1 hour per day on builds, with some reaching as high as 4 hours. We also notice two points on the far right (x={20, 31}) that are outliers. How do we remove them?

Tail-Trimmed Histogram

In statistics, for data with "long tails" or extreme values, trimming and Winsorizing are common noise-removal techniques.
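For readers who haven't met the two terms: trimming drops the extreme values, while Winsorizing keeps them but clamps them to a cutoff. The sketch below shows a simplified, upper-tail-only version of each (textbook versions are often two-sided). The function names and the nearest-rank cutoff are choices made for this illustration.

```python
import math

def upper_cutoff(values, pct):
    """Nearest-rank percentile of a non-empty list, used as the cutoff."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def trim_upper(values, pct):
    """Trimming: drop every value above the cutoff."""
    cutoff = upper_cutoff(values, pct)
    return [v for v in values if v <= cutoff]

def winsorize_upper(values, pct):
    """Winsorizing: keep every value, but clamp those above the cutoff."""
    cutoff = upper_cutoff(values, pct)
    return [min(v, cutoff) for v in values]

# Hypothetical build durations in minutes, with one suspended build (600).
durations = [3, 4, 3, 5, 4, 3, 4, 5, 3, 600]
print(trim_upper(durations, 90))
print(winsorize_upper(durations, 90))
```

Trimming shortens the list, while Winsorizing preserves the sample size, which matters if later steps count builds.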

The reason for those two outliers at the histogram's tail: the laptop lid was closed during a build, suspending the process. We can use trimming to remove them:

[Figure: Trimmed Histogram of Build Performance Per Person]

How exactly is the tail trimmed? My method: first build a histogram of the raw per-build durations using one-minute buckets:

[Figure: Original Build Performance Histogram]

Then use cumulative frequency to find the P99.8 bucket and truncate everything beyond it:

[Figure: Trimmed Build Performance Histogram]

Remove the truncated build records (noise) from the original data, and you get the tail-trimmed histogram above.
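A minimal sketch of that truncation, assuming per-build durations in minutes and one-minute buckets. The function names are made up here, and how ties at the boundary are handled is a choice of this sketch rather than something the method prescribes.

```python
from collections import Counter

def percentile_bucket(durations_min, pct=99.8):
    """First one-minute bucket where cumulative frequency reaches pct percent."""
    hist = Counter(int(d) for d in durations_min)
    total = sum(hist.values())
    cumulative = 0
    for bucket in sorted(hist):
        cumulative += hist[bucket]
        if cumulative * 100 >= pct * total:
            return bucket
    raise ValueError("no build records")

def trim_tail(durations_min, pct=99.8):
    """Drop build records that fall in buckets beyond the pct bucket."""
    cutoff = percentile_bucket(durations_min, pct)
    return [d for d in durations_min if int(d) <= cutoff]

# Hypothetical data: mostly 3-minute builds, a few full builds,
# and two builds that were suspended for hours.
builds = [3.2] * 1995 + [20.5] * 3 + [900.0] * 2
trimmed = trim_tail(builds)
print(len(builds), "->", len(trimmed), "records; max is now", max(trimmed))
```

The per-person histogram is then rebuilt from the trimmed records, not from the trimmed buckets, so each engineer's daily totals no longer include the suspended builds.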

The Real Problem

From the tail-trimmed histogram, the picture is clear:

14.29% of engineers spend at least 2 hours per day on builds
42.86% of engineers spend at least 1 hour per day on builds

This conclusion is far more consistent with my actual experience than "average daily build time is about 35 minutes."
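Percentages like these can be read straight off the per-engineer P90 values. In the sketch below, the seven engineers and their numbers are hypothetical, chosen only so the output matches the shares quoted above; the actual team size and values are not given here.

```python
def share_at_least(p90_minutes_by_engineer, threshold_min):
    """Percentage of engineers whose P90 daily build time is >= threshold_min."""
    values = list(p90_minutes_by_engineer.values())
    if not values:
        return 0.0
    return 100 * sum(v >= threshold_min for v in values) / len(values)

# Hypothetical P90 daily build minutes per engineer (not the real data).
p90_by_engineer = {
    "eng1": 20, "eng2": 35, "eng3": 45, "eng4": 50,
    "eng5": 70, "eng6": 95, "eng7": 130,
}

for hours in (2, 1):
    share = share_at_least(p90_by_engineer, hours * 60)
    print(f">= {hours} h/day: {share:.2f}% of engineers")
```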
