Two weeks ago, I promised Booster users that 3.0.0 would ship by the end of October. Integration tests were actually done by mid-October, but then I rewrote the integration test framework testkit-gradle-plugin from scratch (the third rewrite). On top of that, Travis-CI was migrating from https://travis-ci.org to https://travis-ci.com, and Booster was still on the old domain, causing long CI queue times. Combined with integration tests running too long (over 50 minutes), tasks kept getting killed by Travis-CI. So it was not until the last day of October that v3.0.0-alpha-3 was released. In fact, even when publishing the alpha, the timeout issue was still unresolved – I had to temporarily remove integration tests from CI. But I already had a solution in mind.

CI Timeout

Travis-CI has a build timeout policy. Builds are forcibly terminated in these cases:

  • No log output for 10 minutes
  • Public repo builds exceeding 50 minutes
  • Private repo builds exceeding 120 minutes

Why do Booster's integration tests take so long? That traces back to Booster's compatibility strategy.

AGP Version Compatibility

To make Booster run stably on every Android Gradle Plugin (AGP) version from 3.0.0 onward, Booster has been adapted to each minor version of AGP. There are currently 7 of them:

  • 3.0.0
  • 3.2.0
  • 3.3.0
  • 3.5.0
  • 3.6.0
  • 4.0.0
  • 4.1.0

Each version has 30 or more APIs to test for compatibility, and each test case covers both App and Library projects. That makes at least 7 * 30 * 2 = 420 test cases. Travis-CI's VM has a dual-core CPU and 7.5GB of RAM, so performance is predictably poor. Even on my MacBook Pro (i7 8-core, 16GB RAM), it takes nearly 40 minutes to run everything, an average of roughly 5 seconds per test case, without parallel build enabled.

Due to Travis-CI's 50-minute timeout, I tried enabling Gradle's parallel build:

org.gradle.parallel=true

I expected some speedup, but the result shocked me. What previously took 5 seconds per test in serial mode now took nearly 20 seconds in parallel. Completely counterintuitive. As the saying goes: when things defy reason, something fishy is going on.

Why So Slow?

As mentioned in the Gradle OOM article, running Gradle tests uses Gradle TestKit, which spins up a Gradle Runner for each test case. Each test case is a Gradle Android project, and there are two types:

  • Android App
  • Android Library

So there are 420 Android projects to build. With a parallelism of 7 (7 Android projects building simultaneously), that means 60 rounds. At 5 seconds per round of 7 test cases, it should take only about 5 minutes. Even doubled, it should be 10 minutes. So why was the actual result so far off?
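The back-of-the-envelope estimate above can be reproduced with a few lines of shell arithmetic. This is only a sketch: it uses the round figure of 30 APIs per version (the text says "30+") and assumes every test case takes the same 5 seconds, which real builds will not.

```shell
#!/bin/sh
# Figures taken from the text; 30 is a round stand-in for "30+ APIs".
agp_versions=7
apis_per_version=30
project_types=2        # App and Library
parallelism=7          # one build per AGP/Gradle version at a time
seconds_per_case=5

total_cases=$((agp_versions * apis_per_version * project_types))
rounds=$((total_cases / parallelism))
serial_minutes=$((total_cases * seconds_per_case / 60))
parallel_minutes=$((rounds * seconds_per_case / 60))

echo "test cases: $total_cases"
echo "rounds at parallelism $parallelism: $rounds"
echo "serial estimate: $serial_minutes min"
echo "ideal parallel estimate: $parallel_minutes min"
```

The serial estimate lands close to the "nearly 40 minutes" measured locally, so the 5-minute ideal is a fair baseline for judging how badly the parallel run misbehaved.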

Cross-Process File Locking

Could Gradle be doing cross-process synchronization? Each of the 7 Android Gradle Plugin versions is tested against its own Gradle version, so the integration tests use 7 Gradle versions:

Android Gradle Plugin    Gradle
3.0.0                    4.1
3.2.0                    4.6
3.3.0                    4.10.1
3.5.0                    5.4.1
3.6.0                    5.6.4
4.0.0                    6.2
4.1.0                    6.5

With parallel execution enabled, 7 different versions of Gradle would be running simultaneously. Could they be contending for a lock? I checked with lsof to see what file locks Gradle was using. Sure enough, 7 processes were competing for the same file lock, as shown below:
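One way to run this kind of check is sketched below. It is not necessarily the exact command used here: the helper name count_lock_holders is hypothetical, and it assumes lsof's default column layout (PID in the second column, file name in the last) and paths without spaces.

```shell
#!/bin/sh
# Hypothetical helper: reads default `lsof` output on stdin and prints,
# for every open file whose name ends in ".lock", the number of distinct
# PIDs that have it open, followed by the path.
count_lock_holders() {
  awk '
    $NF ~ /\.lock$/ {
      # Count each (file, PID) pair only once.
      if (!seen[$NF, $2]++) holders[$NF]++
    }
    END {
      for (path in holders) print holders[path], path
    }
  '
}

# Example invocation (assumes the Gradle processes show up as "java"):
#   lsof -c java 2>/dev/null | count_lock_holders
```

A count above 1 for the same .lock path means several processes have that lock file open at once, which is the situation described above.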

Cache Sharing Problem

This reminded me that testkit-gradle-plugin can set the Gradle Runner's cache directory via the org.gradle.testkit.dir system property or GradleRunner.withTestKitDir(File). The default is $TMPDIR/.gradle-test-kit-$USER. To reuse already-downloaded dependencies, I had changed it to ~/.gradle/, so that all the different Gradle versions shared one cache directory. But this meant all versions competed for the ~/.gradle/caches/build-cache-1/build-cache-1.lock lock. Without sharing, every dependency would need to be re-downloaded. This issue was reported as issue-851 back in 2016, but Gradle did not provide a solution until Gradle 6.1: "Copying and reusing the cache". Clearly the Gradle team did not consider this a high priority.

Following this thread, I checked the cache size under ~/.gradle/ and quietly closed my browser.

johnsonlee@johnsonlee:~/.gradle $ du -sh *
12K 6.0.1
92K build-scan-data
8.0K buildOutputCleanup
12G caches
1.7G daemon
4.0K gradle.properties
4.0K init.gradle
0 jdks
832K kotlin-profile
1.1M native
0 notifications
78M test-kit-daemon
0 workers
1.6G wrapper

You have got to be kidding me – COPY 12 GB? I might as well just re-download everything. Is there really no other way? Back in 2015 at Didi working on The One project, I had already discovered that Gradle could not truly parallelize builds across different projects. I worked around it using other means, as mentioned in Chapter 3: The Anti-Human Design People Complained About. Five years later, Gradle still had not fully solved this problem.

Having Your Cake and Eating It Too

Then it hit me – what about symlinks for cache sharing? As far as I know, all Gradle dependencies live under ~/.gradle/caches/modules-2/. So I just needed to create a modules-2 symlink in $TMPDIR/.gradle-test-kit-$USER/caches/ pointing to ~/.gradle/caches/modules-2/:

$ ln -s ~/.gradle/caches/modules-2 $TMPDIR/.gradle-test-kit-johnsonlee/caches/modules-2

And just like that, the directory link was in place:

johnsonlee@johnsonlee:$TMPDIR/.gradle-test-kit-johnsonlee/caches $ ll
total 4
drwxr-xr-x 15 johnsonlee 510 Oct 31 21:08 ./
drwxr-xr-x 6 johnsonlee 204 Oct 31 20:23 ../
drwxr-xr-x 8 johnsonlee 272 Oct 31 20:23 4.1/
drwxr-xr-x 12 johnsonlee 408 Oct 31 21:31 4.10.1/
drwxr-xr-x 7 johnsonlee 238 Oct 31 20:23 4.6/
drwxr-xr-x 13 johnsonlee 442 Oct 31 21:14 5.4.1/
drwxr-xr-x 13 johnsonlee 442 Oct 31 21:14 5.6.4/
drwxr-xr-x 13 johnsonlee 442 Oct 31 21:12 6.2/
drwxr-xr-x 12 johnsonlee 408 Oct 31 21:16 6.5/
drwxr-xr-x 901 johnsonlee 30634 Oct 31 21:58 jars-3/
drwxr-xr-x 779 johnsonlee 26486 Oct 31 21:58 jars-8/
drwxr-xr-x 5 johnsonlee 170 Oct 31 20:23 journal-1/
lrwxr-xr-x 1 johnsonlee 43 Oct 31 21:08 modules-2 -> /Users/johnsonlee/.gradle/caches/modules-2//
drwxr-xr-x 6 johnsonlee 204 Oct 31 21:31 transforms-1/
drwxr-xr-x 5 johnsonlee 170 Oct 31 21:12 transforms-2/
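For a CI machine or a fresh checkout, the same link can be created in a repeatable way. The following is a minimal sketch rather than what testkit-gradle-plugin actually does: the function name link_modules_cache and its two arguments are hypothetical, and it deliberately refuses to touch a real modules-2 directory that already exists in the TestKit cache.

```shell
#!/bin/sh
# Hypothetical helper: link <testkit_dir>/caches/modules-2 to the
# dependency cache in <gradle_home>/caches/modules-2.
link_modules_cache() {
  gradle_home=$1
  testkit_dir=$2
  src="$gradle_home/caches/modules-2"
  dst="$testkit_dir/caches/modules-2"

  # Nothing to share if the dependency cache does not exist yet.
  [ -d "$src" ] || { echo "no dependency cache at $src" >&2; return 1; }

  mkdir -p "$testkit_dir/caches"

  # Replace a stale symlink, but never delete a real directory.
  if [ -L "$dst" ]; then
    rm "$dst"
  elif [ -e "$dst" ]; then
    echo "$dst exists and is not a symlink; leaving it alone" >&2
    return 1
  fi

  ln -s "$src" "$dst"
}

# Example invocation, using the default TestKit location from the text:
#   link_modules_cache "$HOME/.gradle" "$TMPDIR/.gradle-test-kit-$USER"
```

Only modules-2 is linked, so each Gradle version keeps its own version-specific directories and lock files under the TestKit cache.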

Here are the actual results:

At the same 12-minute mark, the progress was noticeably faster than before. The entire build completed in 22m 40s – roughly a 2x speedup.

Despite the significant improvement, for engineers who pursue perfection, there are still some rough edges. The ~/.gradle/caches/build-cache-1 lock is gone, but another lock occasionally appears, as shown below:

As for this issue… to be continued.