By Leland Takamine, Brian Attwell

Last October, Uber’s Mobile Engineering team kicked off an effort to improve app performance, and we’ve made great progress so far with speedups of well over 50 percent for some of our key transitions. Early on, we learned that certain classes of performance issues are trivial to root cause. It’s easy to spot I/O on the main thread, for example. In those cases, we found conventional Android performance profiling tools quite sufficient for debugging.

More complex investigations, however, were sometimes rendered inconclusive due to incomplete or inaccurate data. Some questions we initially struggled to answer included “Why are certain animations slow to start?” and “Why is TextView inflation slow in some cases?” After running into limitations of the Android Studio CPU profiler, we built Nanoscope, an internal tool to provide us with better method tracing.

Since the implementation of our initial prototype, we’ve been using and iterating on the internal tracer, and are now able to confidently debug those difficult performance issues. Among other discoveries, we’ve since found that building animation hardware layers is more expensive than expected, and that TextView text autosizing is much slower if you don’t use granularity.

Our internal method tracer continues to provide us with unprecedented insight into the performance of our apps, so we’ve decided to share that tool with the rest of the Android community. Today, we are excited to release Nanoscope: an extremely accurate method tracing tool for Android.

1. Use the nanoscope command to start tracing.
2. Explore the flame graph in the Nanoscope visualizer.

Motivation

We understand the value of leveraging existing tools and believe that new tools warrant thorough justification, so before diving into how Nanoscope works, we’ll take a look at Android Studio’s performance tools and where they fell short for us.

Android Studio method tracing

Like Nanoscope, Android Studio provides method tracing functionality. The main blocker for us was the significant performance overhead introduced by Android Studio’s method tracing instrumentation.

Left: Nanoscope tracing. Right: Android Studio method tracing.
Time from click to beginning of animation.

Some of our key transitions ran multiple orders of magnitude slower with Android Studio method tracing enabled. Any method tracing will slow down runtime performance a certain amount due to extra logging logic, but at this level of distortion, the resulting performance profiles were no longer an accurate representation of normal app usage and were not useful for our performance investigations.

Android Studio method sampling

In addition to method tracing, Android Studio offers method sampling as an alternative that promises significantly reduced impact on runtime performance. We tested this feature out and it is indeed possible to sample with very little overhead by configuring the sampling frequency, but it comes with tradeoffs. At lower frequencies, fewer measurements are taken and thus the total overhead is reduced at the cost of precision, as depicted in Figure 1:

Figure 1: Low Frequency Sampling

The app runs smoothly in this case, but the trace is missing many important details. Higher frequencies produce a more complete trace but require more measurements, which increases the impact on performance, as depicted in Figure 2:

Figure 2: High Frequency Sampling
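
To make the trade-off concrete, here is a minimal Python sketch that samples a made-up timeline at a fixed interval. It is not how Android Studio's sampler is implemented; the timeline, the method names, and the `sample_methods` helper are all hypothetical and exist only for illustration.

```python
def sample_methods(timeline, interval_us):
    """Sample the timeline every interval_us microseconds.

    timeline: list of (method_name, start_us, end_us) tuples describing
    which method is on top of the stack at each moment (non-overlapping).
    Returns (set of methods observed, number of samples taken).
    """
    end_of_trace = max(end for _, _, end in timeline)
    seen = set()
    samples = 0
    t = 0
    while t < end_of_trace:
        samples += 1
        for name, start, end in timeline:
            if start <= t < end:
                seen.add(name)
                break
        t += interval_us
    return seen, samples


# Hypothetical timeline: two long methods and two short ones (microseconds).
TIMELINE = [
    ("inflate", 0, 900),
    ("measure", 900, 950),
    ("layout", 950, 1000),
    ("draw", 1000, 2000),
]

low_seen, low_samples = sample_methods(TIMELINE, interval_us=1000)
high_seen, high_samples = sample_methods(TIMELINE, interval_us=25)
```

With these made-up numbers, the 1,000 µs interval takes 2 samples and sees only `inflate` and `draw`, while the 25 µs interval sees all four methods but takes 80 samples, each of which costs time on a real device.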

With a production data feedback loop of around three weeks, it’s important that we have an accurate understanding of our code’s performance profile lest we discover three weeks after committing a fix that it failed to solve the issue or made things worse. We found that at any given frequency, Android Studio method sampling lacked either the detail or the accuracy that we required. At this point, we began to wonder if it was possible to have both.

Nanoscope design

Before building a prototype, the first decision to make was whether our tool would be trace-based or sample-based. To avoid any concerns around incomplete data, we landed on a trace-based implementation. We also theorized that an optimal sampling tool would be less efficient than an optimal tracing tool, because each sample needs to walk the entire stack while a trace measurement simply logs the current method identifier.

The solution we came to was an extremely low-overhead, trace-based tool that could give us the detail and accuracy required to confidently debug our performance issues. Early results from a prototype were promising and encouraged us to continue work on what would eventually become Nanoscope.

The level of performance we’ve achieved with Nanoscope relies on deep integration with the operating system. To accomplish this, we implemented Nanoscope as a fork of the Android Open Source Project (AOSP). This strategy is also Nanoscope’s biggest barrier to entry, since it requires a device running the custom operating system. (May 7, 2018 update: You can now use Nanoscope without flashing a device by launching the Nanoscope Emulator.) With full control over the OS, however, our approach is quite simple:

  1. Allocate an array to hold our trace data.
  2. On method entry:
    • Write the timestamp and method pointer to indicate a push to the call stack
  3. On method exit:
    • Write the timestamp and a null pointer to indicate a pop from the call stack
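
The steps above can be modeled in a few lines. The following Python sketch only illustrates the push/pop encoding and how a call tree can be recovered from it; it is not Nanoscope's implementation, which lives inside the Android runtime. `TraceBuffer`, `on_entry`, `on_exit`, and `durations` are hypothetical names, a Python list stands in for the preallocated array, and timestamps are passed in explicitly so the example is deterministic.

```python
class TraceBuffer:
    """Flat log of (timestamp, method) pairs; method is None for a pop."""

    def __init__(self):
        # Step 1: storage for the trace data (a list stands in for the array).
        self.entries = []

    def on_entry(self, timestamp, method):
        # Step 2: timestamp plus method marks a push onto the call stack.
        self.entries.append((timestamp, method))

    def on_exit(self, timestamp):
        # Step 3: timestamp plus None marks a pop from the call stack.
        self.entries.append((timestamp, None))


def durations(entries):
    """Replay the log with a stack.

    Returns (method, depth, duration) for each completed call,
    in the order the calls finished.
    """
    stack = []
    calls = []
    for timestamp, method in entries:
        if method is not None:
            stack.append((method, timestamp))
        else:
            name, start = stack.pop()
            calls.append((name, len(stack), timestamp - start))
    return calls


buf = TraceBuffer()
buf.on_entry(0, "onCreate")
buf.on_entry(10, "inflate")
buf.on_exit(70)    # inflate returns
buf.on_exit(100)   # onCreate returns
calls = durations(buf.entries)
```

Because a pop carries no method identifier, the log stays compact, and the matching push is recovered by replaying the entries against a stack, which is also enough to build a flame graph.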

Interpreter

Our first task was to instrument the interpreter. Luckily, all methods executed by the interpreter flow through a single method:

We’ve added the TraceStart and TraceEnd methods to do the minimal lifting of logging our trace data:

We first determine whether tracing is enabled for the thread by checking whether our trace data array exists. Then, we write a method identifier (or a nullptr for a pop) followed by the current timestamp, which is retrieved directly from a timer register for optimal performance.
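
As a rough Python model of that logic (the real methods are part of the forked runtime and are not reproduced here), the sketch below keeps one trace array per thread and treats a missing array as "tracing disabled". The names `trace_start`, `trace_end`, `enable_tracing`, and `disable_tracing` are assumptions for illustration, and `time.perf_counter_ns` stands in for reading a hardware timer register.

```python
import threading
import time

# Per-thread storage; a thread with no trace array is not being traced.
_state = threading.local()


def enable_tracing():
    """Allocate the trace array for the current thread."""
    _state.trace = []


def disable_tracing():
    """Stop tracing on this thread and return what was recorded."""
    trace = getattr(_state, "trace", None)
    _state.trace = None
    return trace


def _log(method):
    trace = getattr(_state, "trace", None)
    if trace is None:
        # Tracing is disabled for this thread: do nothing.
        return
    trace.append(method)                   # method identifier, or None for a pop
    trace.append(time.perf_counter_ns())   # followed by the current timestamp


def trace_start(method):
    _log(method)


def trace_end():
    _log(None)


trace_start("ignored")   # not recorded: tracing is not enabled yet
enable_tracing()
trace_start("Execute")
trace_end()
recorded = disable_tracing()
```

Here `recorded` holds four values: the method identifier, its entry timestamp, `None` for the pop, and the exit timestamp. The only work done on the disabled path is a single existence check.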

Compiler

Not all Java methods are executed by the interpreter. Some methods are AOT or JIT-compiled into machine instructions and executed directly. In these cases, we could generate a call to our TraceStart/TraceEnd methods, but we avoid a jump by inlining the equivalent assembly instructions at the start and end of each compiled method. Below is the 64-bit assembly we generate for method entry:

We generate similar instructions for method exits and also include support for the 32-bit compiler.

Results

We’ve obsessed over minimizing the logic executed per method, and we’re really proud of the results. While tracing, Nanoscope introduces only 20 nanoseconds of overhead per method and less than 10 percent total overhead in our startup sequence. In addition to the discoveries mentioned earlier, here are some example performance problems that we now understand in depth thanks to Nanoscope:

  • WebViews are slow to initialize only the first time due to the initialization of Chromium.
  • Much of the time spent in Google Maps initialization is due to classloading.
  • MenuView performs rebinding/layout/inflation on every click event (GitHub issue).

Nanoscope’s accuracy has also made it easy to categorize performance behavior locally instead of relying on averages of production measurements. We can now quickly answer the following questions, for example:

  • What percent of the transition is spent in View-related operations?
  • What percent of the transition is attributed to our platform libraries?
  • What percent of the transition is spent inside RxJava?
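
Questions like these reduce to attributing time to package prefixes. The sketch below is a minimal Python example that assumes the trace has already been converted to completed calls with their self time (time excluding callees). The call list, the package prefixes, and `percent_by_prefix` are hypothetical and are not part of Nanoscope's tooling.

```python
def percent_by_prefix(calls, prefixes):
    """Attribute self time to package prefixes.

    calls: list of (fully_qualified_method, self_time_ns).
    Returns {prefix: percent of total self time spent in matching methods}.
    """
    total = sum(self_time for _, self_time in calls)
    result = {}
    for prefix in prefixes:
        matched = sum(
            self_time for name, self_time in calls if name.startswith(prefix)
        )
        result[prefix] = 100.0 * matched / total if total else 0.0
    return result


# Made-up self times for one transition, in nanoseconds.
CALLS = [
    ("android.view.View.measure", 300),
    ("android.view.ViewGroup.layout", 200),
    ("io.reactivex.Observable.subscribe", 250),
    ("com.example.app.MainActivity.onCreate", 250),
]

shares = percent_by_prefix(CALLS, ["android.view.", "io.reactivex."])
```

With this made-up data, `shares` attributes 50 percent of the transition to `android.view.` and 25 percent to `io.reactivex.`. Using self time rather than total time avoids counting a callee's time again in its caller.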

Since we began using Nanoscope, distorted or missing data has no longer been an obstacle for us when debugging Android performance issues at Uber.

Next steps

If you want to learn more about the architecture, check out the wiki, and if you are interested in improving your app’s performance, please consider giving Nanoscope a try.

There are plenty of other interesting problems to tackle at Uber and we’re hiring!