A performance comparison between Java and C on the Nexus 5

Android phones have been growing ever more powerful with time, with the Nexus 5 sporting a quad-core 2.3 GHz Krait 400; this is a very powerful CPU for a mobile phone. With most Android apps being written in Java, does Java allow us to access all of that power? Or, put another way, is Java efficient enough, allowing tasks to complete more quickly and allowing the CPU to idle more, saving precious battery life?

In this post, I will take a look at a DSP filter adapted from coefficients generated with mkfilter, and compare three different implementations: one in C, one in Java, and one in Java with some manual optimizations. The source for these tests can be downloaded at the end of this post.

To compare the results, I ran the filter over an array of random data on the Nexus 5, and the compared the results to the fastest implementation. In the following table, a lower runtime is better, with the fastest implementation getting a relative runtime of 1.

Execution environment Options Relative runtime (lower is better)
gcc 4.8 1.00
gcc 4.8 (LOCAL_ARM_NEON := true) -ffast-math -O3 1.02
gcc 4.8 -ffast-math -O3 1.05
clang 3.4 (LOCAL_ARM_NEON := true) -ffast-math -O3 1.27
clang 3.4 -ffast-math -O3 1.42
clang 3.4 1.43
ART (manually-optimized) 2.22
Dalvik (manually-optimized) 2.87
ART (normal code) 7.99
Dalvik (normal code) 17.78

The statically-compiled C code gave the best execution times, followed by ART and then by Dalvik. The C code uses JNI via GetShortArrayRegion and SetShortArrayRegion to marshal the data from Java to C, and then back from C to Java once processing has completed.

The best performance came courtesy of GCC 4.8, with little variation between the different additional optimization options. Clang’s ARM builds are not quite as optimized as GCC’s; toggling LOCAL_ARM_NEON := true in the NDK makefile also makes a clear difference in performance.

Even the slowest native build using clang is not more than 43% slower than the best native build using gcc. Once we switch to Java, the variance starts to increase significantly, with the best runtime about 2.2x slower than native code, and the worst runtime a staggering 17.8x slower.

What explains the large difference? For one, it appears that both ART and Dalvik are limited in the amount of static optimizations that they are capable of. This is understandable in the case of Dalvik, since it uses a JIT and it’s also much older, but it is disappointing in the case of ART, since it uses ahead-of-time compilation.

Is there a way to speed up the Java code? I decided to try it out, by applying the same static optimizations I would have expected the compiler to do, like converting modulo to bit masks and inlining function calls. These changes resulted in one massive and hard to read function, but they also dramatically improved the runtime performance, with Dalvik speeding up from a 17.8x penalty to 2.9x, and ART speeding up from an 8.0x penalty to 2.2x.

The downside of this is that the code has to be abused to get this additional performance, and it still doesn’t come close to matching the ahead-of-time code generated by gcc and clang, which can surpass that performance without similar abuse of the code. The NDK is still a viable option for those looking for improved performance and more efficient code which consumes less battery over time.

Just for fun, I decided to try things out on a laptop with a 2.6 GHz Intel Core i7. For this table, the relative results are in the other direction, with 1x corresponding to the best time on the Nexus 5, 2x being twice as fast, and so on. The table starts with the best results first, as before.

Execution environment Options Relative speed (higher is better)
clang 3.4 -O3 -ffast-math -flto 8.38x
clang 3.4 -O3 -ffast-math 6.09x
Java SE 1.7u51 (manually-optimized) -XX:+AggressiveOpts 5.25x
Java SE 1.6u65 (manually-optimized) 3.85x
Java SE 1.6 (normal code) 2.44x

As on the Nexus 5, the C code runs faster, but to Java’s credit, the gap between the best & worst result is less than 4x, which is much less variance than we see with Dalvik or ART. Java 1.6 and 1.7 are very close to each other, unless “-XX:+AggressiveOpts” is used; with that option enabled, 1.7 is able to pull ahead.

There is still an unfortunate gap between the “normal” code and the manually-optimized code, which really should be closable with static analysis and inlining.

The other interesting result is that the gap between mobile and PC is closing over time, and even more so if you take power consumption into account. It’s quite impressive to see that as far as single-core performance goes, the PC and smartphone are closer than ever.

Conclusion

Recent Android devices are getting very powerful, and with the new ART runtime, common Java code can be executed quickly enough to keep user interfaces responsive and users happy.

Sometimes, though, we need to go further, and write demanding code that needs to run quickly and efficiently. With the latest Android devices, these algorithms may be able to run quickly enough in the Dalvik VM or with ART, but then we have to ask ourselves: is the benefit of using a single language worth the cost of lower performance? This isn’t just an academic question: lower performance means that we need to ask our users to give us more CPU cycles, which shortens their device’s battery life, heats up their phones, and makes them wait longer for results, and all because we didn’t want to write the code in another language.

For these reasons, writing some of our code in C/C++, FORTRAN, or another native language can still make a lot of sense.

For more reading on this topic, check out How Powerful is Your Nexus 7?

Source

dsp.c
#include "dsp.h"
#include <algorithm>
#include <cstdint>
#include <limits>

static constexpr int int16_min = std::numeric_limits<int16_t>::min();
static constexpr int int16_max = std::numeric_limits<int16_t>::max();

static inline int16_t clamp(int input)
{
     return std::max(int16_min, std::min(int16_max, input));
}

static inline int get_offset(const FilterState& filter_state, int relative_offset)
{
     return (filter_state.current + relative_offset) % filter_state.size;
}

static inline void push_sample(FilterState& filter_state, int16_t sample)
{
     filter_state.input[get_offset(filter_state, 0)] = sample;
     ++filter_state.current;
}

static inline int16_t get_output_sample(const FilterState& filter_state)
{
     return clamp(filter_state.output[get_offset(filter_state, 0)]);
}

static inline void apply_lowpass(FilterState& filter_state)
{
     double* x = filter_state.input;
     double* y = filter_state.output;

     y[get_offset(filter_state, 0)] =
       (  1.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state, -10)] + x[get_offset(filter_state,  -0)]))
     + ( 10.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -9)] + x[get_offset(filter_state,  -1)]))
     + ( 45.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -8)] + x[get_offset(filter_state,  -2)]))
     + (120.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -7)] + x[get_offset(filter_state,  -3)]))
     + (210.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -6)] + x[get_offset(filter_state,  -4)]))
     + (252.0 * (1.0 / 6.928330802e+06) *  x[get_offset(filter_state,  -5)])

     + (  -0.4441854896 * y[get_offset(filter_state, -10)])
     + (   4.2144719035 * y[get_offset(filter_state,  -9)])
     + ( -18.5365677633 * y[get_offset(filter_state,  -8)])
     + (  49.7394321983 * y[get_offset(filter_state,  -7)])
     + ( -90.1491003509 * y[get_offset(filter_state,  -6)])
     + ( 115.3235358151 * y[get_offset(filter_state,  -5)])
     + (-105.4969191433 * y[get_offset(filter_state,  -4)])
     + (  68.1964705422 * y[get_offset(filter_state,  -3)])
     + ( -29.8484881821 * y[get_offset(filter_state,  -2)])
     + (   8.0012026712 * y[get_offset(filter_state,  -1)]);
}

void apply_lowpass(FilterState& filter_state, const int16_t* input, int16_t* output, int length)
{
     for (int i = 0; i < length; ++i) {
          push_sample(filter_state, input[i]);
          apply_lowpass(filter_state);
          output[i] = get_output_sample(filter_state);
     }
}
dsp.h
#include <cstdint>

struct FilterState {
	static constexpr int size = 16;

    double input[size];
    double output[size];
	unsigned int current;

	FilterState() : input{}, output{}, current{} {}
};

void apply_lowpass(FilterState& filter_state, const int16_t* input, int16_t* output, int length);

Here is the Java adaptation of the C code:

package com.example.perftest;

import com.example.perftest.DspJavaManuallyOptimized.FilterState;

public class DspJava {
	public static class FilterState {
		static final int size = 16;

		final double input[] = new double[size];
		final double output[] = new double[size];

		int current;
	}

	static short clamp(short input) {
		return (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, input));
	}

	static int getOffset(FilterState filterState, int relativeOffset) {
		return ((filterState.current + relativeOffset) % FilterState.size + FilterState.size) % FilterState.size;
	}

	static void pushSample(FilterState filterState, short sample) {
		filterState.input[getOffset(filterState, 0)] = sample;
		++filterState.current;
	}

	static short getOutputSample(FilterState filterState) {
		return clamp((short) filterState.output[getOffset(filterState, 0)]);
	}
	
	static void applyLowpass(FilterState filterState) {
		final double[] x = filterState.input;
		final double[] y = filterState.output;

		y[getOffset(filterState, 0)] =
		   (  1.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState, -10)] + x[getOffset(filterState,  -0)]))
		 + ( 10.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -9)] + x[getOffset(filterState,  -1)]))
		 + ( 45.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -8)] + x[getOffset(filterState,  -2)]))
		 + (120.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -7)] + x[getOffset(filterState,  -3)]))
		 + (210.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -6)] + x[getOffset(filterState,  -4)]))
		 + (252.0 * (1.0 / 6.928330802e+06) *  x[getOffset(filterState,  -5)])

		 + (  -0.4441854896 * y[getOffset(filterState, -10)])
		 + (   4.2144719035 * y[getOffset(filterState,  -9)])
		 + ( -18.5365677633 * y[getOffset(filterState,  -8)])
		 + (  49.7394321983 * y[getOffset(filterState,  -7)])
		 + ( -90.1491003509 * y[getOffset(filterState,  -6)])
		 + ( 115.3235358151 * y[getOffset(filterState,  -5)])
		 + (-105.4969191433 * y[getOffset(filterState,  -4)])
		 + (  68.1964705422 * y[getOffset(filterState,  -3)])
		 + ( -29.8484881821 * y[getOffset(filterState,  -2)])
		 + (   8.0012026712 * y[getOffset(filterState,  -1)]);
	}

	public static void applyLowpass(FilterState filterState, short[] input, short[] output, int length) {
		for (int i = 0; i < length; ++i) {
			pushSample(filterState, input[i]);
			applyLowpass(filterState);
			output[i] = getOutputSample(filterState);
		}
	}
}

Since all of the Java runtimes tested don’t exploit static optimization opportunities as well as it seems that they could, here is an optimized version that has been inlined and has the modulo replaced with a bit mask:

package com.example.perftest;

public class DspJavaManuallyOptimized {
	public static class FilterState {
		static final int size = 16;

		final double input[] = new double[size];
		final double output[] = new double[size];

		int current;
	}

	public static void applyLowpass(FilterState filterState, short[] input, short[] output, int length) {
		for (int i = 0; i < length; ++i) {
			filterState.input[(filterState.current + 0) & (FilterState.size - 1)] = input[i];
			++filterState.current;
			final double[] x = filterState.input;
			final double[] y = filterState.output;

			y[(filterState.current + 0) & (FilterState.size - 1)] =
			   (  1.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -10) & (FilterState.size - 1)] + x[(filterState.current + -0) & (FilterState.size - 1)]))
			 + ( 10.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -9) & (FilterState.size - 1)] + x[(filterState.current + -1) & (FilterState.size - 1)]))
			 + ( 45.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -8) & (FilterState.size - 1)] + x[(filterState.current + -2) & (FilterState.size - 1)]))
			 + (120.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -7) & (FilterState.size - 1)] + x[(filterState.current + -3) & (FilterState.size - 1)]))
			 + (210.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -6) & (FilterState.size - 1)] + x[(filterState.current + -4) & (FilterState.size - 1)]))
			 + (252.0 * (1.0 / 6.928330802e+06) *  x[(filterState.current + -5) & (FilterState.size - 1)])

			 + (  -0.4441854896 * y[(filterState.current + -10) & (FilterState.size - 1)])
			 + (   4.2144719035 * y[(filterState.current + -9) & (FilterState.size - 1)])
			 + ( -18.5365677633 * y[(filterState.current + -8) & (FilterState.size - 1)])
			 + (  49.7394321983 * y[(filterState.current + -7) & (FilterState.size - 1)])
			 + ( -90.1491003509 * y[(filterState.current + -6) & (FilterState.size - 1)])
			 + ( 115.3235358151 * y[(filterState.current + -5) & (FilterState.size - 1)])
			 + (-105.4969191433 * y[(filterState.current + -4) & (FilterState.size - 1)])
			 + (  68.1964705422 * y[(filterState.current + -3) & (FilterState.size - 1)])
			 + ( -29.8484881821 * y[(filterState.current + -2) & (FilterState.size - 1)])
			 + (   8.0012026712 * y[(filterState.current + -1) & (FilterState.size - 1)]);
			output[i] = (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, (short) filterState.output[(filterState.current + 0) & (FilterState.size - 1)]));
		}
	}
}

About the book

Android is booming like never before, with millions of devices shipping every day. In OpenGL ES 2 for Android: A Quick-Start Guide, you’ll learn all about shaders and the OpenGL pipeline, and discover the power of OpenGL ES 2.0, which is much more feature-rich than its predecessor.

It’s never been a better time to learn how to create your own 3D games and live wallpapers. If you can program in Java and you have a creative vision that you’d like to share with the world, then this is the book for you.

Share

44 thoughts on “A performance comparison between Java and C on the Nexus 5”

  1. How fast was Renderscript? That’s the recommended way to write this type of code on Android. It should automatically parallelize, so up to 4 times faster than your c code.

    1. I tried porting the code to Renderscript, but I don’t have experience using Renderscript and I didn’t succeed in getting it to work. If you have an implementation I can plug in and test, I’d be glad to add it to the comparison.

  2. Did you consider using Proguard to optimize je Java code and see how the results are affected?
    It is included in the Android SDK – but not all the optimizations are enabled by default.

    1. I didn’t try it, but I’m not sure how much of a difference it would make on code that is run through a JIT / AOT compiler. I agree it’s worth a shot, and I can try it in a future update.

  3. Not very interesting since 95% of code is just Gluing gui and IO together.

    Also your comparison probably does not ensure that the JIT kicked in.

    1. “Not very interesting since 95% of code is just Gluing gui and IO together.”

      Well, your experiences and opinions are not necessarily representative of everyone else’s. :)

      “Also your comparison probably does not ensure that the JIT kicked in.”

      How much time does the JIT need before it can “kick in”? 1 minute? 5 minutes? 30 minutes? ART also uses ahead-of-time compilation.

    1. Thank you, just noticed it myself as well. Amazing how many times we can read over something and our brain auto-fills in what we expected to read. ;)

  4. Now, correct me if I’m wrong, being that filterState.current is part of a class, and I’m assuming that the Java compiler would compile that to be located in a contiguous block of memory, wouldn’t you gain a further optimization by pulling filterState.current out into a final loop variable that would be located on the stack, and therefore more temporally favorable, cache wise?

    I could be totally off base, but I think it might help if you were to take that out and not get it by referencing through filterState. I’m further supposing that being that this is single threaded, there are no other actors upon this variable, meaning that placing it on the stack, before the matrix calculations, might accomplish something without negative side-affects.

    Like I said though, please correct me if I’m wrong.

    1. I was hoping that would be obvious enough for the compilers to do on their own ;) You’re right, though, this would be good to test out for the “manually-optimized” Java to see if it can speed things up further.

    1. The test driver code is quite simple so I didn’t want to lengthen the post by posting it here since it can easily be reproduced; if there’s enough demand, I’ll clean it up and post it.

  5. Sorry, I call complete bullshit. Once a JIT compiles a repeated section of code, it is comparable with native, and this is the most important point. Your code does not have as many code paths as you think ever.

    If java can new an object faster than malloc, (which it can and does), and the JIT can outrun native code which cannot self-optimize, why are your numbers off? Because these are bullshit tests

    1. JITs have a set of compromises, both inherent in the safety guarantees and semantics of the language and in the time spent on optimization. If you really feel that this is bullshit, the algorithm implementations are there; please feel free to publish your own results. I’m not really vested in the results either way, just thought I had something interesting to share.

  6. To take advantage of these performance gains would you you need to write the entire app in C++? Or can you call optimized executable binaries from Java?

    1. The way I’m doing it is by calling C++ from Java via JNI (Java Native Interface), so you can definitely write apps in both languages. With the NDK, your native code gets compiled to a shared library that you can call from Java, and this is also, in fact, how some of Android’s Java APIs are implemented. There are some overhead & gotcha issues to consider, but those can be minimized by reducing the “surface” of the interface between both languages.

  7. Malloc is actually very slow, even slower if your memory is fragmented. Optimizing memory allocation in C is very viable, and depending on the context you can easily outrun malloc if you really wanted to.

  8. At the moment AggressiveOpts doesn’t really do anything which makes me question *why* you got a performance improvement while using it. Also how the results are reported suggests that this was run as a proper set of experiments. I’m don’t really have the time to run these experiments myself to tell you exactly why this benchmark is busted but the result don’t give my any confidence to conclude anything other than this whole exercise smells like a yet another borked micro benchmark.

    1. Why do you say it doesn’t do anything? I’m curious. As for the benchmark, I wasn’t expecting the post to get this much attention! I’ll prepare a follow-up and share the entire project so people can compile it for themselves, and try out alternatives to see if one or the other can be improved.

  9. Thanks for your article. Having written some benchmark related posts myself I’m somewhat biased, but it’s always striking to see what harsh comments are made from well informed and less informed, but not necessarily less harsh, people and how litte the work for such a post is appreciated.

    1. Thank you for the kind words, Stefan! This one definitely stirred up some controversy. ;) It will be worth doing a follow-up post to try out some of the suggestions made by the commentators.

  10. I work on the Renderscript team and would like to help creating a Renderscript version. Can you send me the complete code?

    1. Hi Jean-Luc, that would be very kind of you guys! I did reply to your email a couple of days ago; hopefully the reply didn’t get lost in spam.

  11. Hey, if you put a Renderscript version, would you please post it here?
    Also, can you please put a simple app project (maybe on GitHub) with all of the different tests?

    BTW, the code on this article has a lot of “&” written with a prefix (like “&” and “>” ) .

    1. If I don’t hear back from the Google guys about Renderscript by a couple of weeks from now, I’ll just go ahead and move the code to GitHub; that should help out with the ampersand issues that keep coming back. ;)

  12. Some apps have important parts that must be as quick as possible:
    games, image/audio/video processing, complex 3d rendering …

    So, even if 5% of the code is doing this, the user should not wait so long for the results.

    Also, the Dalvik heap is quite restricted in terms of how much MB it can hold, while C/C++ is restricted only to what the device has, which is more than enough in most cases.
    I’ve even made a tiny C/C++ library that demostrates it , by allowing you to handle bitmap data, even if it’s quite large :
    https://github.com/AndroidDeveloperLB/AndroidJniBitmapOperations

    1. Definitely agreed! Also, I believe that we can sometimes forget about the other side of the coin — energy & memory usage. More efficiency means that our app is a better citizen on the user’s device, which equals happier users. This doesn’t automatically = NDK, but it also means that it shouldn’t be ruled out, either.

      I wonder if Google would still go with the VM solution if they were making a decision today. Apple has shown that there are ways of reducing some of the pain of native programming with technologies like ARC, and the ease of integration between C, Objective C, and C++ on Apple’s platform is something that I envy when writing JNI code using the NDK.

      I’m curious about the heap restrictions — is that actually still true on recent versions of Android? I would have expected Google to fix that at some point since it seems like an oversight on their part.

  13. I started working on it today. Sorry, other work had to take priority :( We’re not forgetting you!

    1. No worries! I know how busy you guys can get, so I’m very grateful that you’d be willing to take your time for this. No rush!

  14. An update will be coming soon within the next few weeks, with thanks to Jean-Luc Brouillet for porting the test code to Renderscript! When it’s ready, I’ll release the entire project to GitHub so everyone can tinker with it and explore.

Add Comment Register



Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>