Tag Archives: Clang

A performance comparison between Java and C on the Nexus 5

Android phones have been growing ever more powerful with time, with the Nexus 5 sporting a auad-core 2.3 GHz Krait 400; this is a very powerful CPU for a mobile phone. With most Android apps being written in Java, does Java allow us to access all of that power? Or, put another way, is Java efficient enough, allowing tasks to complete more quickly and allowing the CPU to idle more, saving precious battery life?

In this post, I will take a look at a DSP filter adapted from coefficients generated with mkfilter, and compare three different implementations: one in C, one in Java, and one in Java with some manual optimizations. The source for these tests can be downloaded at the end of this post.

To compare the results, I ran the filter over an array of random data on the Nexus 5, and the compared the results to the fastest implementation. In the following table, a lower runtime is better, with the fastest implementation getting a relative runtime of 1.

Execution environment Options Relative runtime (lower is better)
gcc 4.8 1.00
gcc 4.8 (LOCAL_ARM_NEON := true) -ffast-math -O3 1.02
gcc 4.8 -ffast-math -O3 1.05
clang 3.4 (LOCAL_ARM_NEON := true) -ffast-math -O3 1.27
clang 3.4 -ffast-math -O3 1.42
clang 3.4 1.43
ART (manually-optimized) 2.22
Dalvik (manually-optimized) 2.87
ART (normal code) 7.99
Dalvik (normal code) 17.78

The statically-compiled C code gave the best execution times, followed by ART and then by Dalvik. The C code uses JNI via GetShortArrayRegion and SetShortArrayRegion to marshal the data from Java to C, and then back from C to Java once processing has completed.

The best performance came courtesy of GCC 4.8, with little variation between the different additional optimization options. Clang’s ARM builds are not quite as optimized as GCC’s; toggling LOCAL_ARM_NEON := true in the NDK makefile also makes a clear difference in performance.

Even the slowest native build using clang is not more than 43% slower than the best native build using gcc. Once we switch to Java, the variance starts to increase significantly, with the best runtime about 2.2x slower than native code, and the worst runtime a staggering 17.8x slower.

What explains the large difference? For one, it appears that both ART and Dalvik are limited in the amount of static optimizations that they are capable of. This is understandable in the case of Dalvik, since it uses a JIT and it’s also much older, but it is disappointing in the case of ART, since it uses ahead-of-time compilation.

Is there a way to speed up the Java code? I decided to try it out, by applying the same static optimizations I would have expected the compiler to do, like converting modulo to bit masks and inlining function calls. These changes resulted in one massive and hard to read function, but they also dramatically improved the runtime performance, with Dalvik speeding up from a 17.8x penalty to 2.9x, and ART speeding up from an 8.0x penalty to 2.2x.

The downside of this is that the code has to be abused to get this additional performance, and it still doesn’t come close to matching the ahead-of-time code generated by gcc and clang, which can surpass that performance without similar abuse of the code. The NDK is still a viable option for those looking for improved performance and more efficient code which consumes less battery over time.

Just for fun, I decided to try things out on a laptop with a 2.6 GHz Intel Core i7. For this table, the relative results are in the other direction, with 1x corresponding to the best time on the Nexus 5, 2x being twice as fast, and so on. The table starts with the best results first, as before.

Execution environment Options Relative speed (higher is better)
clang 3.4 -O3 -ffast-math -flto 8.38x
clang 3.4 -O3 -ffast-math 6.09x
Java SE 1.7u51 (manually-optimized) -XX:+AggressiveOpts 5.25x
Java SE 1.6u65 (manually-optimized) 3.85x
Java SE 1.6 (normal code) 2.44x

As on the Nexus 5, the C code runs faster, but to Java’s credit, the gap between the best & result result is less than 4x, which is much less variance than we see with Dalvik or ART. Java 1.6 and 1.7 are very close to each other, unless “-XX:+AggressiveOpts” is used; with that option enabled, 1.7 is able to pull ahead.

There is still an unfortunate gap between the “normal” code and the manually-optimized code, which really should be closable with static analysis and inlining.

The other interesting result is that the gap between mobile and PC is closing over time, and even more so if you take power consumption into account. It’s quite impressive to see that as far as single-core performance goes, the PC and smartphone are closer than ever.

Conclusion

Recent Android devices are getting very powerful, and with the new ART runtime, common Java code can be executed quickly enough to keep user interfaces responsive and users happy.

Sometimes, though, we need to go further, and write demanding code that needs to run quickly and efficiently. With the latest Android devices, these algorithms may be able to run quickly enough in the Dalvik VM or with ART, but then we have to ask ourselves: is the benefit of using a single language worth the cost of lower performance? This isn’t just an academic question: lower performance means that we need to ask our users to give us more CPU cycles, which shortens their device’s battery life, heats up their phones, and makes them wait longer for results, and all because we didn’t want to write the code in another language.

For these reasons, writing some of our code in C/C++, FORTRAN, or another native language can still make a lot of sense.

For more reading on this topic, check out How Powerful is Your Nexus 7?

Source

dsp.c
#include "dsp.h"
#include <algorithm>
#include <cstdint>
#include <limits>

static constexpr int int16_min = std::numeric_limits<int16_t>::min();
static constexpr int int16_max = std::numeric_limits<int16_t>::max();

static inline int16_t clamp(int input)
{
     return std::max(int16_min, std::min(int16_max, input));
}

static inline int get_offset(const FilterState& filter_state, int relative_offset)
{
     return (filter_state.current + relative_offset) % filter_state.size;
}

static inline void push_sample(FilterState& filter_state, int16_t sample)
{
     filter_state.input[get_offset(filter_state, 0)] = sample;
     ++filter_state.current;
}

static inline int16_t get_output_sample(const FilterState& filter_state)
{
     return clamp(filter_state.output[get_offset(filter_state, 0)]);
}

static inline void apply_lowpass(FilterState& filter_state)
{
     double* x = filter_state.input;
     double* y = filter_state.output;

     y[get_offset(filter_state, 0)] =
       (  1.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state, -10)] + x[get_offset(filter_state,  -0)]))
     + ( 10.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -9)] + x[get_offset(filter_state,  -1)]))
     + ( 45.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -8)] + x[get_offset(filter_state,  -2)]))
     + (120.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -7)] + x[get_offset(filter_state,  -3)]))
     + (210.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -6)] + x[get_offset(filter_state,  -4)]))
     + (252.0 * (1.0 / 6.928330802e+06) *  x[get_offset(filter_state,  -5)])

     + (  -0.4441854896 * y[get_offset(filter_state, -10)])
     + (   4.2144719035 * y[get_offset(filter_state,  -9)])
     + ( -18.5365677633 * y[get_offset(filter_state,  -8)])
     + (  49.7394321983 * y[get_offset(filter_state,  -7)])
     + ( -90.1491003509 * y[get_offset(filter_state,  -6)])
     + ( 115.3235358151 * y[get_offset(filter_state,  -5)])
     + (-105.4969191433 * y[get_offset(filter_state,  -4)])
     + (  68.1964705422 * y[get_offset(filter_state,  -3)])
     + ( -29.8484881821 * y[get_offset(filter_state,  -2)])
     + (   8.0012026712 * y[get_offset(filter_state,  -1)]);
}

void apply_lowpass(FilterState& filter_state, const int16_t* input, int16_t* output, int length)
{
     for (int i = 0; i < length; ++i) {
          push_sample(filter_state, input[i]);
          apply_lowpass(filter_state);
          output[i] = get_output_sample(filter_state);
     }
}
dsp.h
#include <cstdint>

struct FilterState {
	static constexpr int size = 16;

    double input[size];
    double output[size];
	unsigned int current;

	FilterState() : input{}, output{}, current{} {}
};

void apply_lowpass(FilterState& filter_state, const int16_t* input, int16_t* output, int length);

Here is the Java adaptation of the C code:

package com.example.perftest;

import com.example.perftest.DspJavaManuallyOptimized.FilterState;

public class DspJava {
	public static class FilterState {
		static final int size = 16;

		final double input[] = new double[size];
		final double output[] = new double[size];

		int current;
	}

	static short clamp(short input) {
		return (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, input));
	}

	static int getOffset(FilterState filterState, int relativeOffset) {
		return ((filterState.current + relativeOffset) % FilterState.size + FilterState.size) % FilterState.size;
	}

	static void pushSample(FilterState filterState, short sample) {
		filterState.input[getOffset(filterState, 0)] = sample;
		++filterState.current;
	}

	static short getOutputSample(FilterState filterState) {
		return clamp((short) filterState.output[getOffset(filterState, 0)]);
	}
	
	static void applyLowpass(FilterState filterState) {
		final double[] x = filterState.input;
		final double[] y = filterState.output;

		y[getOffset(filterState, 0)] =
		   (  1.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState, -10)] + x[getOffset(filterState,  -0)]))
		 + ( 10.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -9)] + x[getOffset(filterState,  -1)]))
		 + ( 45.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -8)] + x[getOffset(filterState,  -2)]))
		 + (120.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -7)] + x[getOffset(filterState,  -3)]))
		 + (210.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -6)] + x[getOffset(filterState,  -4)]))
		 + (252.0 * (1.0 / 6.928330802e+06) *  x[getOffset(filterState,  -5)])

		 + (  -0.4441854896 * y[getOffset(filterState, -10)])
		 + (   4.2144719035 * y[getOffset(filterState,  -9)])
		 + ( -18.5365677633 * y[getOffset(filterState,  -8)])
		 + (  49.7394321983 * y[getOffset(filterState,  -7)])
		 + ( -90.1491003509 * y[getOffset(filterState,  -6)])
		 + ( 115.3235358151 * y[getOffset(filterState,  -5)])
		 + (-105.4969191433 * y[getOffset(filterState,  -4)])
		 + (  68.1964705422 * y[getOffset(filterState,  -3)])
		 + ( -29.8484881821 * y[getOffset(filterState,  -2)])
		 + (   8.0012026712 * y[getOffset(filterState,  -1)]);
	}

	public static void applyLowpass(FilterState filterState, short[] input, short[] output, int length) {
		for (int i = 0; i < length; ++i) {
			pushSample(filterState, input[i]);
			applyLowpass(filterState);
			output[i] = getOutputSample(filterState);
		}
	}
}

Since all of the Java runtimes tested don’t exploit static optimization opportunities as well as it seems that they could, here is an optimized version that has been inlined and has the modulo replaced with a bit mask:

package com.example.perftest;

public class DspJavaManuallyOptimized {
	public static class FilterState {
		static final int size = 16;

		final double input[] = new double[size];
		final double output[] = new double[size];

		int current;
	}

	public static void applyLowpass(FilterState filterState, short[] input, short[] output, int length) {
		for (int i = 0; i < length; ++i) {
			filterState.input[(filterState.current + 0) & (FilterState.size - 1)] = input[i];
			++filterState.current;
			final double[] x = filterState.input;
			final double[] y = filterState.output;

			y[(filterState.current + 0) & (FilterState.size - 1)] =
			   (  1.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -10) & (FilterState.size - 1)] + x[(filterState.current + -0) & (FilterState.size - 1)]))
			 + ( 10.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -9) & (FilterState.size - 1)] + x[(filterState.current + -1) & (FilterState.size - 1)]))
			 + ( 45.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -8) & (FilterState.size - 1)] + x[(filterState.current + -2) & (FilterState.size - 1)]))
			 + (120.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -7) & (FilterState.size - 1)] + x[(filterState.current + -3) & (FilterState.size - 1)]))
			 + (210.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -6) & (FilterState.size - 1)] + x[(filterState.current + -4) & (FilterState.size - 1)]))
			 + (252.0 * (1.0 / 6.928330802e+06) *  x[(filterState.current + -5) & (FilterState.size - 1)])

			 + (  -0.4441854896 * y[(filterState.current + -10) & (FilterState.size - 1)])
			 + (   4.2144719035 * y[(filterState.current + -9) & (FilterState.size - 1)])
			 + ( -18.5365677633 * y[(filterState.current + -8) & (FilterState.size - 1)])
			 + (  49.7394321983 * y[(filterState.current + -7) & (FilterState.size - 1)])
			 + ( -90.1491003509 * y[(filterState.current + -6) & (FilterState.size - 1)])
			 + ( 115.3235358151 * y[(filterState.current + -5) & (FilterState.size - 1)])
			 + (-105.4969191433 * y[(filterState.current + -4) & (FilterState.size - 1)])
			 + (  68.1964705422 * y[(filterState.current + -3) & (FilterState.size - 1)])
			 + ( -29.8484881821 * y[(filterState.current + -2) & (FilterState.size - 1)])
			 + (   8.0012026712 * y[(filterState.current + -1) & (FilterState.size - 1)]);
			output[i] = (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, (short) filterState.output[(filterState.current + 0) & (FilterState.size - 1)]));
		}
	}
}
Share

Site Updates, and Thoughts on Native Development for the Web

I’ve recently been spending time travelling overseas, taking a bit of a break after reaching an important milestone with the book, and also taking a bit of a rest from working for myself! The trip has been good so far, and I’ve even been keeping up to date with items from the RSS feed. Here is some of the news that I wanted to share with y’all, as well as to get your thoughts:

Book nearing production

OpenGL ES for Android: A Quick-Start Guide reached its final beta a couple of weeks ago, and is now being readied to be sent off to the printers. I would like to thank everyone again for their feedback and support; I am so grateful for it, and happy that the book is now going out the door. I’d also like to give a special thanks to Mario Zechner, the creator behind libgdx and Beginning Android Games, for generously contributing his foreword and a lot of valuable feedback!

Site news

Not too long ago, I decided to add a new forums section to the site to hopefully build up some more community involvement and get a two-way dialogue going; unfortunately, things didn’t quite take off. The forums have also suffered from spam and some technical issues, and recently I was even locked out of the forum administration. I have no idea what happened or how to fix it, so since the posting rate was low, I am just putting the forums on ice for now.

I’d still love to find a way to have some more discussions happening on the site. In which other ways do you believe that I could improve the site so that I could encourage this? I’d love to hear your thoughts.

Topics to explore further

I’ve also been thinking about new topics to explore and write about, as a lot of exciting things are happening with 3D on the mobile and web. One big trend that seems to be taking place: Native is making a comeback.

For many years,  C and C++ were proclaimed to be dead languages, lingering around only for legacy reasons, and soon to be replaced by the glorious world of managed languages. Having started out my own development career in Java, I can agree that the Java world does have a lot of advantages. The language is easier to learn than a behemoth like C++, and, at least on the desktop, the performance on the JVM can even come close to rivalling native languages.

So, why the resurgence in C and C++? Here are some of my thoughts:

  • The world is not just limited to the desktop anymore, and there are more important platforms to target than ever before. C and C++ excel at cross-platform portability, as just about every platform has a C/C++ compiler. By contrast, the JVM and .NET runtimes are limited to certain platforms, and Android’s Dalvik VM is not as good as the JVM in producing fast, efficient JIT compiled code. Yes, there are bytecode translators and commercial alternatives such as Xamarin’s Mono platforms for mobile, but this comes with its own set of disadvantages.
  • Resource usage can be more important than programmer productivity. This is true in big, expensive data centers, and it’s also true on mobile, where smaller downloads and lower battery usage can lead to happier customers.
  • C and C++ are still king when it comes to fast, efficient compiled code that can be compiled almost anywhere. Other native would-be competitors lose out because they are either not as fast or not as widely available on the different platforms. When productivity becomes more important than performance, these alternatives also get squeezed out by the managed and scripting languages.

As much as C and C++ excel at the things they’re good at, they also come with a lot of legacy cruft. C++ is a huge language, and it gets larger with each new standard. On the other hand, at least the compilers give you some freedom. Don’t want to use the STL? Roll out your own custom containers. Don’t want the cost/limitations of exception handling and RTTI? Compile with -fno-exceptions and -fno-rtti. Undefined behavior is another nasty issue which can rear its head, though compilers like Clang now feature additional tools to help catch and fix these errors. With data-oriented design and sensible error handling, C++ code can be both fast and maintainable.

Compiling C and C++ to the web

With tools like emscripten, you can now compile your C/C++ code to JavaScript and run it in a browser, and if you use the asm.js subset, it can actually run with very good performance, enough to run a modern 3D game using JavaScript and WebGL. I’ve always been skeptical of the whole “JavaScript everywhere” meme, because how can the web truly become an open computing platform by forcing the use of one language for everything? There’s no way a single language can be equally suitable for all tasks, and why would I want to develop a second code base just for the web? For this reason, I used to believe that Google’s Native Client held more promise, since it can run native code with almost no speed loss. Why use JavaScript when you can just execute directly on the CPU and bring your existing code over?

Now I see things a bit differently and I think that the asm.js approach has a lot of merit to it. NaCl has been around for years now, and it still only runs in Google Chrome, and then only on certain platforms and only if the software is distributed through the Chrome store, or if the user enables a developer flag. The asm.js approach, on the other end, can run on every browser that supports modern JavaScript. This approach is also portable, meaning it will work into the foreseeable future, even on new device architectures. NaCl, on the other hand, is limited to what was compiled. Portable NaCl is supposed to fix this, but it’s been a work-in-progress for years now, and given the experience with NaCl, it may never find its way to another browser besides Google Chrome. Combined with WebGL, compiling to JavaScript really opens up the web to a lot of new possibilities, one where you can deploy across the web without being tied to a single browser or plugin. The BananaBread demo shows just some of what is possible.

I’d like to learn more about writing OpenGL apps that can run on Android, iOS, and the web, all with a single code base in C++. I know that this is also possible with Java by using Google’s Web Toolkit and bytecode translators (after all, this is how libgdx does it), but I’d like to learn something different, outside of the Java sphere. Is this something that you guys would be interested in reading more of? This is all relatively new to me and I’m currently exploring, so as always, looking forward to your feedback. :)

Update: I am now developing an air hockey project here: Developing a Simple Game of Air Hockey Using C++ and OpenGL ES 2 for Android, iOS, and the Web

Share