A performance comparison redux: Java, C, and Renderscript on the Nexus 5

In my previous post on this topic, A performance comparison between Java and C on the Nexus 5, I compared the performance of an audio low-pass filter in Java and C. The results were clear: The C version outperformed, and by a significant amount. This result brought more attention to the post than I was expecting; some of you were also curious about RenderScript, and I’m pleased to say that Jean-Luc Brouillet, a member of Google’s RenderScript team, got in touch with me and generously volunteered an implementation of the DSP code in RenderScript.

With this new code, I refactored the code into a new benchmark with test audio data, so that I could compare the different implementations and verify their output. I’ll be sharing both the code and the results with you today.

Motivations and intentions

Some of you might be curious about why I am so interested in this subject. 🙂 I normally spend most of my development hours coding for Android, using Java; in fact, my first book, OpenGL ES 2 for Android: A Quick-Start Guide, is a beginner’s guide to OpenGL that focuses on Android and Java.

Normally, when I develop code, the most important questions on my mind are: “Is this easy to maintain?” “Is it correct?” “If I come back and revisit this code a month later, am I going to understand what the heck I was doing?” Since Java is the primary development language on Android, it just makes sense for me to do most of my development there.

So why the recent focus on native development? Here are two big reasons:

  • The performance of Java on Android isn’t suitable for everything. For critical performance paths, it can be a big competitive advantage to move that code over to native, so that it completes in less time and uses less battery.
  • I’m interested in branching out to other platforms down the road, probably starting with iOS, and I’m curious if it makes sense to share some code between iOS and Android using a common code base in C/C++. It’s important that this code runs without many abstractions in the way, so I’m not very interested in custom/proprietary solutions like Xamarin’s C# or an HTML5-based toolkit.

It’s starting to become clear to me that it can make sense to work with more than one language, and to choose these languages in situations where the benefits outweigh the cost. Trying to work with Android’s UI toolkit from C++ is painful; running a DSP filter from Java and watching it use more battery and take more time than it needs to is just as painful.

Our new test scenario

For this round of benchmarks, we’ll be comparing several different implementations of a low-pass IIR filter, implemented with coefficients generated with mkfilter. We’ll run a test audio file through each implementation, and record the best score for each.

How does the test work?

  1. First, we load a test audio file into memory.
  2. We then execute the DSP algorithm over the test audio, benchmarking the elapsed time. The data is processed in chunks to reflect the fact that this is similar to how we would process data coming off of the microphone.
  3. The results are written to a new audio file in the device’s default storage, under “PerformanceTest/”.

Here are our test implementations:

  1. Java. This is a straightforward implementation of the algorithm.
  2. Java (tuned). This is the same as 1, but with all of the functions manually inlined.
  3. C. This uses the Java Native Interface (JNI) to pass the data back and forth.
  4. RenderScript. A big thank you to Mr. Brouillet from the RenderScript team for taking the time to contribute this!

The tests were run on a Nexus 5 device running Android 4.4.3. Here are the results:

Results

Implementation Execution environment Compiler Shorts/second Relative run time
(lower is better)
C Dalvik JNI gcc 4.6 17,639,081 1.00
C Dalvik JNI gcc 4.8 16,516,757 1.07
RenderScript Dalvik RenderScript (API 19) 15,206,495 1.16
RenderScript Dalvik RenderScript (API 18) 13,234,397 1.33
C Dalvik JNI clang 3.4 13,208,408 1.34
Java (tuned) Art (Proguard) 7,235,607 2.44
Java (tuned) Art 7,097,363 2.49
Java (tuned) Dalvik 5,678,990 3.11
Java (tuned) Dalvik (Proguard) 5,365,814 3.29
Java Art (Proguard) 3,512,426 5.02
Java Art 3,049,986 5.78
Java Dalvik (Proguard) 1,220,600 14.45
Java Dalvik 1,083,015 16.29

 

For this test, the C implementation is the king of the hill, with gcc 4.6 giving the best performance. The gcc compiler is followed by RenderScript and clang 3.4, and the two Java implementations are at the back of the pack, with Dalvik giving the worst performance.

C

The C implementation compiled with gcc gave the best performance out of the entire group. All tests were done with -ffast-math and -O3, using the NDK r9d. Switching between Dalvik and ART had no impact on the C run times.

I’m not sure why there is still a large gap between clang and gcc; would everything on iOS run that much faster if Apple was using gcc?  Clang will likely continue to improve and I hope to see this gap closed in the future. I’m also curious about why gcc 4.6 seems to generate better code than 4.8. Perhaps someone familiar with ARM assembly and the compilers would be able to weigh in why?

Even though I’m a newbie at C and I learned about JNI in part by doing these benchmarks, I didn’t find the code overly difficult to write. There’s enough documentation out there that I was able to figure things out, and the algorithm output matches that of the other implementations; however, since C is an unsafe language, I’m not entirely convinced that I haven’t stumbled into undefined behaviour or otherwise done something insane. 🙂

RenderScript

In the previous post, someone asked about RenderScript, so I started working on an implementation. Unfortunately, I had zero experience with RenderScript at the time so I wasn’t able to get it working. Luckily, Jean-Luc Brouillet from the RenderScript team also saw the post and ported over the algorithm for me!

As you can see, the results are very promising: RenderScript offers better performance than clang and almost the same performance as gcc, without requiring use of the NDK or of JNI glue! As with C, switching between Dalvik and ART had no impact on the run times.

RenderScript also offers the possibility to easily parallelize the code and/or run it on the GPU which can potentially give a huge speedup, though unfortunately we weren’t able to take advantage of that here since this particular DSP algorithm is not trivially parallelizable. However, for other algorithms like a simple gain, RenderScript can give a significant boost with small changes to the code, and without having to worry about threading or other such headaches.

In my humble view, the RenderScript implementation does need some more polishing and the documentation needs to be significantly improved, as I doubt I would have gotten it working on my own without help. Here are some of the issues that I ran into with the RenderScript port:

  • Not all functions are documented. For example, the algorithm uses rsSetElementAt_short() which I can’t find anywhere except for some obscure C files in the Android source code.
  • The allocation functions are missing a way to copy data into an offset of an array. To work around this, I use a scratch buffer and System.arraycopy() to move the data around, and to keep things fair, I changed the other implementations to work in the same way. While this slows them down slightly, I don’t believe it’s an unfair advantage for RenderScript, because in real-world usage, I would expect to process the data coming off the microphone and write that directly into a file, not into an offset of some array.
  • The fastest RenderScript implementation only works on Android 4.4 KitKat devices. Going down one version to Android 4.3 changes the RenderScript API which requires me to change the code slightly, slowing things down for both 4.3 and 4.4. RenderScript does offer a “support” mode via the support API which should enable backwards compatibility, but I wasn’t able to get this to work for me for APIs older than 18 (Android 4.3).

So while there are some issues with RenderScript as implemented today, these are all issues that can hopefully be fixed. RenderScript also has the significant advantage of running code on the CPU and GPU in parallel, and doesn’t require JNI glue code. This makes it a serious contender to C, especially if portability to older devices or other platforms is not a big concern.

Java

As with last time, Java fills out the bottom of the pack. The performance is especially terrible with the default Dalvik implementation; in fact, it would be even worse if I hadn’t manually replaced the modulo operator with a bit mask, which I was hoping the compiler could do with the static information available to it, but it doesn’t.

Some people asked about Proguard, so I tried it out with the following config (full details in the test project):

-optimizationpasses 5
-allowaccessmodification
-dontpreverify

-dontusemixedcaseclassnames
-dontskipnonpubliclibraryclasses
-verbose

The results were mixed. Switching between Dalvik and ART made much more of a difference, as did manually inlining all of the functions together. The best result with Dalvik was without Proguard, and was 3.11x slower than the best C implementation. The best result with ART was with Proguard, and was 2.44x slower than the best C implementation. If we compare the normal Java version to the best C result, we get a 5.02x slowdown with ART and a 14.45x slowdown with Dalvik.

It does look like the performance of Java will be getting a lot better once ART becomes widely deployed; instead of huge slowdowns, we’ll be seeing between 3x and 5x, which does make a difference. I can already see the improvements when sorting and displaying ListViews in UI code, so this isn’t just something that affects niche code like audio DSP filters.

Desktop results (just for fun)

Just like last time, again, here are some desktop results, just for fun. 🙂 These tests were run on a 2.6 GHz Intel Core i7 running OS X 10.9.3.

Implementation Execution environment Compiler Shorts/second Relative speed
(higher is better)
C Java SE 1.6u65 JNI gcc 4.9 129,909,662 7.36
C Java SE 1.6u65 JNI clang 3.4 96,022,644 5.44
Java Java SE 1.8u5 (+XX:AggressiveOpts) 82,988,332 4.70
Java (tuned) Java SE 1.8u5 (+XX:AggressiveOpts) 79,288,025 4.50
Java Java SE 1.8u5 64,964,399 3.68
Java (tuned) Java SE 1.8u5 64,748,201 3.67
Java (tuned) Java SE 1.6u65 63,965,575 3.63
Java Java SE 1.6u65 53,245,864 3.02

 

As on the Nexus 5, the C implementation compiled with gcc dominates; however, I’m very impressed with where Java ended up!

C

I used the following compilers with optimization flags -march=native -ffast-math -O3:

  • Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
  • gcc version 4.9.0 20140416 (prerelease) (MacPorts gcc49 4.9-20140416_2)

As on the Nexus 5, gcc’s generated code is much faster than clang’s; perhaps this will change in the future but for now, gcc is still the king. I also find it interesting that the gap between the best run time here and the best run time on the Nexus 5 is similar to the gap between C and ART on the Nexus 5. Not so far apart, they are!

Java

I’m also impressed with the latest Java for OS X. While manually inlining all of the functions together was required for an improvement on Java 1.6, the manually-inlined version was actually slower on Java 1.8. This shows that not only is this sort of code abuse no longer required on the latest Java, but also that the compiler is smarter than we are at optimizing the code.

Adding +XX:AggressiveOpts to Java 1.8 sped things up even more, almost closing the gap with clang! That is very impressive in my eyes, since Java has an old reputation of being a slow language, but in some cases and situations, it can be almost as fast as C if not faster.

The worst Java performance is 2.43x slower than the best C performance, which is about the same relative difference as the best Java performance on Android with ART. Performance differences aren’t always just about language choice; they can also be very dependent on the quality of implementation. At this time, the Google team has made different trade-offs which place ART at around the same relative level of performance, for this specific test case, as Java 1.6. The improved performance of Java 1.8 on the desktop shows that it’s clearly possible to close up the gap on Android in the future.

Explore the code!

The project can be downloaded at GitHub: https://github.com/learnopengles/dsp-perf-test. To compile the code, download or clone the repository and import the projects into Eclipse with File->Import->Existing Projects Into Workspace. If the Android project is missing references, go to its properties, Java Build Path, Projects, and add “JavaPerformanceTest”.

The results are written to “PerformanceTest/” on the device’s default storage, so please double-check that you don’t have anything there before running the tests.

So, what do you think? Does it make sense to drop down into native code? Or are native languages a relic of the past, and there’s no reason to use anything other than modern, safe languages? I would love to hear your feedback.

A performance comparison between Java and C on the Nexus 5

Android phones have been growing ever more powerful with time, with the Nexus 5 sporting a quad-core 2.3 GHz Krait 400; this is a very powerful CPU for a mobile phone. With most Android apps being written in Java, does Java allow us to access all of that power? Or, put another way, is Java efficient enough, allowing tasks to complete more quickly and allowing the CPU to idle more, saving precious battery life?

(Note: An updated version of this comparison is available at A Performance Comparison Redux: Java, C, and Renderscript on the Nexus 5, along with source code).

In this post, I will take a look at a DSP filter adapted from coefficients generated with mkfilter, and compare three different implementations: one in C, one in Java, and one in Java with some manual optimizations. The source for these tests can be downloaded at the end of this post.

To compare the results, I ran the filter over an array of random data on the Nexus 5, and the compared the results to the fastest implementation. In the following table, a lower runtime is better, with the fastest implementation getting a relative runtime of 1.

Execution environment Options Relative runtime (lower is better)
gcc 4.8 1.00
gcc 4.8 (LOCAL_ARM_NEON := true) -ffast-math -O3 1.02
gcc 4.8 -ffast-math -O3 1.05
clang 3.4 (LOCAL_ARM_NEON := true) -ffast-math -O3 1.27
clang 3.4 -ffast-math -O3 1.42
clang 3.4 1.43
ART (manually-optimized) 2.22
Dalvik (manually-optimized) 2.87
ART (normal code) 7.99
Dalvik (normal code) 17.78

The statically-compiled C code gave the best execution times, followed by ART and then by Dalvik. The C code uses JNI via GetShortArrayRegion and SetShortArrayRegion to marshal the data from Java to C, and then back from C to Java once processing has completed.

The best performance came courtesy of GCC 4.8, with little variation between the different additional optimization options. Clang’s ARM builds are not quite as optimized as GCC’s; toggling LOCAL_ARM_NEON := true in the NDK makefile also makes a clear difference in performance.

Even the slowest native build using clang is not more than 43% slower than the best native build using gcc. Once we switch to Java, the variance starts to increase significantly, with the best runtime about 2.2x slower than native code, and the worst runtime a staggering 17.8x slower.

What explains the large difference? For one, it appears that both ART and Dalvik are limited in the amount of static optimizations that they are capable of. This is understandable in the case of Dalvik, since it uses a JIT and it’s also much older, but it is disappointing in the case of ART, since it uses ahead-of-time compilation.

Is there a way to speed up the Java code? I decided to try it out, by applying the same static optimizations I would have expected the compiler to do, like converting modulo to bit masks and inlining function calls. These changes resulted in one massive and hard to read function, but they also dramatically improved the runtime performance, with Dalvik speeding up from a 17.8x penalty to 2.9x, and ART speeding up from an 8.0x penalty to 2.2x.

The downside of this is that the code has to be abused to get this additional performance, and it still doesn’t come close to matching the ahead-of-time code generated by gcc and clang, which can surpass that performance without similar abuse of the code. The NDK is still a viable option for those looking for improved performance and more efficient code which consumes less battery over time.

Just for fun, I decided to try things out on a laptop with a 2.6 GHz Intel Core i7. For this table, the relative results are in the other direction, with 1x corresponding to the best time on the Nexus 5, 2x being twice as fast, and so on. The table starts with the best results first, as before.

Execution environment Options Relative speed (higher is better)
clang 3.4 -O3 -ffast-math -flto 8.38x
clang 3.4 -O3 -ffast-math 6.09x
Java SE 1.7u51 (manually-optimized) -XX:+AggressiveOpts 5.25x
Java SE 1.6u65 (manually-optimized) 3.85x
Java SE 1.6 (normal code) 2.44x

As on the Nexus 5, the C code runs faster, but to Java’s credit, the gap between the best & worst result is less than 4x, which is much less variance than we see with Dalvik or ART. Java 1.6 and 1.7 are very close to each other, unless “-XX:+AggressiveOpts” is used; with that option enabled, 1.7 is able to pull ahead.

There is still an unfortunate gap between the “normal” code and the manually-optimized code, which really should be closable with static analysis and inlining.

The other interesting result is that the gap between mobile and PC is closing over time, and even more so if you take power consumption into account. It’s quite impressive to see that as far as single-core performance goes, the PC and smartphone are closer than ever.

Conclusion

Recent Android devices are getting very powerful, and with the new ART runtime, common Java code can be executed quickly enough to keep user interfaces responsive and users happy.

Sometimes, though, we need to go further, and write demanding code that needs to run quickly and efficiently. With the latest Android devices, these algorithms may be able to run quickly enough in the Dalvik VM or with ART, but then we have to ask ourselves: is the benefit of using a single language worth the cost of lower performance? This isn’t just an academic question: lower performance means that we need to ask our users to give us more CPU cycles, which shortens their device’s battery life, heats up their phones, and makes them wait longer for results, and all because we didn’t want to write the code in another language.

For these reasons, writing some of our code in C/C++, FORTRAN, or another native language can still make a lot of sense.

For more reading on this topic, check out How Powerful is Your Nexus 7?

Source

dsp.c
#include "dsp.h"
#include <algorithm>
#include <cstdint>
#include <limits>

static constexpr int int16_min = std::numeric_limits<int16_t>::min();
static constexpr int int16_max = std::numeric_limits<int16_t>::max();

static inline int16_t clamp(int input)
{
     return std::max(int16_min, std::min(int16_max, input));
}

static inline int get_offset(const FilterState& filter_state, int relative_offset)
{
     return (filter_state.current + relative_offset) % filter_state.size;
}

static inline void push_sample(FilterState& filter_state, int16_t sample)
{
     filter_state.input[get_offset(filter_state, 0)] = sample;
     ++filter_state.current;
}

static inline int16_t get_output_sample(const FilterState& filter_state)
{
     return clamp(filter_state.output[get_offset(filter_state, 0)]);
}

static inline void apply_lowpass(FilterState& filter_state)
{
     double* x = filter_state.input;
     double* y = filter_state.output;

     y[get_offset(filter_state, 0)] =
       (  1.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state, -10)] + x[get_offset(filter_state,  -0)]))
     + ( 10.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -9)] + x[get_offset(filter_state,  -1)]))
     + ( 45.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -8)] + x[get_offset(filter_state,  -2)]))
     + (120.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -7)] + x[get_offset(filter_state,  -3)]))
     + (210.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -6)] + x[get_offset(filter_state,  -4)]))
     + (252.0 * (1.0 / 6.928330802e+06) *  x[get_offset(filter_state,  -5)])

     + (  -0.4441854896 * y[get_offset(filter_state, -10)])
     + (   4.2144719035 * y[get_offset(filter_state,  -9)])
     + ( -18.5365677633 * y[get_offset(filter_state,  -8)])
     + (  49.7394321983 * y[get_offset(filter_state,  -7)])
     + ( -90.1491003509 * y[get_offset(filter_state,  -6)])
     + ( 115.3235358151 * y[get_offset(filter_state,  -5)])
     + (-105.4969191433 * y[get_offset(filter_state,  -4)])
     + (  68.1964705422 * y[get_offset(filter_state,  -3)])
     + ( -29.8484881821 * y[get_offset(filter_state,  -2)])
     + (   8.0012026712 * y[get_offset(filter_state,  -1)]);
}

void apply_lowpass(FilterState& filter_state, const int16_t* input, int16_t* output, int length)
{
     for (int i = 0; i < length; ++i) {
          push_sample(filter_state, input[i]);
          apply_lowpass(filter_state);
          output[i] = get_output_sample(filter_state);
     }
}
dsp.h
#include <cstdint>

struct FilterState {
	static constexpr int size = 16;

    double input[size];
    double output[size];
	unsigned int current;

	FilterState() : input{}, output{}, current{} {}
};

void apply_lowpass(FilterState& filter_state, const int16_t* input, int16_t* output, int length);

Here is the Java adaptation of the C code:

package com.example.perftest;

import com.example.perftest.DspJavaManuallyOptimized.FilterState;

public class DspJava {
	public static class FilterState {
		static final int size = 16;

		final double input[] = new double[size];
		final double output[] = new double[size];

		int current;
	}

	static short clamp(short input) {
		return (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, input));
	}

	static int getOffset(FilterState filterState, int relativeOffset) {
		return ((filterState.current + relativeOffset) % FilterState.size + FilterState.size) % FilterState.size;
	}

	static void pushSample(FilterState filterState, short sample) {
		filterState.input[getOffset(filterState, 0)] = sample;
		++filterState.current;
	}

	static short getOutputSample(FilterState filterState) {
		return clamp((short) filterState.output[getOffset(filterState, 0)]);
	}
	
	static void applyLowpass(FilterState filterState) {
		final double[] x = filterState.input;
		final double[] y = filterState.output;

		y[getOffset(filterState, 0)] =
		   (  1.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState, -10)] + x[getOffset(filterState,  -0)]))
		 + ( 10.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -9)] + x[getOffset(filterState,  -1)]))
		 + ( 45.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -8)] + x[getOffset(filterState,  -2)]))
		 + (120.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -7)] + x[getOffset(filterState,  -3)]))
		 + (210.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -6)] + x[getOffset(filterState,  -4)]))
		 + (252.0 * (1.0 / 6.928330802e+06) *  x[getOffset(filterState,  -5)])

		 + (  -0.4441854896 * y[getOffset(filterState, -10)])
		 + (   4.2144719035 * y[getOffset(filterState,  -9)])
		 + ( -18.5365677633 * y[getOffset(filterState,  -8)])
		 + (  49.7394321983 * y[getOffset(filterState,  -7)])
		 + ( -90.1491003509 * y[getOffset(filterState,  -6)])
		 + ( 115.3235358151 * y[getOffset(filterState,  -5)])
		 + (-105.4969191433 * y[getOffset(filterState,  -4)])
		 + (  68.1964705422 * y[getOffset(filterState,  -3)])
		 + ( -29.8484881821 * y[getOffset(filterState,  -2)])
		 + (   8.0012026712 * y[getOffset(filterState,  -1)]);
	}

	public static void applyLowpass(FilterState filterState, short[] input, short[] output, int length) {
		for (int i = 0; i < length; ++i) {
			pushSample(filterState, input[i]);
			applyLowpass(filterState);
			output[i] = getOutputSample(filterState);
		}
	}
}

Since all of the Java runtimes tested don’t exploit static optimization opportunities as well as it seems that they could, here is an optimized version that has been inlined and has the modulo replaced with a bit mask:

package com.example.perftest;

public class DspJavaManuallyOptimized {
	public static class FilterState {
		static final int size = 16;

		final double input[] = new double[size];
		final double output[] = new double[size];

		int current;
	}

	public static void applyLowpass(FilterState filterState, short[] input, short[] output, int length) {
		for (int i = 0; i < length; ++i) {
			filterState.input[(filterState.current + 0) & (FilterState.size - 1)] = input[i];
			++filterState.current;
			final double[] x = filterState.input;
			final double[] y = filterState.output;

			y[(filterState.current + 0) & (FilterState.size - 1)] =
			   (  1.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -10) & (FilterState.size - 1)] + x[(filterState.current + -0) & (FilterState.size - 1)]))
			 + ( 10.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -9) & (FilterState.size - 1)] + x[(filterState.current + -1) & (FilterState.size - 1)]))
			 + ( 45.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -8) & (FilterState.size - 1)] + x[(filterState.current + -2) & (FilterState.size - 1)]))
			 + (120.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -7) & (FilterState.size - 1)] + x[(filterState.current + -3) & (FilterState.size - 1)]))
			 + (210.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -6) & (FilterState.size - 1)] + x[(filterState.current + -4) & (FilterState.size - 1)]))
			 + (252.0 * (1.0 / 6.928330802e+06) *  x[(filterState.current + -5) & (FilterState.size - 1)])

			 + (  -0.4441854896 * y[(filterState.current + -10) & (FilterState.size - 1)])
			 + (   4.2144719035 * y[(filterState.current + -9) & (FilterState.size - 1)])
			 + ( -18.5365677633 * y[(filterState.current + -8) & (FilterState.size - 1)])
			 + (  49.7394321983 * y[(filterState.current + -7) & (FilterState.size - 1)])
			 + ( -90.1491003509 * y[(filterState.current + -6) & (FilterState.size - 1)])
			 + ( 115.3235358151 * y[(filterState.current + -5) & (FilterState.size - 1)])
			 + (-105.4969191433 * y[(filterState.current + -4) & (FilterState.size - 1)])
			 + (  68.1964705422 * y[(filterState.current + -3) & (FilterState.size - 1)])
			 + ( -29.8484881821 * y[(filterState.current + -2) & (FilterState.size - 1)])
			 + (   8.0012026712 * y[(filterState.current + -1) & (FilterState.size - 1)]);
			output[i] = (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, (short) filterState.output[(filterState.current + 0) & (FilterState.size - 1)]));
		}
	}
}

How Powerful Is Your Nexus 7?

The following post is based on a paper generously contributed by Jerome Huck, a senior aerospace/defence engineer, scientist, and author. A link to figures and the code can be found at the bottom of this post.

So you want to run some heavy-duty algorithms on your Android device, and you’re wondering what is the best environment to use, and whether your Nexus 7 tablet would be up to the job. In this post, based upon a paper generously contributed by Jerome Huck, a senior aerospace engineer & scientist, we’ll take a look at a test involving some heavy-duty computational fluid dynamics equations, and we’ll compare the execution times on a PC and a Nexus 7 tablet.

Implementation languages

Which language & development environment is the best fit? The Eclipse SDK is one obvious choice to go, with development usually done in Java. Unlocking additional performance through native C & C++ code can also be done via the Native Development Kit (NDK), though this adds complexity due to the mixing of Java, C/C++, and JNI glue code.

What if you want to develop directly on your device? Thanks to the openness of Google Play, there are many options available. The Java AIDE will allow you to write, compile and run programs directly on your Android tablet.

For native performance, C/C++ are available through C4DROID and CCTOOLS, an implementation of the GNU GCC 4.8.1 compiler. Fortran is also available for download from CCTOOLS’s menu.

Python development is available via QPython or Kivy. QPython implements the Python language but is still in beta stage; the Kivy Launcher enables you to run Kivy applications, an open source Python library for rapid development. Kivy applications can also make use of innovative user interfaces, including multi-touch apps.

So, just how powerful is your Nexus 7? Java, Basic, C/C++, Python and Fortran all seem like good candidates to evaluate the power of a Nexus 7 with a test case.

The Test Case

The test developed by Jerome involves some heavy-duty math, of the type often employed by scientists and engineers. Here are the details, as specified by Jerome and edited for formatting within this post:

For evaluating the performance, let’s use a test case using computational fluid dynamics algorithms, including Navier-Stokes fluid equations, the Maxwell electromagnetism equations, forming the magnetohydrodynamics (MHD) set of equations. The original Fortran code was published in An Introduction to Computational Fluid Mechanics by Chuen-Yen-Chow, in 1983. The MHD stationary flow calculation is no longer included in the 2011 update by Biringen and Chow, but the details pertaining to the equations discretization, stability analysis, and so on can still be found in their Benard and Taylor instabilities example of the instationary solution of Navier-Stokes equations coupled with the temperature equation.

For simplicity, a stream-vorticity formulation is used. Standard boundary conditions, or even simplified ones, are used, with a value or derivative given. Discretization of the nonlinear terms in the Navier-Stokes, the one involving the velocity components, was, historically, a source of problems. The numerical scheme has to properly capture the flow direction.

Upwind differencing form solves this problem. The spatial difference is on the upwind side of the point indexed (i,j). This numerical scheme is only first order by reference to a Taylor series expansion. The second order upwind schemes introduces non physical behaviour, such as oscillations. Total Variation Diminishing (TVD) schemes are a cure to this problem. They introduce stable, non-oscillatory, high order schemes, preserving monotonicity, with no overshot or undershoot for the solution. They are the result of more than 30 years of research in CFD.

Only the upwind scheme was present in the original Fortran code. It was rewritten using a general TVD formulation. Corner Transport Upwind (CTU) was also added as an experiment, and not fully tested. Details can be in good CFD books such as An Introduction to Computational Fluid Dynamics: The Finite Volume Method (2nd Edition) by Versteeg and Malalasekera, or Finite Volume Methods for Hyperbolic Problems by Leveque.

The solution procedure is straightforward. The current flow, RH variable, is solved via the Laplace equation solver. Then the electromagnetic force, EM variable, is computed. Time stepping is used to find the solution of the flow until the convergence criteria are matched, error or maximum step. A Poisson solver is used.

Comments are given in the Fortran source code.

The results are presented for a Reynolds number of 50, a magnetic pressure number C of 0.3, using the upwind scheme.

Execution times on a PC

Before looking at the Nexus 7 results, let’s first compare the results on the PC. Here are the results that Jerome obtained on a i3 2.1 GHz laptop, running Windows 7 64-bit:

GNU Fortran 62ms
GNU GCC 78ms
Oracle Java JDK 7u45 150ms
PyPy 2.0 1020ms
Python 3.3.2 6780ms

For this particular run, Fortran is the best, with C a close second; the Java JDK also put in a good showing here. The interpreted languages are very disappointing, as expected.

Even with the slower execution times, some scientists are still moving some their code to Python; they want to benefit from the scripting capabilities of interpreted languages. Users don’t need to edit the source code to change the boundary equations or add a subroutine to solve a particular equation. FiPy, a finite volume code from NIST, uses this approach. However, most of the critical parts are still written in C or in Fortran.

Another approach is to use a dedicated language such the one implemented in FreeFem++, a partial differential equation solver. With this tool, a problem with one billion unknowns was solved in 2 minutes on the Curie Thin Node, a CEA machine. The CEA is the French Atomic Energy and Alternative Energies Commission.

What does the Nexus 7 has to offer?

Let’s now take a look at the results on a 1.2 GHz 2012 Nexus 7; the 2013 model, with its Qualcomm Snapdragon S4 Pro at 1.5 GHz, may boost these results a step further.

Fortran CCTOOLS with -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3 -03 70ms
Fortran CCTOOLS with -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3 -02 79ms
C99 C4DROID with -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3 -02 (64-bit floats) 120ms
C99 C4DROID with -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3 (32-bit floats) 380ms
C99 C4DROID with mfloat-abi=softfp (32-bit floats) 394ms
C99 C4DROID with -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3 (32-bit floats) 420ms
C99 C4DROID with mfloat-abi=softfp (64-bit floats) 450ms
C4DROID with -msoft-float (32-bit floats) 1163ms
C4DROID with -msoft-float (64-bit floats) 1500ms
Java compiled with Eclipse 1563ms
Java AIDE with dex optimizations 2100ms
Java AIDE 3030ms
QPython 24702ms

These are the best execution times. Some variance was seen with C4DROID, while CCTOOLS was more stable overall. As before, we can see the same ranking, with Fortran emerging as the leader, and C, Java, and Python following behind. With the proper compiler flags, CCTOOLS Fortran is even competitive against the PC, which is a very good result.

The Java results, on the other hand, are quite bad. Is it a fault of the Dalvik virtual machine? Results may improve with the ART runtime, but they’d have to improve dramatically to come close to the performance of optimized FORTRAN and C.

Python, with an execution time of over 24 seconds, can definitely be forgotten for serious scientific computations.

Verdict

The Nexus 7 2012 is very powerful on this particular test, when running Fortran or C code compiled to native machine code. Can these good results be extrapolated to more demanding programs, and software that needs more time to run?

The Nexus 7 tablets are very high-quality products, and Android is a smart and fun operating system to use. The 2012 model is already quite powerful, and the 2013 should see even better results; all that’s needed is a dedicated approach to unleash the power sleeping within those processors.

Paper, equations, and code

This blog post is based on work generously contributed by Jerome Huck, a senior aerospace/defence engineer, scientist, and author. Jerome graduated from the École nationale supérieure de l’aéronautique et de l’espace in Toulouse, and has worked on various projects including the Hermes space shuttle, Rafale fighter, and is the author of “The Fire of the Magicians“.

Android Lesson Eight: An Introduction to Index Buffer Objects (IBOs)

Android Lesson Eight: An Introduction to Index Buffer Objects (IBOs)In our last lesson, we learned how to use vertex buffer objects on Android. We learned about the difference between client-side memory and GPU-dedicated memory, and the difference between storing texture, position and normal data in separate buffers, or altogether in one buffer. We also learned how to work around Froyo’s broken OpenGL ES 2.0 bindings.

In this lesson, we’ll learn about index buffer objects, and go over a practical example of how to use them. Here’s what we’re going to cover:

  • The difference between using vertex buffer objects only, and using vertex buffer objects together with index buffer objects.
  • How to join together triangle strips using degenerate triangles, and render an entire height map in a single rendering call.

Let’s get started with the fundamental difference between vertex buffer objects and index buffer objects:

Vertex buffer objects and index buffer objects

In the previous lesson, we learned that a vertex buffer object is simply an array of vertex data which is directly rendered by OpenGL. We can use separate buffers for each attribute, such as positions and colors, or we can use a single buffer and interleave all of the data together. Contemporary articles suggest interleaving the data and making sure it’s aligned to 4-byte boundaries for better performance.

The downside to vertex buffers come when we use many of the same vertices over and over again. For example, a heightmap can be broken down into a series of triangle strips. Since each neighbour strip shares one row of vertices, we will end up repeating a lot of vertices with a vertex buffer.

Diagram of a heightmap using triangle strips and a vertex buffer object.

You can see a vertex buffer object containing the vertices for two triangle strip rows. The order of vertices is shown, and the triangles defined by these vertices is also shown, when drawn using glDrawArrays(GL_TRIANGLE_STRIP, …).  In this example, we’re assuming that each row of triangles gets sent as a separate call to glDrawArrays(). Each vertex contains data as follows:

vertexBuffer = {
    // Position - Vertex 1
    0, 0, 0,
    // Color
    1,1,1
    // Normal
    0,0,1,
    // Position - Vertex 2
    1, 0, 0,
    ... 
}

As you can see, the middle row of vertices needs to be sent twice, and this will also happen for each additional row of the height map. As our height map gets larger, our vertex buffer could end up having to repeat a lot of position, color, and normal data and consume a lot of additional memory.

How can we improve on this state of affairs? We can use an index buffer object. Instead of repeating vertices in the vertex buffer, we’ll define each vertex once, and only once. We’ll refer to these vertices using offsets into this vertex buffer, and when we need to reuse a vertex, we’ll repeat the offset instead of repeating the entire vertex. Here’s a visual illustration of the new vertex buffer:

Vertex buffer object for use with an index buffer object.

Notice that our vertices are no longer linked together into triangles. We’ll no longer pass our vertex buffer object directly; instead, we’ll use an index buffer to tie the vertices together. The index buffer will contain only the offsets to our vertex buffer object. If we wanted to draw a triangle strip using the above buffer, our index buffer would contain data as follows:

indexBuffer = {
    1, 6, 2, 7, 3, 8, ...
}

That’s how we’ll link everything together. When we repeat the middle row of vertices, we only repeat the number, instead of repeating the entire block of data. We’ll draw the index buffer with a call to glDrawElements(GL_TRIANGLE_STRIP, …).

Linking together triangle strips with degenerate triangles

The examples above assumed that we would render each row of the heightmap with a separate call to glDrawArrays() or glDrawElements(). How can we link up each row to the next? After all, the end of the first row is all the way on the right, and the beginning of the second row on the left. How do we link the two?

We don’t need to get fancy and start drawing right to left or anything like that. We can instead use what’s known as a degenerate triangle. A degenerate triangle is a triangle that has no area, and when the GPU encounters such triangles, it will simply skip over them.

Let’s look at our index buffer again:

indexBuffer = {
    1, 6, 2, 7, 3, 8, 4, 9, ...
}

When drawing with GL_TRIANGLE_STRIP, OpenGL will build triangles by taking each set of three vertices, advancing by one vertex for each triangle. Every subsequent triangle shares two vertices with the previous triangle. For example, here are the sets of vertices that would be grouped into triangles:

  • Triangle 1 = 1, 6, 2
  • Triangle 2 = 6, 2, 7
  • Triangle 3 = 2, 7, 3
  • Triangle 4 = 7, 3, 8
  • Triangle 5 = 3, 8, 4
  • Triangle 6 = 8, 4, 9

OpenGL also maintains a specific order, or winding when building the triangles. The order of the first 3 vertices determines the order for the rest. If the first triangle is counter-clockwise, the rest will be counter-clockwise, too. OpenGL does this by swapping the first two vertices of every even triangle (swapped vertices are in bold):

  • Triangle 1 = 1, 6, 2
  • Triangle 2 = 2, 6, 7
  • Triangle 3 = 2, 7, 3
  • Triangle 4 = 3, 7, 8
  • Triangle 5 = 3, 8, 4
  • Triangle 6 = 4, 8, 9

Let’s show the entire index buffer, including the missing link between each row of the height map:

indexBuffer = {
    1, 6, 2, 7, 3, 8, 4, 9, 5, 10, ..., 6, 11, 7, 12, 8, 13, 9, 14, 10, 15
}

What do we need to put in between, in order to link up the triangles? We’ll need an even number of new triangles in order to preserve the winding. We can do this by repeating the last vertex of the first row, and the first vertex of the second row. Here’s the new index buffer below, with the duplicated vertices in bold:

indexBuffer = {
    1, 6, 2, 7, 3, 8, 4, 9, 5, 10, 10, 6, 6, 11, 7, 12, 8, 13, 9, 14, 10, 15
}

Here’s what the new sequence of triangles looks like:

  • Triangle 8 = 5, 9, 10
  • Triangle 9 (degenerate) = 5, 10, 10
  • Triangle 10 (degenerate) = 10, 10, 6
  • Triangle 11 (degenerate) = 10, 6, 6
  • Triangle 12 (degenerate) = 6, 6, 11
  • Triangle 13 = 6, 11, 7

By repeating the last vertex and the first vertex, we created four degenerate triangles that will be skipped, and linked the first row of the height map with the second. We could link an arbitrary number of rows this way and draw the entire heightmap with one call to glDrawElements(). Let’s take a look at this visually:

Index buffer object for height map with degenerate triangles.

The degenerate triangles link each row with the next row.

Degenerate triangles done the wrong way

We need to repeat both the last vertex of the first row and the first of the second row. What would happen if we didn’t? Let’s say we only repeated one vertex:

indexBuffer = {
    1, 6, 2, 7, 3, 8, 4, 9, 5, 10, 10, 6, 11, 7, 12, 8, 13, 9, 14, 10, 15
}

Here’s what the sequence of triangles would look like:

  • Triangle 8 = 5, 9, 10
  • Triangle 9 (degenerate) = 5, 10, 10
  • Triangle 10 (degenerate) = 10, 10, 6
  • Triangle 11 = 10, 6, 11
  • Triangle 12 = 11, 6, 7

Triangle 11 starts at the right and cuts all the way across to the left, which isn’t what we wanted to happen. The winding is now also incorrect for the next row of triangles, since 3 new triangles were inserted, swapping even and odd.

A practical example

Let’s walk through the code to make it happen. I highly recommend heading over and reading Android Lesson Seven: An Introduction to Vertex Buffer Objects (VBOs) before continuing.

Do you remember those graph calculators, that could draw parabolas and other stuff on the screen? In this example, we’re going to draw a 3d parabola using a height map. We’ll walk through all of the code to build and draw the height map. First, let’s get started with the definitions:

class HeightMap {
	static final int SIZE_PER_SIDE = 32;
	static final float MIN_POSITION = -5f;
	static final float POSITION_RANGE = 10f;

	final int[] vbo = new int[1];
	final int[] ibo = new int[1];

	int indexCount;
  • We’ll set the height map to 32 units per side, for a total of 1,024 vertices and 1,922 triangles, not including degenerate triangles (The total number of triangles in a height map is equal to (2 * (units_per_side – 1)2)). The height map positions will range from -5 to +5.
  • The OpenGL reference to our vertex buffer object and index buffer object will go in vbo and ibo, respectively.
  • indexCount will hold the total number of generated indices.
Building the vertex data

Let’s look at the code that will build the vertex buffer object. Remember that we still need a place to hold all of our vertices, and that each vertex will be defined once, and only once.

HeightMap() {
	try {
		final int floatsPerVertex = POSITION_DATA_SIZE_IN_ELEMENTS + NORMAL_DATA_SIZE_IN_ELEMENTS
				+ COLOR_DATA_SIZE_IN_ELEMENTS;
		final int xLength = SIZE_PER_SIDE;
		final int yLength = SIZE_PER_SIDE;

		final float[] heightMapVertexData = new float[xLength * yLength * floatsPerVertex];

		int offset = 0;

		// First, build the data for the vertex buffer
		for (int y = 0; y < yLength; y++) {
			for (int x = 0; x < xLength; x++) {
				final float xRatio = x / (float) (xLength - 1);

				// Build our heightmap from the top down, so that our triangles are 
				// counter-clockwise.
				final float yRatio = 1f - (y / (float) (yLength - 1));

				final float xPosition = MIN_POSITION + (xRatio * POSITION_RANGE);
				final float yPosition = MIN_POSITION + (yRatio * POSITION_RANGE);

				...
			}
		}

This bit of code sets up the loop to generate the vertices. Before we can send data to OpenGL, we need to build it in Java’s memory, so we create a floating point array to hold the height map vertices. Each vertex will hold enough floats to contain all of the position, normal, and color information.

Inside the loop, we calculate a ratio between 0 and 1 for x and y. This xRatio and yRatio will then be used to calculate the current position for the next vertex.

Let’s take a look at the actual calculations within the loop:

// Position
heightMapVertexData[offset++] = xPosition;
heightMapVertexData[offset++] = yPosition;
heightMapVertexData[offset++] = ((xPosition * xPosition) + (yPosition * yPosition)) / 10f;

First up is the position. Since this is a 3D parabola, we’ll calculate the Z as X2 + Y2. We divide the result by 10 so the resulting parabola isn’t so steep.

// Cheap normal using a derivative of the function.
// The slope for X will be 2X, for Y will be 2Y.
// Divide by 10 since the position's Z is also divided by 10.
final float xSlope = (2 * xPosition) / 10f;
final float ySlope = (2 * yPosition) / 10f;

// Calculate the normal using the cross product of the slopes.
final float[] planeVectorX = {1f, 0f, xSlope};
final float[] planeVectorY = {0f, 1f, ySlope};
final float[] normalVector = {
		(planeVectorX[1] * planeVectorY[2]) - (planeVectorX[2] * planeVectorY[1]),
		(planeVectorX[2] * planeVectorY[0]) - (planeVectorX[0] * planeVectorY[2]),
		(planeVectorX[0] * planeVectorY[1]) - (planeVectorX[1] * planeVectorY[0])};

// Normalize the normal
final float length = Matrix.length(normalVector[0], normalVector[1], normalVector[2]);

heightMapVertexData[offset++] = normalVector[0] / length;
heightMapVertexData[offset++] = normalVector[1] / length;
heightMapVertexData[offset++] = normalVector[2] / length;

Next up is the normal calculation. As you’ll remember from our lesson on lighting, the normal will be used to calculate lighting. The normal of a surface is defined as a vector perpendicular to the tangent plane at that particular point. In other words, the normal should be an arrow pointing straight away from the surface. Here’s an visual example for a parabola:

Parabola with normals

The first thing we need is the tangent of the surface. Using a bit of calculus, (don’t worry, I didn’t remember it either and went and searched for an online calculator ;)) we know that we can get the tangent, or the slope from the derivative of the function. Since our function is X2 + Y2, our slopes will therefore be 2X and 2Y. We scale the slopes down by 10, since for the position we had also scaled the result of the function down by 10.

To calculate the normal, we create two vectors for each slope to define a plane, and we calculate the cross product to get the normal, which is the perpendicular vector.

We then normalize the normal by calculating its length, and dividing each component by the length. This ensures that the overall length will be equal to 1.

// Add some fancy colors.
heightMapVertexData[offset++] = xRatio;
heightMapVertexData[offset++] = yRatio;
heightMapVertexData[offset++] = 0.5f;
heightMapVertexData[offset++] = 1f;

Finally, we’ll set some fancy colors. Red will scale from 0 to 1 across the X axis, and green will scale from 0 to 1 across the Y axis. We add a bit of blue to brighten things up, and assign 1 to alpha.

Building the index data

The next step is to link all of these vertices together, using the index buffer.

// Now build the index data
final int numStripsRequired = yLength - 1;
final int numDegensRequired = 2 * (numStripsRequired - 1);
final int verticesPerStrip = 2 * xLength;

final short[] heightMapIndexData = new short[(verticesPerStrip * numStripsRequired)
		+ numDegensRequired];

offset = 0;

for (int y = 0; y < yLength - 1; y++) { 	 	if (y > 0) {
		// Degenerate begin: repeat first vertex
		heightMapIndexData[offset++] = (short) (y * yLength);
	}

	for (int x = 0; x < xLength; x++) {
		// One part of the strip
		heightMapIndexData[offset++] = (short) ((y * yLength) + x);
		heightMapIndexData[offset++] = (short) (((y + 1) * yLength) + x);
	}

	if (y < yLength - 2) {
		// Degenerate end: repeat last vertex
		heightMapIndexData[offset++] = (short) (((y + 1) * yLength) + (xLength - 1));
	}
}

indexCount = heightMapIndexData.length;

In OpenGL ES 2, an index buffer needs to be an array of unsigned bytes or shorts, so we use shorts here. We’ll read from two rows of vertices at a time, and build a triangle strip using those vertices. If we’re on the second or subsequent rows, we’ll duplicate the first vertex of that row, and if we’re on any row but the last, we’ll also duplicate the last vertex of that row. This is so we can link the rows together using degenerate triangles as described earlier.

We’re just assigning offsets here. Imagine we had the height map as shown earlier:

Vertex buffer object for use with an index buffer object.

With the index buffer, we want to end up with something like this:

Index buffer object for height map with degenerate triangles.

Our buffer will contain data like this:

heightMapIndexData = {1, 6, 2, 7, 3, 8, 4, 9, 5, 10, 10, 6, 6, 11, 7, 12, 8, 13, 9, 14, 10, 15}

Just keep in mind that although our examples start with 1, and go on to 2, 3, etc… in the actual code, our arrays should be 0-based and start with 0.

Uploading data into vertex and index buffer objects

The next step is to copy the data from Dalvik’s heap to a direct buffer on the native heap:

final FloatBuffer heightMapVertexDataBuffer = ByteBuffer
	.allocateDirect(heightMapVertexData.length * BYTES_PER_FLOAT).order(ByteOrder.nativeOrder())
	.asFloatBuffer();
heightMapVertexDataBuffer.put(heightMapVertexData).position(0);

final ShortBuffer heightMapIndexDataBuffer = ByteBuffer
	.allocateDirect(heightMapIndexData.length * BYTES_PER_SHORT).order(ByteOrder.nativeOrder())
	.asShortBuffer();
heightMapIndexDataBuffer.put(heightMapIndexData).position(0);

Remember, the index data needs to go in a short buffer or a byte buffer. Now we can create OpenGL buffers, and upload our data into the buffers:

GLES20.glGenBuffers(1, vbo, 0);
GLES20.glGenBuffers(1, ibo, 0);

if (vbo[0] > 0 && ibo[0] > 0) {
	GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, vbo[0]);
	GLES20.glBufferData(GLES20.GL_ARRAY_BUFFER, heightMapVertexDataBuffer.capacity()
							* BYTES_PER_FLOAT, heightMapVertexDataBuffer, GLES20.GL_STATIC_DRAW);

	GLES20.glBindBuffer(GLES20.GL_ELEMENT_ARRAY_BUFFER, ibo[0]);
	GLES20.glBufferData(GLES20.GL_ELEMENT_ARRAY_BUFFER, heightMapIndexDataBuffer.capacity()
							* BYTES_PER_SHORT, heightMapIndexDataBuffer, GLES20.GL_STATIC_DRAW);

	GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, 0);
	GLES20.glBindBuffer(GLES20.GL_ELEMENT_ARRAY_BUFFER, 0);
} else {
	errorHandler.handleError(ErrorType.BUFFER_CREATION_ERROR, "glGenBuffers");
}

We use GL_ARRAY_BUFFER to specify our vertex data, and GL_ELEMENT_ARRAY_BUFFER to specify our index data.

Drawing the height map

Much of the code to draw the height map will be similar as in previous lessons. I won’t cover the matrix setup code here; instead, we’ll just look at the calls to bind the array data and draw using the index buffer:

GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, vbo[0]);

// Bind Attributes
glEs20.glVertexAttribPointer(positionAttribute, POSITION_DATA_SIZE_IN_ELEMENTS, GLES20.GL_FLOAT,
		false, STRIDE, 0);
GLES20.glEnableVertexAttribArray(positionAttribute);

glEs20.glVertexAttribPointer(normalAttribute, NORMAL_DATA_SIZE_IN_ELEMENTS, GLES20.GL_FLOAT,
		false, STRIDE, POSITION_DATA_SIZE_IN_ELEMENTS * BYTES_PER_FLOAT);
GLES20.glEnableVertexAttribArray(normalAttribute);

glEs20.glVertexAttribPointer(colorAttribute, COLOR_DATA_SIZE_IN_ELEMENTS, GLES20.GL_FLOAT,
		false, STRIDE, (POSITION_DATA_SIZE_IN_ELEMENTS + NORMAL_DATA_SIZE_IN_ELEMENTS)
		* BYTES_PER_FLOAT);
GLES20.glEnableVertexAttribArray(colorAttribute);

// Draw
GLES20.glBindBuffer(GLES20.GL_ELEMENT_ARRAY_BUFFER, ibo[0]);
glEs20.glDrawElements(GLES20.GL_TRIANGLE_STRIP, indexCount, GLES20.GL_UNSIGNED_SHORT, 0);

GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, 0);
GLES20.glBindBuffer(GLES20.GL_ELEMENT_ARRAY_BUFFER, 0);

In the previous lesson, we had to introduce a custom OpenGL binding to properly use VBOs, as they’re broken in Froyo. We’ll also need to use this binding for IBOs as well. Just like in the previous lesson, we bind our position, normal, and color data in our vertex buffer to the matching attribute in the shader, taking care to pass in the proper stride and start offset of each attribute. Both the stride and the start offset are defined in terms of bytes.

The main difference is when it’s time to draw. We call glDrawElements() instead of glDrawArrays(), and we pass in the index count and data type. OpenGL ES 2.0 only accepts GL_UNSIGNED_SHORT and GL_UNSIGNED_BYTE, so we have to make sure that we define our index data using shorts or bytes.

Rendering and lighting two-sided triangles

For this lesson, we don’t enable GL_CULL_FACE so that both the front and the back sides of triangles are visible. We need to make a slight change to our fragment shader code so that lighting works properly for our two-sided triangles:

float diffuse;

if (gl_FrontFacing) {
    diffuse = max(dot(v_Normal, lightVector), 0.0);
} else {
    diffuse = max(dot(-v_Normal, lightVector), 0.0);
}

We use the special variable gl_FrontFacing to find out if the current fragment is part of a front-facing or back-facing triangle. If it’s front-facing, then no need to do anything special: the code is the same as before. If it’s back-facing, then we simply invert the surface normal (since the back side of a triangle faces the opposite direction) and proceed with the calculations as before.

Further exercises

When would it be a downside to use index buffers? Remember, when we use index buffers the GPU has to do an additional fetch into the vertex buffer object, so we’re introducing an additional step.

Wrapping up

The full source code for this lesson can be downloaded from the project site on GitHub.

Thanks for stopping by, and please feel free to check out the code and share your comments below.

Enhanced by Zemanta

Android Lesson Seven: An Introduction to Vertex Buffer Objects (VBOs)

A screenshot of Lesson Seven, showing the grid of cubes.In this lesson, we’ll introduce vertex buffer objects (VBOs), how to define them, and how to use them. Here is what we are going to cover:

  • How to define and render from vertex buffer objects.
  • The difference between using a single buffer with all the data packed in, or multiple buffers.
  • Problems and pitfalls, and what to do about them.

What are vertex buffer objects, and why use them?

Up until now, all of our lessons have been storing our object data in client-side memory, only transferring it into the GPU at render time. This is fine when there is not a lot of data to transfer, but as our scenes get more complex with more objects and triangles, this can impose an extra cost on the CPU and memory usage. What can we do about this? We can use vertex buffer objects. Instead of transferring vertex information from client memory every frame, the information will be transferred once and rendering will then be done from this graphics memory cache.

Assumptions and prerequisites

Please read Android Lesson One: Getting Started for an intro on how to upload the vertices from client-side memory. This understanding of how OpenGL ES works with the vertex arrays will be crucial to understanding this lesson.

Understanding client-side buffers in more detail

Once you understand how to render using client-side memory, it’s actually not too hard to switch to using VBOs. The main difference is that there is an additional step to upload the data into graphics memory, and an additional call to bind to this buffer when rendering.

This lesson has been setup to use four different modes:

  • Client side, separate buffers.
  • Client side, packed buffer.
  • Vertex buffer object, separate buffers.
  • Vertex buffer object, packed buffers.

Whether we are using vertex buffer objects or not, we need to first store our data in a client-side direct buffer. Recall from lesson one that OpenGL ES is a native system library, whereas Java on Android runs in a virtual machine. To bridge the gap, we need to use a set of special buffer classes to allocate memory on the native heap and make it accessible to OpenGL:

// Java array.
float[] cubePositions;
...
// Floating-point buffer
final FloatBuffer cubePositionsBuffer;
...

// Allocate a direct block of memory on the native heap,
// size in bytes is equal to cubePositions.length * BYTES_PER_FLOAT.
// BYTES_PER_FLOAT is equal to 4, since a float is 32-bits, or 4 bytes.
cubePositionsBuffer = ByteBuffer.allocateDirect(cubePositions.length * BYTES_PER_FLOAT)

// Floats can be in big-endian or little-endian order.
// We want the same as the native platform.
.order(ByteOrder.nativeOrder())

// Give us a floating-point view on this byte buffer.
.asFloatBuffer();

Transferring data from the Java heap to the native heap is then a matter of a couple calls:

// Copy data from the Java heap to the native heap.
cubePositionsBuffer.put(cubePositions)

// Reset the buffer position to the beginning of the buffer.
.position(0);

What is the purpose of the buffer position? Normally, Java does not give us a way to specify arbitrary locations in memory using pointer arithmetic. However, setting the position of the buffer is functionally equivalent to changing the value of a pointer to a block of memory. By changing the position, we can pass arbitrary memory locations within our buffer to OpenGL calls. This will come in handy when we work with packed buffers.

Once the data is on the native heap, we no longer need to keep the float[] array around, and we can let the garbage collector clean it up.

Rendering with client-side buffers is straightforward to setup. We just need to enable using vertex arrays on that attribute, and pass a pointer to our data:

// Pass in the position information
GLES20.glEnableVertexAttribArray(mPositionHandle);
GLES20.glVertexAttribPointer(mPositionHandle, POSITION_DATA_SIZE,
	GLES20.GL_FLOAT, false, 0, mCubePositions);

Explanation of the parameters to glVertexAttribPointer:

  • mPositionHandle: The OpenGL index of the position attribute of our shader program.
  • POSITION_DATA_SIZE: How many elements (floats) define this attribute.
  • GL_FLOAT: The type of each element.
  • false: Should fixed-point data be normalized? Not applicable since we are using floating-point data.
  • 0: The stride. Set to 0 to mean that the positions should be read sequentially.
  • mCubePositions: The pointer to our buffer, containing all of the positional data.
Working with packed buffers

Working with packed buffers is very similar, except that instead of using a buffer each for positions, normals, etc… one buffer will contain all of this data. The difference looks like this:

Using separate buffers

positions = X,Y,Z, X, Y, Z, X, Y, Z, …
colors = R, G, B, A, R, G, B, A, …
textureCoordinates = S, T, S, T, S, T, …

Using a packed buffer

buffer = X, Y, Z, R, G, B, A, S, T, …

The advantage to using packed buffers is that it should be more efficient for the GPU to render, since all of the information needed to render a triangle is located within the same block of memory. The disadvantage is that it may be more difficult and slower to update, if you are using dynamic data.

When we use packed buffers, we need to change our rendering calls in a couple of ways. First, we need to tell OpenGL the stride, or how many bytes define a vertex.

final int stride = (POSITION_DATA_SIZE + NORMAL_DATA_SIZE + TEXTURE_COORDINATE_DATA_SIZE)
	* BYTES_PER_FLOAT;

// Pass in the position information
mCubeBuffer.position(0);
GLES20.glEnableVertexAttribArray(mPositionHandle);
GLES20.glVertexAttribPointer(mPositionHandle, POSITION_DATA_SIZE,
	GLES20.GL_FLOAT, false, stride, mCubeBuffer);

// Pass in the normal information
mCubeBuffer.position(POSITION_DATA_SIZE);
GLES20.glEnableVertexAttribArray(mNormalHandle);
GLES20.glVertexAttribPointer(mNormalHandle, NORMAL_DATA_SIZE,
	GLES20.GL_FLOAT, false, stride, mCubeBuffer);
...

The stride tells OpenGL ES how far it needs to go to find the same attribute for the next vertex. For example, if element 0 is the beginning of the position for the first vertex, and there are 8 elements per vertex, then the stride will be equal to 8 elements, or 32 bytes. The position for the next vertex will be found at element 8, and the next vertex after that at element 16, and so on.

Keep in mind that the value of the stride passed to glVertexAttribPointer should be in bytes, not elements, so remember to do that conversion.

Notice that we also change the start position of the buffer when we switch from specifying the positions to the normals. This is the pointer arithmetic I was referring to before, and this is how we can do it in Java when working with OpenGL ES. We’re still working with the same buffer, mCubeBuffer, but we tell OpenGL to start reading in the normals at the first element after the position. Again, we pass in the stride to tell OpenGL that the next normal will be found 8 elements or 32 bytes later.

Dalvik and memory on the native heap

If you allocate a lot of memory on the native heap and release it, you will probably run into the beloved OutOfMemoryError, sooner or later. There are a couple of reasons behind that:

  1. You might think that you’ve released the memory by letting the reference go out of scope, but native memory seems to take a few extra GC cycles to be completely cleaned up, and Dalvik will throw an exception if there is not enough free memory available and the native memory has not yet been released.
  2. The native heap can become fragmented. Calls to allocateDirect() will inexplicably fail, even though there appears to be plenty of memory available. Sometimes it helps to make a smaller allocation, free it, and then try the larger allocation again.

What can you do about these problems? Not much, other than hoping that Google improves the behaviour of Dalvik in future editions (they’ve added a largeHeap parameter to 3.0+), or manage the heap yourself by doing your allocations in native code or allocating a huge block upfront, and spinning off buffers based off of that.

Note: this information was originally written in early 2012, and now Android uses a different runtime called ART which may not suffer from these problems to the same degree.

Moving to vertex buffer objects

Now that we’ve reviewed working with client-side buffers, let’s move on to vertex buffer objects! First, we need to review a few very important points:

1. Buffers must be created within a valid OpenGL context.

This might seem like an obvious point, but it’s just a reminder that you have to wait until onSurfaceCreated(), and you have to take care that the OpenGL ES calls are done on the GL thread. See this document: OpenGL ES Programming Guide for iOS. It might be written for iOS, but the behaviour of OpenGL ES is similar on Android.

2. Improper use of vertex buffer objects will crash the graphics driver.

You need to be careful with the data you pass around when you use vertex buffer objects. Improper values will cause a native crash in the OpenGL ES system library or in the graphics driver library. On my Nexus S, some games freeze up my phone completely or cause it to reboot, because the graphics driver is crashing on their commands. Not all crashes will lock up your device, but at a minimum you will not see the “This application has stopped working” dialogue. Your activity will restart without warning, and the only info you’ll get might be a native debug trace in the logs.

3. The OpenGL ES bindings are broken on Froyo (2.2), and incomplete/unavailable in earlier versions.

This is the most unfortunate and most important point to consider. For some reason, Google really dropped the ball when it comes to OpenGL ES 2 support on Froyo. The mappings are incomplete, and several crucial functions needed to use vertex buffer objects are unavailable and cannot be used from Java code, at least with the standard SDK.

I don’t know if it’s because they didn’t run their unit tests, or if the developer was sloppy with their code generation tools, or if everyone was on 8 cups of coffee and burning the midnight oil to get things out the door. I don’t know why the API is broken, but the fact is that it’s broken.

There are three solutions to this problem:

  1. Target Gingerbread (2.3) and higher.
  2. Don’t use vertex buffer objects.
  3. Use your own Java Native Interface (JNI) library to interface with the native OpenGL ES system libraries.

I find option 1 to be unacceptable, since a full quarter of devices out there still run on Froyo as of the time of this writing. Option 2 works, but is kind of silly.

Note: This article was originally written in early 2012, when many devices were still on Froyo. As of 2017, this is no longer an issue and the most reasonable option is to target Gingerbread or later.

The option I recommend, and that I have decided to go with, is to use your own JNI bindings. For this lesson I have decided to go with the bindings generously provided by the guys who created libgdx, a cross-platform game development library licensed under the Apache License 2.0. You need to use the following files to make it work:

  • /libs/armeabi/libandroidgl20.so
  • /libs/armeabi-v7a/libandroidgl20.so
  • src/com/badlogic/gdx/backends/android/AndroidGL20.java
  • src/com/badlogic/gdx/graphics/GL20.java
  • src/com/badlogic/gdx/graphics/GLCommon.java

You might notice that this excludes Android platforms that do not run on ARM, and you’d be right. It would probably be possible to compile your own bindings for those platforms if you want to have VBO support on Froyo, though that is out of the scope of this lesson.

Using the bindings is as simple as these lines of code:

AndroidGL20 mGlEs20 = new AndroidGL20();
...
mGlEs20.glVertexAttribPointer(mPositionHandle, POSITION_DATA_SIZE, GLES20.GL_FLOAT, false, 0, 0);
...

You only need to call the custom binding where the SDK-provided binding is incomplete. I use the custom bindings to fill in the holes where the official one is missing functions.

Uploading vertex data to the GPU.

To upload data to the GPU, we need to follow the same steps in creating a client-side buffer as before:

...
cubePositionsBuffer = ByteBuffer.allocateDirect(cubePositions.length * BYTES_PER_FLOAT)
.order(ByteOrder.nativeOrder()).asFloatBuffer();
cubePositionsBuffer.put(cubePositions).position(0);
...

Once we have the client-side buffer, we can create a vertex buffer object and upload data from client memory to the GPU with the following commands:

// First, generate as many buffers as we need.
// This will give us the OpenGL handles for these buffers.
final int buffers[] = new int[3];
GLES20.glGenBuffers(3, buffers, 0);

// Bind to the buffer. Future commands will affect this buffer specifically.
GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, buffers[0]);

// Transfer data from client memory to the buffer.
// We can release the client memory after this call.
GLES20.glBufferData(GLES20.GL_ARRAY_BUFFER, cubePositionsBuffer.capacity() * BYTES_PER_FLOAT,
	cubePositionsBuffer, GLES20.GL_STATIC_DRAW);

// IMPORTANT: Unbind from the buffer when we're done with it.
GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, 0);

Once data has been uploaded to OpenGL ES, we can release the client-side memory as we no longer need to keep it around. Here is an explanation of glBufferData:

  • GL_ARRAY_BUFFER: This buffer contains an array of vertex data.
  • cubePositionsBuffer.capacity() * BYTES_PER_FLOAT: The number of bytes this buffer should contain.
  • cubePositionsBuffer: The source that will be copied to this vertex buffer object.
  • GL_STATIC_DRAW: The buffer will not be updated dynamically.

Our call to glVertexAttribPointer looks a little bit different, as the last parameter is now an offset rather than a pointer to our client-side memory:

// Pass in the position information
GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, mCubePositionsBufferIdx);
GLES20.glEnableVertexAttribArray(mPositionHandle);
mGlEs20.glVertexAttribPointer(mPositionHandle, POSITION_DATA_SIZE, GLES20.GL_FLOAT, false, 0, 0);
...

Like before, we bind to the buffer, then enable the vertex array. Since the buffer is already bound, we only need to tell OpenGL the offset to start at when reading from the buffer. Since we are using separate buffers, we pass in an offset of 0. Notice also that we are using our custom binding to call glVertexAttribPointer, since the official SDK is missing this specific function call.

Once we are done drawing with our buffer, we should unbind from it:

GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, 0);

When we no longer want to keep our buffers around, we can free the memory:

final int[] buffersToDelete = new int[] { mCubePositionsBufferIdx, mCubeNormalsBufferIdx,
	mCubeTexCoordsBufferIdx };
GLES20.glDeleteBuffers(buffersToDelete.length, buffersToDelete, 0);
Packed vertex buffer objects

We can also use a single, packed vertex buffer object to hold all of our vertex data. The creation of a packed buffer is the same as above, with the only difference being that we start from a packed client-side buffer. Rendering from the packed buffer is also the same, except we need to pass in a stride and an offset, like when using packed buffers in client-side memory:

final int stride = (POSITION_DATA_SIZE + NORMAL_DATA_SIZE + TEXTURE_COORDINATE_DATA_SIZE)
	* BYTES_PER_FLOAT;

// Pass in the position information
GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, mCubeBufferIdx);
GLES20.glEnableVertexAttribArray(mPositionHandle);
mGlEs20.glVertexAttribPointer(mPositionHandle, POSITION_DATA_SIZE, 
	GLES20.GL_FLOAT, false, stride, 0);

// Pass in the normal information
GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, mCubeBufferIdx);
GLES20.glEnableVertexAttribArray(mNormalHandle);
mGlEs20.glVertexAttribPointer(mNormalHandle, NORMAL_DATA_SIZE, 
	GLES20.GL_FLOAT, false, stride, POSITION_DATA_SIZE * BYTES_PER_FLOAT);
...

Notice that the offset needs to be specified in bytes. The same considerations of unbinding and deleting the buffer apply, as before.

Putting it all together

This lesson is setup so that it builds a cube of cubes, with the same number of cubes in each dimension. It will build a cube of cubes between 1x1x1 cubes, and 16x16x16 cubes. Since each cube shares the same normal and texture data, this data will be copied repeatedly when we initialize our client-side buffer. All of the cubes will end up inside the same buffer objects.

You can view the code for the lesson and view an example of rendering with and without VBOs, and with and without packed buffers. Check the code to see how some of the following was handled:

  • Posting events from the OpenGL thread back to the main UI thread, via runOnUiThread.
  • Generating the vertex data asynchronously.
  • Handling out of memory errors.
  • We removed the call to glEnable(GL_TEXTURE_2D), since that is actually an invalid enum on OpenGL ES 2. This is a hold over from the fixed pipeline days; In OpenGL ES 2 this stuff is handled by shaders, so no need to use a glEnable/glDisable.
  • How to render using different paths, without adding too many if statements and conditions.
Further exercises

When would you use vertex buffers and when is it better to stream data from client memory? What are some of the drawbacks of using vertex buffer objects? How would you improve the asynchronous loading code?

Wrapping up

The full source code for this lesson can be downloaded from the project site on GitHub.

Thanks for stopping by, and please feel free to check out the code and share your comments below. A special thanks goes out again to the guys at libgdx for generously providing the source code and libraries for their OpenGL ES 2 bindings for Android 2.2!

Enhanced by Zemanta