A performance comparison redux: Java, C, and Renderscript on the Nexus 5

In my previous post on this topic, A performance comparison between Java and C on the Nexus 5, I compared the performance of an audio low-pass filter in Java and C. The results were clear: the C version outperformed the Java version by a significant amount. This result brought more attention to the post than I was expecting; some of you were also curious about RenderScript, and I’m pleased to say that Jean-Luc Brouillet, a member of Google’s RenderScript team, got in touch with me and generously volunteered an implementation of the DSP code in RenderScript.

With this new implementation in hand, I refactored everything into a new benchmark that uses test audio data, so that I could compare the different implementations and verify their output. I’ll be sharing both the code and the results with you today.

Motivations and intentions

Some of you might be curious about why I am so interested in this subject. 🙂 I normally spend most of my development hours coding for Android, using Java; in fact, my first book, OpenGL ES 2 for Android: A Quick-Start Guide, is a beginner’s guide to OpenGL that focuses on Android and Java.

Normally, when I develop code, the most important questions on my mind are: “Is this easy to maintain?” “Is it correct?” “If I come back and revisit this code a month later, am I going to understand what the heck I was doing?” Since Java is the primary development language on Android, it just makes sense for me to do most of my development there.

So why the recent focus on native development? Here are two big reasons:

  • The performance of Java on Android isn’t suitable for everything. For critical performance paths, it can be a big competitive advantage to move that code over to native, so that it completes in less time and uses less battery.
  • I’m interested in branching out to other platforms down the road, probably starting with iOS, and I’m curious if it makes sense to share some code between iOS and Android using a common code base in C/C++. It’s important that this code runs without many abstractions in the way, so I’m not very interested in custom/proprietary solutions like Xamarin’s C# or an HTML5-based toolkit.

It’s starting to become clear to me that it can make sense to work with more than one language, and to choose these languages in situations where the benefits outweigh the cost. Trying to work with Android’s UI toolkit from C++ is painful; running a DSP filter from Java and watching it use more battery and take more time than it needs to is just as painful.

Our new test scenario

For this round of benchmarks, we’ll be comparing several different implementations of a low-pass IIR filter, implemented with coefficients generated with mkfilter. We’ll run a test audio file through each implementation, and record the best score for each.

How does the test work?

  1. First, we load a test audio file into memory.
  2. We then execute the DSP algorithm over the test audio, benchmarking the elapsed time. The data is processed in chunks, similar to how we would process data coming off of the microphone.
  3. The results are written to a new audio file in the device’s default storage, under “PerformanceTest/”.

Here are our test implementations:

  1. Java. This is a straightforward implementation of the algorithm.
  2. Java (tuned). This is the same as 1, but with all of the functions manually inlined.
  3. C. This uses the Java Native Interface (JNI) to pass the data back and forth.
  4. RenderScript. A big thank you to Mr. Brouillet from the RenderScript team for taking the time to contribute this!

The tests were run on a Nexus 5 device running Android 4.4.3. Here are the results:

Results

Implementation | Execution environment | Compiler              | Shorts/second | Relative run time (lower is better)
C              | Dalvik (JNI)          | gcc 4.6               | 17,639,081    | 1.00
C              | Dalvik (JNI)          | gcc 4.8               | 16,516,757    | 1.07
RenderScript   | Dalvik                | RenderScript (API 19) | 15,206,495    | 1.16
RenderScript   | Dalvik                | RenderScript (API 18) | 13,234,397    | 1.33
C              | Dalvik (JNI)          | clang 3.4             | 13,208,408    | 1.34
Java (tuned)   | ART (Proguard)        |                       | 7,235,607     | 2.44
Java (tuned)   | ART                   |                       | 7,097,363     | 2.49
Java (tuned)   | Dalvik                |                       | 5,678,990     | 3.11
Java (tuned)   | Dalvik (Proguard)     |                       | 5,365,814     | 3.29
Java           | ART (Proguard)        |                       | 3,512,426     | 5.02
Java           | ART                   |                       | 3,049,986     | 5.78
Java           | Dalvik (Proguard)     |                       | 1,220,600     | 14.45
Java           | Dalvik                |                       | 1,083,015     | 16.29

For this test, the C implementation is the king of the hill, with gcc 4.6 giving the best performance. The gcc compiler is followed by RenderScript and clang 3.4, and the two Java implementations are at the back of the pack, with Dalvik giving the worst performance.

C

The C implementation compiled with gcc gave the best performance out of the entire group. All tests were done with -ffast-math and -O3, using the NDK r9d. Switching between Dalvik and ART had no impact on the C run times.

I’m not sure why there is still a large gap between clang and gcc; would everything on iOS run that much faster if Apple were using gcc? Clang will likely continue to improve, and I hope to see this gap closed in the future. I’m also curious about why gcc 4.6 seems to generate better code than 4.8. Perhaps someone familiar with ARM assembly and these compilers would be able to weigh in on why?

Even though I’m a newbie at C and I learned about JNI in part by doing these benchmarks, I didn’t find the code overly difficult to write. There’s enough documentation out there that I was able to figure things out, and the algorithm output matches that of the other implementations; however, since C is an unsafe language, I’m not entirely convinced that I haven’t stumbled into undefined behaviour or otherwise done something insane. 🙂

RenderScript

In the previous post, someone asked about RenderScript, so I started working on an implementation. Unfortunately, I had zero experience with RenderScript at the time so I wasn’t able to get it working. Luckily, Jean-Luc Brouillet from the RenderScript team also saw the post and ported over the algorithm for me!

As you can see, the results are very promising: RenderScript offers better performance than clang and almost the same performance as gcc, without requiring use of the NDK or of JNI glue! As with C, switching between Dalvik and ART had no impact on the run times.

RenderScript also makes it easy to parallelize the code and/or run it on the GPU, which can potentially give a huge speedup; unfortunately, we weren’t able to take advantage of that here, since this particular DSP algorithm is not trivially parallelizable. However, for other algorithms like a simple gain, RenderScript can give a significant boost with small changes to the code, and without having to worry about threading or other such headaches.
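
To make that concrete, here is a minimal sketch of what a gain kernel could look like in RenderScript. This is purely illustrative and not part of the benchmark project: the file name, package name, kernel name, and gain value are all made up, and it assumes the RS_KERNEL kernel style that is available when targeting API 17 and above.

// gain.rs (hypothetical file, not from the benchmark project)
#pragma version(1)
#pragma rs java_package_name(com.example.perftest)

// Script global; the Java side can change this via the generated set_gain().
float gain = 0.5f;

short RS_KERNEL applyGain(short in) {
    // Scale the sample and clamp it back into the 16-bit range.
    float result = in * gain;
    if (result > 32767.0f) result = 32767.0f;
    if (result < -32768.0f) result = -32768.0f;
    return (short)result;
}

On the Java side, calling the generated forEach_applyGain() method lets the framework split the input allocation across the available cores, so no explicit threading code is needed.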

In my humble view, the RenderScript implementation does need some more polishing and the documentation needs to be significantly improved, as I doubt I would have gotten it working on my own without help. Here are some of the issues that I ran into with the RenderScript port:

  • Not all functions are documented. For example, the algorithm uses rsSetElementAt_short() which I can’t find anywhere except for some obscure C files in the Android source code.
  • The allocation functions are missing a way to copy data into an offset of an array. To work around this, I use a scratch buffer and System.arraycopy() to move the data around, and to keep things fair, I changed the other implementations to work in the same way. While this slows them down slightly, I don’t believe it’s an unfair advantage for RenderScript, because in real-world usage, I would expect to process the data coming off the microphone and write that directly into a file, not into an offset of some array.
  • The fastest RenderScript implementation only works on Android 4.4 KitKat devices. Going down one version to Android 4.3 changes the RenderScript API which requires me to change the code slightly, slowing things down for both 4.3 and 4.4. RenderScript does offer a “support” mode via the support API which should enable backwards compatibility, but I wasn’t able to get this to work for me for APIs older than 18 (Android 4.3).

So while there are some issues with RenderScript as implemented today, these are all issues that can hopefully be fixed. RenderScript also has the significant advantage of running code on the CPU and GPU in parallel, and doesn’t require JNI glue code. This makes it a serious contender to C, especially if portability to older devices or other platforms is not a big concern.

Java

As with last time, Java fills out the bottom of the pack. The performance is especially terrible with the default Dalvik implementation; in fact, it would be even worse if I hadn’t manually replaced the modulo operator with a bit mask, an optimization I was hoping the compiler would do on its own with the static information available to it, but it doesn’t.
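
As a quick illustration of that particular optimization (this is not the project’s code, just the general technique), the two forms below produce the same result only because the buffer size is a power of two and the index is non-negative:

/* Equivalent only for a power-of-two size (here 16) and a non-negative index. */
static inline int wrap_mod(int index)  { return index % 16; }         /* uses a division */
static inline int wrap_mask(int index) { return index & (16 - 1); }   /* a single AND */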

Some people asked about Proguard, so I tried it out with the following config (full details in the test project):

-optimizationpasses 5
-allowaccessmodification
-dontpreverify

-dontusemixedcaseclassnames
-dontskipnonpubliclibraryclasses
-verbose

The results were mixed. Switching between Dalvik and ART made much more of a difference, as did manually inlining all of the functions together. The best result with Dalvik was without Proguard, and was 3.11x slower than the best C implementation. The best result with ART was with Proguard, and was 2.44x slower than the best C implementation. If we compare the normal Java version to the best C result, we get a 5.02x slowdown with ART and a 14.45x slowdown with Dalvik.

It does look like the performance of Java will be getting a lot better once ART becomes widely deployed; instead of huge slowdowns, we’ll be seeing between 3x and 5x, which does make a difference. I can already see the improvements when sorting and displaying ListViews in UI code, so this isn’t just something that affects niche code like audio DSP filters.

Desktop results (just for fun)

Just like last time, here are some desktop results, just for fun. 🙂 These tests were run on a 2.6 GHz Intel Core i7 running OS X 10.9.3.

Implementation | Execution environment               | Compiler  | Shorts/second | Relative speed (higher is better)
C              | Java SE 1.6u65 (JNI)                | gcc 4.9   | 129,909,662   | 7.36
C              | Java SE 1.6u65 (JNI)                | clang 3.4 | 96,022,644    | 5.44
Java           | Java SE 1.8u5 (-XX:+AggressiveOpts) |           | 82,988,332    | 4.70
Java (tuned)   | Java SE 1.8u5 (-XX:+AggressiveOpts) |           | 79,288,025    | 4.50
Java           | Java SE 1.8u5                       |           | 64,964,399    | 3.68
Java (tuned)   | Java SE 1.8u5                       |           | 64,748,201    | 3.67
Java (tuned)   | Java SE 1.6u65                      |           | 63,965,575    | 3.63
Java           | Java SE 1.6u65                      |           | 53,245,864    | 3.02

As on the Nexus 5, the C implementation compiled with gcc dominates; however, I’m very impressed with where Java ended up!

C

I used the following compilers with optimization flags -march=native -ffast-math -O3:

  • Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
  • gcc version 4.9.0 20140416 (prerelease) (MacPorts gcc49 4.9-20140416_2)

As on the Nexus 5, gcc’s generated code is much faster than clang’s; perhaps this will change in the future but for now, gcc is still the king. I also find it interesting that the gap between the best run time here and the best run time on the Nexus 5 is similar to the gap between C and ART on the Nexus 5. Not so far apart, they are!

Java

I’m also impressed with the latest Java for OS X. While manually inlining all of the functions together was required for an improvement on Java 1.6, the manually-inlined version was actually slower on Java 1.8. This shows that not only is this sort of code abuse no longer required on the latest Java, but also that the compiler is smarter than we are at optimizing the code.

Adding -XX:+AggressiveOpts to Java 1.8 sped things up even more, almost closing the gap with clang! That is very impressive in my eyes, since Java has an old reputation of being a slow language, yet in some situations it can be almost as fast as C, if not faster.

The worst Java performance is 2.43x slower than the best C performance, which is about the same relative difference as the best Java performance on Android with ART. Performance differences aren’t always just about language choice; they can also be very dependent on the quality of implementation. At this time, the Google team has made different trade-offs which place ART at around the same relative level of performance, for this specific test case, as Java 1.6. The improved performance of Java 1.8 on the desktop shows that it’s clearly possible to close up the gap on Android in the future.

Explore the code!

The project can be downloaded at GitHub: https://github.com/learnopengles/dsp-perf-test. To compile the code, download or clone the repository and import the projects into Eclipse with File->Import->Existing Projects Into Workspace. If the Android project is missing references, go to its properties, Java Build Path, Projects, and add “JavaPerformanceTest”.

The results are written to “PerformanceTest/” on the device’s default storage, so please double-check that you don’t have anything there before running the tests.

So, what do you think? Does it make sense to drop down into native code? Or are native languages a relic of the past, and there’s no reason to use anything other than modern, safe languages? I would love to hear your feedback.

OpenGL Roundup, April 10, 2014: GDC 2014 Report, libgdx 1.0, Data-Oriented Design and More…

Top stories

GDC 2014 Report

libgdx: We’ll go 1.0 next weekend!

Recent posts

A Performance Comparison Between Java and C on the Nexus 5

How Powerful Is Your Nexus 7?

Finishing up Our Native Air Hockey Project with Touch Events and Basic Collision Detection

Android native development

Android on x86: Java Native Interface and the Android Native Development Kit 

jnigen wiki page

Wrapping a C++ library with JNI – introduction

Game industry & development

How In-app Purchases Have Destroyed The Industry

How in-app purchase is not really destroying the games industry

How to become a Graphics Programmer in the games industry

The indie roadmap

You Don’t Need Millions of Dollars

Online books and references

Data-Oriented Design

Game Programming Patterns

Platform/GFX/MobileGPUs

OpenGL articles & tutorials

OpenGL dumb mistakes: the mysterious Perfect Circular Hole

GLKit to the max: OpenGL ES 2.0 for iOS

Web development

Asset loading in emscripten and PNaCl

Compiling to the Web

First 3D Commercial Web Game Powered By asm.js Unveiled

On Asm.js

Playing With Emscripten and ASM.js

Misc

Farewell DirectX

Modern C++: What you need to know

Never Again in Graphics: Unforgivable graphic curses.

Support RoboVM (and get Java 8 and other Goodies)

A performance comparison between Java and C on the Nexus 5

Android phones have been growing ever more powerful with time, with the Nexus 5 sporting a quad-core 2.3 GHz Krait 400; this is a very powerful CPU for a mobile phone. With most Android apps being written in Java, does Java allow us to access all of that power? Or, put another way, is Java efficient enough, allowing tasks to complete more quickly and allowing the CPU to idle more, saving precious battery life?

(Note: An updated version of this comparison is available at A Performance Comparison Redux: Java, C, and Renderscript on the Nexus 5, along with source code).

In this post, I will take a look at a DSP filter adapted from coefficients generated with mkfilter, and compare three different implementations: one in C, one in Java, and one in Java with some manual optimizations. The source for these tests can be downloaded at the end of this post.

To compare the results, I ran the filter over an array of random data on the Nexus 5, and then compared the results against the fastest implementation. In the following table, a lower runtime is better, with the fastest implementation getting a relative runtime of 1.

Execution environment              | Options         | Relative runtime (lower is better)
gcc 4.8                            |                 | 1.00
gcc 4.8 (LOCAL_ARM_NEON := true)   | -ffast-math -O3 | 1.02
gcc 4.8                            | -ffast-math -O3 | 1.05
clang 3.4 (LOCAL_ARM_NEON := true) | -ffast-math -O3 | 1.27
clang 3.4                          | -ffast-math -O3 | 1.42
clang 3.4                          |                 | 1.43
ART (manually-optimized)           |                 | 2.22
Dalvik (manually-optimized)        |                 | 2.87
ART (normal code)                  |                 | 7.99
Dalvik (normal code)               |                 | 17.78

The statically-compiled C code gave the best execution times, followed by ART and then by Dalvik. The C code uses JNI via GetShortArrayRegion and SetShortArrayRegion to marshal the data from Java to C, and then back from C to Java once processing has completed.
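
For readers who haven’t used this part of JNI before, here is a rough sketch of what that marshalling can look like. The class, method, and parameter names are hypothetical and simplified, and how the filter state gets created and passed around is glossed over (the actual glue code is in the downloadable source), but GetShortArrayRegion and SetShortArrayRegion are the real JNI calls involved:

// A simplified JNI bridge sketch (hypothetical names; error handling omitted).
#include <jni.h>
#include <vector>
#include "dsp.h"

extern "C" JNIEXPORT void JNICALL
Java_com_example_perftest_DspNative_applyLowpass(
        JNIEnv* env, jclass /* clazz */, jlong filter_state_ptr,
        jshortArray java_input, jshortArray java_output, jint length) {
    FilterState& filter_state = *reinterpret_cast<FilterState*>(filter_state_ptr);
    std::vector<int16_t> input(length);
    std::vector<int16_t> output(length);

    // Copy the Java short[] into a native buffer...
    env->GetShortArrayRegion(java_input, 0, length,
                             reinterpret_cast<jshort*>(input.data()));

    // ...run the filter natively...
    apply_lowpass(filter_state, input.data(), output.data(), length);

    // ...and copy the results back into the Java output array.
    env->SetShortArrayRegion(java_output, 0, length,
                             reinterpret_cast<jshort*>(output.data()));
}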

The best performance came courtesy of GCC 4.8, with little variation between the different additional optimization options. Clang’s ARM builds are not quite as optimized as GCC’s; toggling LOCAL_ARM_NEON := true in the NDK makefile also makes a clear difference in performance.

Even the slowest native build using clang is not more than 43% slower than the best native build using gcc. Once we switch to Java, the variance starts to increase significantly, with the best runtime about 2.2x slower than native code, and the worst runtime a staggering 17.8x slower.

What explains the large difference? For one, it appears that both ART and Dalvik are limited in the amount of static optimizations that they are capable of. This is understandable in the case of Dalvik, since it uses a JIT and it’s also much older, but it is disappointing in the case of ART, since it uses ahead-of-time compilation.

Is there a way to speed up the Java code? I decided to try it out, by applying the same static optimizations I would have expected the compiler to do, like converting modulo to bit masks and inlining function calls. These changes resulted in one massive and hard to read function, but they also dramatically improved the runtime performance, with Dalvik speeding up from a 17.8x penalty to 2.9x, and ART speeding up from an 8.0x penalty to 2.2x.

The downside of this is that the code has to be abused to get this additional performance, and it still doesn’t come close to matching the ahead-of-time code generated by gcc and clang, which can surpass that performance without similar abuse of the code. The NDK is still a viable option for those looking for improved performance and more efficient code which consumes less battery over time.

Just for fun, I decided to try things out on a laptop with a 2.6 GHz Intel Core i7. For this table, the relative results are in the other direction, with 1x corresponding to the best time on the Nexus 5, 2x being twice as fast, and so on. The table starts with the best results first, as before.

Execution environment               | Options               | Relative speed (higher is better)
clang 3.4                           | -O3 -ffast-math -flto | 8.38x
clang 3.4                           | -O3 -ffast-math       | 6.09x
Java SE 1.7u51 (manually-optimized) | -XX:+AggressiveOpts   | 5.25x
Java SE 1.6u65 (manually-optimized) |                       | 3.85x
Java SE 1.6 (normal code)           |                       | 2.44x

As on the Nexus 5, the C code runs faster, but to Java’s credit, the gap between the best & worst result is less than 4x, which is much less variance than we see with Dalvik or ART. Java 1.6 and 1.7 are very close to each other, unless “-XX:+AggressiveOpts” is used; with that option enabled, 1.7 is able to pull ahead.

There is still an unfortunate gap between the “normal” code and the manually-optimized code, which really should be closable with static analysis and inlining.

The other interesting result is that the gap between mobile and PC is closing over time, and even more so if you take power consumption into account. It’s quite impressive to see that as far as single-core performance goes, the PC and smartphone are closer than ever.

Conclusion

Recent Android devices are getting very powerful, and with the new ART runtime, common Java code can be executed quickly enough to keep user interfaces responsive and users happy.

Sometimes, though, we need to go further, and write demanding code that needs to run quickly and efficiently. With the latest Android devices, these algorithms may be able to run quickly enough in the Dalvik VM or with ART, but then we have to ask ourselves: is the benefit of using a single language worth the cost of lower performance? This isn’t just an academic question: lower performance means that we need to ask our users to give us more CPU cycles, which shortens their device’s battery life, heats up their phones, and makes them wait longer for results, and all because we didn’t want to write the code in another language.

For these reasons, writing some of our code in C/C++, FORTRAN, or another native language can still make a lot of sense.

For more reading on this topic, check out How Powerful is Your Nexus 7?

Source

dsp.c
#include "dsp.h"
#include <algorithm>
#include <cstdint>
#include <limits>

static constexpr int int16_min = std::numeric_limits<int16_t>::min();
static constexpr int int16_max = std::numeric_limits<int16_t>::max();

static inline int16_t clamp(int input)
{
     return std::max(int16_min, std::min(int16_max, input));
}

static inline int get_offset(const FilterState& filter_state, int relative_offset)
{
     return (filter_state.current + relative_offset) % filter_state.size;
}

static inline void push_sample(FilterState& filter_state, int16_t sample)
{
     filter_state.input[get_offset(filter_state, 0)] = sample;
     ++filter_state.current;
}

static inline int16_t get_output_sample(const FilterState& filter_state)
{
     return clamp(filter_state.output[get_offset(filter_state, 0)]);
}

static inline void apply_lowpass(FilterState& filter_state)
{
     double* x = filter_state.input;
     double* y = filter_state.output;

     y[get_offset(filter_state, 0)] =
       (  1.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state, -10)] + x[get_offset(filter_state,  -0)]))
     + ( 10.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -9)] + x[get_offset(filter_state,  -1)]))
     + ( 45.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -8)] + x[get_offset(filter_state,  -2)]))
     + (120.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -7)] + x[get_offset(filter_state,  -3)]))
     + (210.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -6)] + x[get_offset(filter_state,  -4)]))
     + (252.0 * (1.0 / 6.928330802e+06) *  x[get_offset(filter_state,  -5)])

     + (  -0.4441854896 * y[get_offset(filter_state, -10)])
     + (   4.2144719035 * y[get_offset(filter_state,  -9)])
     + ( -18.5365677633 * y[get_offset(filter_state,  -8)])
     + (  49.7394321983 * y[get_offset(filter_state,  -7)])
     + ( -90.1491003509 * y[get_offset(filter_state,  -6)])
     + ( 115.3235358151 * y[get_offset(filter_state,  -5)])
     + (-105.4969191433 * y[get_offset(filter_state,  -4)])
     + (  68.1964705422 * y[get_offset(filter_state,  -3)])
     + ( -29.8484881821 * y[get_offset(filter_state,  -2)])
     + (   8.0012026712 * y[get_offset(filter_state,  -1)]);
}

void apply_lowpass(FilterState& filter_state, const int16_t* input, int16_t* output, int length)
{
     for (int i = 0; i < length; ++i) {
          push_sample(filter_state, input[i]);
          apply_lowpass(filter_state);
          output[i] = get_output_sample(filter_state);
     }
}
dsp.h
#include <cstdint>

struct FilterState {
	static constexpr int size = 16;

    double input[size];
    double output[size];
	unsigned int current;

	FilterState() : input{}, output{}, current{} {}
};

void apply_lowpass(FilterState& filter_state, const int16_t* input, int16_t* output, int length);

Here is the Java adaptation of the C code:

package com.example.perftest;

public class DspJava {
	public static class FilterState {
		static final int size = 16;

		final double input[] = new double[size];
		final double output[] = new double[size];

		int current;
	}

	static short clamp(short input) {
		return (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, input));
	}

	static int getOffset(FilterState filterState, int relativeOffset) {
		return ((filterState.current + relativeOffset) % FilterState.size + FilterState.size) % FilterState.size;
	}

	static void pushSample(FilterState filterState, short sample) {
		filterState.input[getOffset(filterState, 0)] = sample;
		++filterState.current;
	}

	static short getOutputSample(FilterState filterState) {
		return clamp((short) filterState.output[getOffset(filterState, 0)]);
	}
	
	static void applyLowpass(FilterState filterState) {
		final double[] x = filterState.input;
		final double[] y = filterState.output;

		y[getOffset(filterState, 0)] =
		   (  1.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState, -10)] + x[getOffset(filterState,  -0)]))
		 + ( 10.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -9)] + x[getOffset(filterState,  -1)]))
		 + ( 45.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -8)] + x[getOffset(filterState,  -2)]))
		 + (120.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -7)] + x[getOffset(filterState,  -3)]))
		 + (210.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -6)] + x[getOffset(filterState,  -4)]))
		 + (252.0 * (1.0 / 6.928330802e+06) *  x[getOffset(filterState,  -5)])

		 + (  -0.4441854896 * y[getOffset(filterState, -10)])
		 + (   4.2144719035 * y[getOffset(filterState,  -9)])
		 + ( -18.5365677633 * y[getOffset(filterState,  -8)])
		 + (  49.7394321983 * y[getOffset(filterState,  -7)])
		 + ( -90.1491003509 * y[getOffset(filterState,  -6)])
		 + ( 115.3235358151 * y[getOffset(filterState,  -5)])
		 + (-105.4969191433 * y[getOffset(filterState,  -4)])
		 + (  68.1964705422 * y[getOffset(filterState,  -3)])
		 + ( -29.8484881821 * y[getOffset(filterState,  -2)])
		 + (   8.0012026712 * y[getOffset(filterState,  -1)]);
	}

	public static void applyLowpass(FilterState filterState, short[] input, short[] output, int length) {
		for (int i = 0; i < length; ++i) {
			pushSample(filterState, input[i]);
			applyLowpass(filterState);
			output[i] = getOutputSample(filterState);
		}
	}
}

Since none of the Java runtimes tested exploit static optimization opportunities as well as it seems they could, here is an optimized version that has been manually inlined and has the modulo replaced with a bit mask:

package com.example.perftest;

public class DspJavaManuallyOptimized {
	public static class FilterState {
		static final int size = 16;

		final double input[] = new double[size];
		final double output[] = new double[size];

		int current;
	}

	public static void applyLowpass(FilterState filterState, short[] input, short[] output, int length) {
		for (int i = 0; i < length; ++i) {
			filterState.input[(filterState.current + 0) & (FilterState.size - 1)] = input[i];
			++filterState.current;
			final double[] x = filterState.input;
			final double[] y = filterState.output;

			y[(filterState.current + 0) & (FilterState.size - 1)] =
			   (  1.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -10) & (FilterState.size - 1)] + x[(filterState.current + -0) & (FilterState.size - 1)]))
			 + ( 10.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -9) & (FilterState.size - 1)] + x[(filterState.current + -1) & (FilterState.size - 1)]))
			 + ( 45.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -8) & (FilterState.size - 1)] + x[(filterState.current + -2) & (FilterState.size - 1)]))
			 + (120.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -7) & (FilterState.size - 1)] + x[(filterState.current + -3) & (FilterState.size - 1)]))
			 + (210.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -6) & (FilterState.size - 1)] + x[(filterState.current + -4) & (FilterState.size - 1)]))
			 + (252.0 * (1.0 / 6.928330802e+06) *  x[(filterState.current + -5) & (FilterState.size - 1)])

			 + (  -0.4441854896 * y[(filterState.current + -10) & (FilterState.size - 1)])
			 + (   4.2144719035 * y[(filterState.current + -9) & (FilterState.size - 1)])
			 + ( -18.5365677633 * y[(filterState.current + -8) & (FilterState.size - 1)])
			 + (  49.7394321983 * y[(filterState.current + -7) & (FilterState.size - 1)])
			 + ( -90.1491003509 * y[(filterState.current + -6) & (FilterState.size - 1)])
			 + ( 115.3235358151 * y[(filterState.current + -5) & (FilterState.size - 1)])
			 + (-105.4969191433 * y[(filterState.current + -4) & (FilterState.size - 1)])
			 + (  68.1964705422 * y[(filterState.current + -3) & (FilterState.size - 1)])
			 + ( -29.8484881821 * y[(filterState.current + -2) & (FilterState.size - 1)])
			 + (   8.0012026712 * y[(filterState.current + -1) & (FilterState.size - 1)]);
			output[i] = (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, (short) filterState.output[(filterState.current + 0) & (FilterState.size - 1)]));
		}
	}
}

How Powerful Is Your Nexus 7?

The following post is based on a paper generously contributed by Jerome Huck, a senior aerospace/defence engineer, scientist, and author. A link to figures and the code can be found at the bottom of this post.

So you want to run some heavy-duty algorithms on your Android device, and you’re wondering which environment is the best fit, and whether your Nexus 7 tablet would be up to the job. In this post, we’ll take a look at a test involving some heavy-duty computational fluid dynamics equations, and compare the execution times on a PC and on a Nexus 7 tablet.

Implementation languages

Which language & development environment is the best fit? The Eclipse SDK is one obvious choice, with development usually done in Java. Unlocking additional performance through native C & C++ code can also be done via the Native Development Kit (NDK), though this adds complexity due to the mixing of Java, C/C++, and JNI glue code.

What if you want to develop directly on your device? Thanks to the openness of Google Play, there are many options available. The Java AIDE will allow you to write, compile and run programs directly on your Android tablet.

For native performance, C/C++ are available through C4DROID and CCTOOLS, an implementation of the GNU GCC 4.8.1 compiler. Fortran is also available for download from CCTOOLS’s menu.

Python development is available via QPython or Kivy. QPython implements the Python language but is still in beta; the Kivy Launcher enables you to run applications built with Kivy, an open source Python library for rapid development. Kivy applications can also make use of innovative user interfaces, including multi-touch.

So, just how powerful is your Nexus 7? Java, Basic, C/C++, Python and Fortran all seem like good candidates to evaluate the power of a Nexus 7 with a test case.

The Test Case

The test developed by Jerome involves some heavy-duty math, of the type often employed by scientists and engineers. Here are the details, as specified by Jerome and edited for formatting within this post:

For evaluating the performance, let’s use a test case based on computational fluid dynamics algorithms, combining the Navier-Stokes fluid equations with Maxwell’s electromagnetism equations to form the magnetohydrodynamics (MHD) set of equations. The original Fortran code was published in An Introduction to Computational Fluid Mechanics by Chuen-Yen Chow, in 1983. The MHD stationary flow calculation is no longer included in the 2011 update by Biringen and Chow, but the details pertaining to the discretization of the equations, stability analysis, and so on can still be found in their Benard and Taylor instabilities example, an unsteady solution of the Navier-Stokes equations coupled with the temperature equation.

For simplicity, a stream function-vorticity formulation is used. Standard boundary conditions, or even simplified ones, are used, with a value or a derivative given. Discretization of the nonlinear terms in the Navier-Stokes equations, the ones involving the velocity components, was historically a source of problems. The numerical scheme has to properly capture the flow direction.

Upwind differencing solves this problem: the spatial difference is taken on the upwind side of the point indexed (i,j). This numerical scheme is only first order with respect to a Taylor series expansion, and second-order upwind schemes introduce non-physical behaviour, such as oscillations. Total Variation Diminishing (TVD) schemes are a cure to this problem. They provide stable, non-oscillatory, high-order schemes that preserve monotonicity, with no overshoot or undershoot in the solution. They are the result of more than 30 years of research in CFD.

Only the upwind scheme was present in the original Fortran code. It was rewritten using a general TVD formulation. Corner Transport Upwind (CTU) was also added as an experiment, and is not fully tested. Details can be found in good CFD books such as An Introduction to Computational Fluid Dynamics: The Finite Volume Method (2nd Edition) by Versteeg and Malalasekera, or Finite Volume Methods for Hyperbolic Problems by LeVeque.

The solution procedure is straightforward. The current flow (the RH variable) is solved via the Laplace equation solver, and then the electromagnetic force (the EM variable) is computed. Time stepping is used to advance the flow solution until the convergence criteria are met (error tolerance or maximum number of steps). A Poisson solver is used.

Comments are given in the Fortran source code.

The results are presented for a Reynolds number of 50, a magnetic pressure number C of 0.3, using the upwind scheme.

Execution times on a PC

Before looking at the Nexus 7 results, let’s first compare the results on the PC. Here are the results that Jerome obtained on a i3 2.1 GHz laptop, running Windows 7 64-bit:

GNU Fortran          |   62 ms
GNU GCC              |   78 ms
Oracle Java JDK 7u45 |  150 ms
PyPy 2.0             | 1020 ms
Python 3.3.2         | 6780 ms

For this particular run, Fortran is the best, with C a close second; the Java JDK also put in a good showing here. The interpreted languages are very disappointing, as expected.

Even with the slower execution times, some scientists are still moving some of their code to Python; they want to benefit from the scripting capabilities of interpreted languages. Users don’t need to edit the source code to change the boundary equations or add a subroutine to solve a particular equation. FiPy, a finite volume code from NIST, uses this approach. However, most of the critical parts are still written in C or in Fortran.

Another approach is to use a dedicated language such as the one implemented in FreeFem++, a partial differential equation solver. With this tool, a problem with one billion unknowns was solved in 2 minutes on the Curie Thin Node, a CEA machine. The CEA is the French Atomic Energy and Alternative Energies Commission.

What does the Nexus 7 have to offer?

Let’s now take a look at the results on a 1.2 GHz 2012 Nexus 7; the 2013 model, with its Qualcomm Snapdragon S4 Pro at 1.5 GHz, may boost these results a step further.

Fortran (CCTOOLS) with -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3 -O3             |    70 ms
Fortran (CCTOOLS) with -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3 -O2             |    79 ms
C99 (C4DROID) with -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3 -O2 (64-bit floats) |   120 ms
C99 (C4DROID) with -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3 (32-bit floats)     |   380 ms
C99 (C4DROID) with -mfloat-abi=softfp (32-bit floats)                                |   394 ms
C99 (C4DROID) with -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3 (32-bit floats)     |   420 ms
C99 (C4DROID) with -mfloat-abi=softfp (64-bit floats)                                |   450 ms
C4DROID with -msoft-float (32-bit floats)                                            |  1163 ms
C4DROID with -msoft-float (64-bit floats)                                            |  1500 ms
Java compiled with Eclipse                                                           |  1563 ms
Java AIDE with dex optimizations                                                     |  2100 ms
Java AIDE                                                                            |  3030 ms
QPython                                                                              | 24702 ms

These are the best execution times. Some variance was seen with C4DROID, while CCTOOLS was more stable overall. As before, we can see the same ranking, with Fortran emerging as the leader, and C, Java, and Python following behind. With the proper compiler flags, CCTOOLS Fortran is even competitive against the PC, which is a very good result.

The Java results, on the other hand, are quite bad. Is it a fault of the Dalvik virtual machine? Results may improve with the ART runtime, but they’d have to improve dramatically to come close to the performance of optimized FORTRAN and C.

Python, with an execution time of over 24 seconds, can definitely be forgotten for serious scientific computations.

Verdict

The Nexus 7 2012 is very powerful on this particular test, when running Fortran or C code compiled to native machine code. Can these good results be extrapolated to more demanding programs, and software that needs more time to run?

The Nexus 7 tablets are very high-quality products, and Android is a smart and fun operating system to use. The 2012 model is already quite powerful, and the 2013 should see even better results; all that’s needed is a dedicated approach to unleash the power sleeping within those processors.

Paper, equations, and code

This blog post is based on work generously contributed by Jerome Huck, a senior aerospace/defence engineer, scientist, and author. Jerome graduated from the École nationale supérieure de l’aéronautique et de l’espace in Toulouse, and has worked on various projects including the Hermes space shuttle, Rafale fighter, and is the author of “The Fire of the Magicians“.

Loading a PNG into Memory and Displaying It as a Texture with OpenGL ES 2, Using (Almost) the Same Code on iOS, Android, and Emscripten

In the last post in this series, we set up a system to render OpenGL to Android, iOS, and the web via WebGL and emscripten. In this post, we’ll expand on that work and add support for PNG loading, shaders, and VBOs.

TL;DR

We can put most of our common code into a core folder, and call into that core from a main loop in our platform-specific code. By taking advantage of open source libraries like libpng and zlib, most of our code can remain platform independent. In this post, we cover the new core code and the new Android platform-specific code.

To check out the completed project for this part of the series, head over to GitHub and download the files for ‘article-2-loading-png-file’.

Prerequisites

Before we begin, you may want to check out the previous posts in this series so that you can get the right tools installed and configured on your local development machine:

You can set up a local git repository with all of the code by cloning ‘article-1-clearing-the-screen’ or by downloading it as a ZIP from GitHub: https://github.com/learnopengles/airhockey/tree/article-1-clearing-the-screen.

For a “friendlier” introduction to OpenGL ES 2 using Java as the development language of choice, you can also check out Android Lesson One: Getting Started or OpenGL ES 2 for Android: A Quick-Start Guide.

Updating the platform-independent code

In this section, we’ll cover all of the new changes to the platform-independent core code that we’ll be making to support the new features. The first thing that we’ll do is move things around, so that they follow this new structure:

/src/common => rename to /src/core

/src/android => rename to /src/platform/android

/src/ios => rename to /src/platform/ios

/src/emscripten => rename to /src/platform/emscripten

We’ll also rename glwrapper.h to platform_gl.h for all platforms. This will help to keep our source code more organized as we add more features and source files.

To start off, let’s cover all of the source files that go into /src/core.

Loading vertex buffer objects

Let’s begin with buffer.h:

#include "platform_gl.h"

#define BUFFER_OFFSET(i) ((void*)(i))

GLuint create_vbo(const GLsizeiptr size, const GLvoid* data, const GLenum usage);

We’ll use create_vbo to upload data into a vertex buffer object. BUFFER_OFFSET() is a helper macro that we’ll use to pass the right offsets to glVertexAttribPointer().
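
As a quick usage sketch (hypothetical only — the vertex data and the attribute location variable below are made up, not part of the project), this is the pattern the rendering code will follow:

/* Hypothetical usage: upload a small quad and point an attribute at it. */
static const float vertices[] = {
    -0.5f, -0.5f,
     0.5f, -0.5f,
    -0.5f,  0.5f,
     0.5f,  0.5f};

GLuint vbo = create_vbo(sizeof(vertices), vertices, GL_STATIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(a_position_location, 2, GL_FLOAT, GL_FALSE, 0, BUFFER_OFFSET(0));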

Let’s follow up with the implementation in buffer.c:

#include "buffer.h"
#include "platform_gl.h"
#include <assert.h>
#include <stdlib.h>

GLuint create_vbo(const GLsizeiptr size, const GLvoid* data, const GLenum usage) {
	assert(data != NULL);
	GLuint vbo_object;
	glGenBuffers(1, &vbo_object);
	assert(vbo_object != 0);

	glBindBuffer(GL_ARRAY_BUFFER, vbo_object);
	glBufferData(GL_ARRAY_BUFFER, size, data, usage);
	glBindBuffer(GL_ARRAY_BUFFER, 0);

	return vbo_object;
}

First, we generate a new OpenGL vertex buffer object, and then we bind to it and upload the contents of data into the VBO. We also assert that the data is not null and that we successfully created a new vertex buffer object. Why do we assert instead of returning an error code? There are a couple of reasons for that:

  1. In the context of a game, there isn’t really a reasonable course of action that we can take in the event that creating a new VBO fails. Something is going to fail to display properly, so our game experience isn’t going to be as intended. We would also never expect this to fail, unless we’re abusing the platform and trying to do too much for the target hardware.
  2. Returning an error means that we now have to expand our code by handling the error and checking for the error at the other end, perhaps cascading that across several function calls. This adds a lot of maintenance burden with little gain.

I have been greatly influenced by an excellent series on this topic over at the Bitsquid blog.

assert() is only compiled into the program in debug mode by default, so in release mode, the application will just continue to run and might end up crashing on bad data. To avoid this, when going into production, you may want to create a special assert() that works in release mode and does a little bit more, perhaps showing a dialog box to the user before crashing and writing out a log to a file, so that it can be sent off to the developers.
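
Such a macro might look something like the sketch below. The macro name and the FATAL_LOG call are hypothetical stand-ins, not part of the project:

/* A hypothetical always-on assert; FATAL_LOG stands in for whatever
   logging facility the project ends up using. */
#include <stdlib.h>

#define ALWAYS_ASSERT(condition) \
    do { \
        if (!(condition)) { \
            FATAL_LOG("Assertion failed: %s at %s:%d", #condition, __FILE__, __LINE__); \
            abort(); \
        } \
    } while (0)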

Loading and compiling shaders

Let’s add the following shader.h:

#include "platform_gl.h"

GLuint compile_shader(const GLenum type, const GLchar* source, const GLint length);
GLuint link_program(const GLuint vertex_shader, const GLuint fragment_shader);
GLuint build_program(
	const GLchar * vertex_shader_source, const GLint vertex_shader_source_length,
	const GLchar * fragment_shader_source, const GLint fragment_shader_source_length);

/* Should be called just before using a program to draw, if validation is needed. */
GLint validate_program(const GLuint program);

Here, we have methods to compile a shader and to link two shaders into an OpenGL shader program. We also have a helper method here for validating a program, if we want to do that for debugging reasons.

Let’s begin the implementation for shader.c:

#include "shader.h"
#include "platform_gl.h"
#include "platform_log.h"
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TAG "shaders"

static void log_v_fixed_length(const GLchar* source, const GLint length) {
	if (LOGGING_ON) {
		char log_buffer[length + 1];
		memcpy(log_buffer, source, length);
		log_buffer[length] = '\0';

		DEBUG_LOG_WRITE_V(TAG, log_buffer);
	}
}

static void log_shader_info_log(GLuint shader_object_id) {
	if (LOGGING_ON) {
		GLint log_length;
		glGetShaderiv(shader_object_id, GL_INFO_LOG_LENGTH, &log_length);
		GLchar log_buffer[log_length];
		glGetShaderInfoLog(shader_object_id, log_length, NULL, log_buffer);

		DEBUG_LOG_WRITE_V(TAG, log_buffer);
	}
}

static void log_program_info_log(GLuint program_object_id) {
	if (LOGGING_ON) {
		GLint log_length;
		glGetProgramiv(program_object_id, GL_INFO_LOG_LENGTH, &log_length);
		GLchar log_buffer[log_length];
		glGetProgramInfoLog(program_object_id, log_length, NULL, log_buffer);

		DEBUG_LOG_WRITE_V(TAG, log_buffer);
	}
}

We’ve added some helper functions to help us log the shader and program info logs when logging is enabled. We’ll define LOGGING_ON and the other logging functions in other include files, soon. Let’s continue:

GLuint compile_shader(const GLenum type, const GLchar* source, const GLint length) {
	assert(source != NULL);
	GLuint shader_object_id = glCreateShader(type);
	GLint compile_status;

	assert(shader_object_id != 0);

	glShaderSource(shader_object_id, 1, (const GLchar **)&source, &length);
	glCompileShader(shader_object_id);
	glGetShaderiv(shader_object_id, GL_COMPILE_STATUS, &compile_status);

	if (LOGGING_ON) {
		DEBUG_LOG_WRITE_D(TAG, "Results of compiling shader source:");
		log_v_fixed_length(source, length);
		log_shader_info_log(shader_object_id);
	}

	assert(compile_status != 0);

	return shader_object_id;
}

We create a new shader object, pass in the source, compile it, and if everything was successful, we then return the shader ID. Now we need a method for linking two shaders together into an OpenGL program:

GLuint link_program(const GLuint vertex_shader, const GLuint fragment_shader) {
	GLuint program_object_id = glCreateProgram();
	GLint link_status;

	assert(program_object_id != 0);

	glAttachShader(program_object_id, vertex_shader);
	glAttachShader(program_object_id, fragment_shader);
	glLinkProgram(program_object_id);
	glGetProgramiv(program_object_id, GL_LINK_STATUS, &link_status);

	if (LOGGING_ON) {
		DEBUG_LOG_WRITE_D(TAG, "Results of linking program:");
		log_program_info_log(program_object_id);
	}

	assert(link_status != 0);

	return program_object_id;
}

To link the program, we pass in two OpenGL shader objects, one for the vertex shader and one for the fragment shader, and then we link them together. If all was successful, then we return the program object ID.

Let’s complete shader.c by adding two helper methods:

GLuint build_program(
	const GLchar * vertex_shader_source, const GLint vertex_shader_source_length, 
	const GLchar * fragment_shader_source, const GLint fragment_shader_source_length) {
	assert(vertex_shader_source != NULL);
	assert(fragment_shader_source != NULL);

	GLuint vertex_shader = compile_shader(
		GL_VERTEX_SHADER, vertex_shader_source, vertex_shader_source_length);
	GLuint fragment_shader = compile_shader(
		GL_FRAGMENT_SHADER, fragment_shader_source, fragment_shader_source_length);
	return link_program(vertex_shader, fragment_shader);
}

This helper method takes in the source for a vertex shader and a fragment shader, and returns the linked program object. Let’s add the second helper method:

GLint validate_program(const GLuint program) {
	if (LOGGING_ON) {
		int validate_status;

		glValidateProgram(program);
		glGetProgramiv(program, GL_VALIDATE_STATUS, &validate_status);
		DEBUG_LOG_PRINT_D(TAG, "Results of validating program: %d", validate_status);
		log_program_info_log(program);
		return validate_status;
	}

	return 0;
}

We can use validate_program() for debugging purposes, if we want some extra info about a program during a specific moment in our rendering code.
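
A hypothetical usage sketch, assuming the shader sources have already been read into memory (the variable names below are made up):

GLuint program = build_program(
    vertex_shader_source, vertex_shader_source_length,
    fragment_shader_source, fragment_shader_source_length);

/* Just before drawing, while debugging: */
validate_program(program);
glUseProgram(program);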

Loading in textures

Now we need some code to load in raw data into a texture. Let’s add the following into a new file called texture.h:

#include "platform_gl.h"

GLuint load_texture(
	const GLsizei width, const GLsizei height,
	const GLenum type, const GLvoid* pixels);

Let’s follow that up with the implementation in texture.c:

#include "texture.h"
#include "platform_gl.h"
#include <assert.h>

GLuint load_texture(
	const GLsizei width, const GLsizei height,
	const GLenum type, const GLvoid* pixels) {
	GLuint texture_object_id;
	glGenTextures(1, &texture_object_id);
	assert(texture_object_id != 0);

	glBindTexture(GL_TEXTURE_2D, texture_object_id);

	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
	glTexImage2D(
		GL_TEXTURE_2D, 0, type, width, height, 0, type, GL_UNSIGNED_BYTE, pixels);
	glGenerateMipmap(GL_TEXTURE_2D);

	glBindTexture(GL_TEXTURE_2D, 0);
	return texture_object_id;
}

This is pretty straightforward and not currently customized for special cases: it just loads the raw data from pixels into the texture, assuming that each component is 8-bit. It then sets up the texture for trilinear mipmapping.

Loading in PNG files

For this post, we’ll package our texture asset as a PNG file, and use libpng to decode the file into raw data. For that we’ll need to add some wrapper code around libpng so that we can decode a PNG file into raw data suitable for upload into an OpenGL texture.

Let’s create a new file called image.h, with the following contents:

#include "platform_gl.h"

typedef struct {
	const int width;
	const int height;
	const int size;
	const GLenum gl_color_format;
	const void* data;
} RawImageData;

/* Returns the decoded image data, or aborts if there's an error during decoding. */
RawImageData get_raw_image_data_from_png(const void* png_data, const int png_data_size);
void release_raw_image_data(const RawImageData* data);

We’ll use get_raw_image_data_from_png() to read in the PNG data from png_data and return the raw data in a struct. When we no longer need to keep that raw data around, we can call release_raw_image_data() to release the associated resources.
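
Putting this together with load_texture() from earlier, a typical call sequence would look something like the following sketch; it assumes png_file holds the raw bytes of a PNG asset that was loaded elsewhere:

/* Sketch: decode a PNG from memory, upload it, then free the decoded data. */
const RawImageData raw_image_data =
    get_raw_image_data_from_png(png_file.data, png_file.data_length);
const GLuint texture_object_id = load_texture(
    raw_image_data.width, raw_image_data.height,
    raw_image_data.gl_color_format, raw_image_data.data);
release_raw_image_data(&raw_image_data);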

Let’s start writing the implementation in image.c:

#include "image.h"
#include "platform_log.h"
#include <assert.h>
#include <png.h>
#include <string.h>
#include <stdlib.h>

typedef struct {
	const png_byte* data;
	const png_size_t size;
} DataHandle;

typedef struct {
	const DataHandle data;
	png_size_t offset;
} ReadDataHandle;

typedef struct {
	const png_uint_32 width;
	const png_uint_32 height;
	const int color_type;
} PngInfo;

We’ve started off with the includes and a few structs that we’ll be using locally. Let’s continue with a few function prototypes:

static void read_png_data_callback(
	png_structp png_ptr, png_byte* png_data, png_size_t read_length);
static PngInfo read_and_update_info(const png_structp png_ptr, const png_infop info_ptr);
static DataHandle read_entire_png_image(
	const png_structp png_ptr, const png_infop info_ptr, const png_uint_32 height);
static GLenum get_gl_color_format(const int png_color_format);

We’ll be using these as local helper functions. Now we can add the implementation for get_raw_image_data_from_png():

RawImageData get_raw_image_data_from_png(const void* png_data, const int png_data_size) {
	assert(png_data != NULL && png_data_size > 8);
	assert(png_check_sig((void*)png_data, 8));

	png_structp png_ptr = png_create_read_struct(
		PNG_LIBPNG_VER_STRING, NULL, NULL, NULL);
	assert(png_ptr != NULL);
	png_infop info_ptr = png_create_info_struct(png_ptr);
	assert(info_ptr != NULL);

	ReadDataHandle png_data_handle = (ReadDataHandle) {{png_data, png_data_size}, 0};
	png_set_read_fn(png_ptr, &png_data_handle, read_png_data_callback);

	if (setjmp(png_jmpbuf(png_ptr))) {
		CRASH("Error reading PNG file!");
	}

	const PngInfo png_info = read_and_update_info(png_ptr, info_ptr);
	const DataHandle raw_image = read_entire_png_image(
		png_ptr, info_ptr, png_info.height);

	png_read_end(png_ptr, info_ptr);
	png_destroy_read_struct(&png_ptr, &info_ptr, NULL);

	return (RawImageData) {
		png_info.width,
		png_info.height,
		raw_image.size,
		get_gl_color_format(png_info.color_type),
		raw_image.data};
}

There’s a lot going on here, so let’s explain each part in turn:

	assert(png_data != NULL && png_data_size > 8);
	assert(png_check_sig((void*)png_data, 8));

This checks that the PNG data is present and has a valid header.

	png_structp png_ptr = png_create_read_struct(
		PNG_LIBPNG_VER_STRING, NULL, NULL, NULL);
	assert(png_ptr != NULL);
	png_infop info_ptr = png_create_info_struct(png_ptr);
	assert(info_ptr != NULL);

This initializes the PNG structures that we’ll use to read in the rest of the data.

	ReadDataHandle png_data_handle = (ReadDataHandle) {{png_data, png_data_size}, 0};
	png_set_read_fn(png_ptr, &png_data_handle, read_png_data_callback);

As the PNG data is parsed, libpng will call read_png_data_callback() for each part of the PNG file. Since we’re reading in the PNG file from memory, we’ll use ReadDataHandle to wrap this memory buffer so that we can read from it as if it were a file.

	if (setjmp(png_jmpbuf(png_ptr))) {
		CRASH("Error reading PNG file!");
	}

This is how libpng does its error handling. If something goes wrong, then setjmp will return true and we’ll enter the body of the if statement. We want to handle this like an assert, so we just crash the program. We’ll define the CRASH macro later on.

	const PngInfo png_info = read_and_update_info(png_ptr, info_ptr);

We’ll use one of our helper functions here to parse the PNG information, such as the color format, and convert the PNG into a format that we want.

	const DataHandle raw_image = read_entire_png_image(
		png_ptr, info_ptr, png_info.height);

We’ll use another helper function here to read in and decode the PNG image data.

	png_read_end(png_ptr, info_ptr);
	png_destroy_read_struct(&png_ptr, &info_ptr, NULL);

	return (RawImageData) {
		png_info.width,
		png_info.height,
		raw_image.size,
		get_gl_color_format(png_info.color_type),
		raw_image.data};

Once reading is complete, we clean up the PNG structures and then we return the data inside of a RawImageData struct.

Let’s define our helper methods now:

static void read_png_data_callback(
	png_structp png_ptr, png_byte* raw_data, png_size_t read_length) {
	ReadDataHandle* handle = png_get_io_ptr(png_ptr);
	const png_byte* png_src = handle->data.data + handle->offset;

	memcpy(raw_data, png_src, read_length);
	handle->offset += read_length;
}

read_png_data_callback() will be called by libpng to read from the memory buffer. To read from the right place in the memory buffer, we store an offset and we increase that offset every time that read_png_data_callback() is called.

static PngInfo read_and_update_info(const png_structp png_ptr, const png_infop info_ptr)
{
	png_uint_32 width, height;
	int bit_depth, color_type;

	png_read_info(png_ptr, info_ptr);
	png_get_IHDR(
		png_ptr, info_ptr, &width, &height, &bit_depth, &color_type, NULL, NULL, NULL);

	// Convert transparency to full alpha
	if (png_get_valid(png_ptr, info_ptr, PNG_INFO_tRNS))
		png_set_tRNS_to_alpha(png_ptr);

	// Convert grayscale, if needed.
	if (color_type == PNG_COLOR_TYPE_GRAY && bit_depth < 8)
		png_set_expand_gray_1_2_4_to_8(png_ptr);

	// Convert paletted images, if needed.
	if (color_type == PNG_COLOR_TYPE_PALETTE)
		png_set_palette_to_rgb(png_ptr);

	// Add alpha channel, if there is none.
	// Rationale: GL_RGBA is faster than GL_RGB on many GPUs.
	if (color_type == PNG_COLOR_TYPE_PALETTE || color_type == PNG_COLOR_TYPE_RGB)
	   png_set_add_alpha(png_ptr, 0xFF, PNG_FILLER_AFTER);

	// Ensure 8-bit packing
	if (bit_depth < 8)
	   png_set_packing(png_ptr);
	else if (bit_depth == 16)
		png_set_scale_16(png_ptr);

	png_read_update_info(png_ptr, info_ptr);

	// Read the new color type after updates have been made.
	color_type = png_get_color_type(png_ptr, info_ptr);

	return (PngInfo) {width, height, color_type};
}

This helper function reads in the PNG data, and then it asks libpng to perform several transformations based on the PNG type:

  • Transparency information is converted into a full alpha channel.
  • Grayscale images are converted to 8-bit.
  • Paletted images are converted to full RGB.
  • RGB images get an alpha channel added, if none is present.
  • Color channels are converted to 8-bit, if less than 8-bit or 16-bit.

The PNG is then updated with the new transformations and the new color type is stored into color_type.

For the next step, we’ll add a helper function to decode the PNG image data into raw image data:

static DataHandle read_entire_png_image(
	const png_structp png_ptr, 
	const png_infop info_ptr, 
	const png_uint_32 height) 
{
	const png_size_t row_size = png_get_rowbytes(png_ptr, info_ptr);
	const int data_length = row_size * height;
	assert(row_size > 0);

	png_byte* raw_image = malloc(data_length);
	assert(raw_image != NULL);

	png_byte* row_ptrs[height];

	png_uint_32 i;
	for (i = 0; i < height; i++) {
		row_ptrs[i] = raw_image + i * row_size;
	}

	png_read_image(png_ptr, &row_ptrs[0]);

	return (DataHandle) {raw_image, data_length};
}

First, we allocate a block of memory large enough to hold the decoded image data. Since libpng wants to decode things line by line, we also need to setup an array on the stack that contains a set of pointers into this image data, one pointer for each line. We can then call png_read_image() to decode all of the PNG data and then we return that as a DataHandle.

Let’s add the last helper method:

static GLenum get_gl_color_format(const int png_color_format) {
	assert(png_color_format == PNG_COLOR_TYPE_GRAY
	    || png_color_format == PNG_COLOR_TYPE_RGB_ALPHA
	    || png_color_format == PNG_COLOR_TYPE_GRAY_ALPHA);

	switch (png_color_format) {
		case PNG_COLOR_TYPE_GRAY:
			return GL_LUMINANCE;
		case PNG_COLOR_TYPE_RGB_ALPHA:
			return GL_RGBA;
		case PNG_COLOR_TYPE_GRAY_ALPHA:
			return GL_LUMINANCE_ALPHA;
	}

	return 0;
}

This function will read in the PNG color format and return the matching OpenGL color format. We expect that after the transformations that we did, the PNG color format will be either PNG_COLOR_TYPE_GRAY, PNG_COLOR_TYPE_GRAY_ALPHA, or PNG_COLOR_TYPE_RGB_ALPHA, so we assert against those types.

To wrap up our image loading code, we just need to add the release method:

void release_raw_image_data(const RawImageData* data) {
	assert(data != NULL);
	free((void*)data->data);
}

We’ll call this when we’re done with the raw data and can return the associated memory to the heap.

The benefits of using libpng versus platform-specific code

At this point, you might be asking why we didn’t simply use what each platform offers us, such as BitmapFactory.decode??? on Android, where ??? is one of the decode methods. Using platform-specific code means that we would have to duplicate the work for each platform: on Android we would wrap some code around BitmapFactory, and on the other platforms we would do something else. This might be worthwhile if the platform-specific code did a better job; however, in my own testing on the Nexus 7, using BitmapFactory was actually a lot slower than calling libpng directly.

Here were the timings I observed for loading a single PNG file from the assets folder and uploading it into an OpenGL texture:

iPhone 5, libpng:       ~28ms
Nexus 7, libpng:        ~35ms
Nexus 7, BitmapFactory: ~93ms

To reduce possible sources of slowdown in the BitmapFactory path, I avoided JNI by having the Java code upload the data directly into a texture and return the texture object ID to C. I also used inScaled = false and placed the image in the assets folder to avoid extra scaling; if someone has more insight into this difference, I would definitely love to hear it! I can only surmise that there is a lot of extra work going on behind the scenes, or that the overhead of doing this from Java on the Dalvik VM is simply that large. The Nexus 7 is a powerful Android device, so these timings will be even worse on slower devices. Since libpng is faster than the platform-specific alternative, at least on Android, and since maintaining one set of code is easier than maintaining separate code for each platform, I’ve decided to use libpng on all platforms for PNG decoding.

Just for fun, here are the emscripten numbers on a MacBook Air with a 1.7 GHz Intel Core i5 and 4 GB of 1333 MHz DDR3 RAM, loading an uncompressed HTML file with embedded resources from the local filesystem:

Chrome 28, first time: ~318ms
Chrome 28, reload: ~67ms
Firefox 22: ~27ms

Interestingly enough, the code ran faster when it was compiled without the closure compiler and LLVM LTO.

Wrapping up the rest of the changes to the core folder

Let’s wrap up the rest of the changes to the core folder by adding the following files:

config.h:

#define LOGGING_ON 1

We’ll use this to control whether logging should be turned on or off.

macros.h:

#define UNUSED(x) (void)(x)

This will help us suppress compiler warnings about unused parameters, which is useful for JNI functions that are called from Java but don’t use all of their parameters.
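
For example, a JNI entry point that ignores its JNIEnv and jclass parameters could silence those warnings like this (a made-up function, purely to show the pattern; we’ll see the real usage in the JNI code later on):

#include "macros.h"
#include <jni.h>

JNIEXPORT void JNICALL Java_com_example_Example_do_1nothing(JNIEnv* env, jclass cls) {
	// Without these, -Wall -Wextra would warn about the unused parameters.
	UNUSED(env);
	UNUSED(cls);
}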

asset_utils.h

#include "platform_gl.h"

GLuint load_png_asset_into_texture(const char* relative_path);
GLuint build_program_from_assets(
	const char* vertex_shader_path, const char* fragment_shader_path);

We’ll use these helper methods in game.c to make it easier to load in the texture and shaders.

asset_utils.c

#include "asset_utils.h"
#include "image.h"
#include "platform_asset_utils.h"
#include "shader.h"
#include "texture.h"
#include <assert.h>
#include <stdlib.h>

GLuint load_png_asset_into_texture(const char* relative_path) {
	assert(relative_path != NULL);

	const FileData png_file = get_asset_data(relative_path);
	const RawImageData raw_image_data = 
		get_raw_image_data_from_png(png_file.data, png_file.data_length);
	const GLuint texture_object_id = load_texture(
		raw_image_data.width, raw_image_data.height, 
		raw_image_data.gl_color_format, raw_image_data.data);

	release_raw_image_data(&raw_image_data);
	release_asset_data(&png_file);

	return texture_object_id;
}

GLuint build_program_from_assets(
	const char* vertex_shader_path, const char* fragment_shader_path) {
	assert(vertex_shader_path != NULL);
	assert(fragment_shader_path != NULL);

	const FileData vertex_shader_source = get_asset_data(vertex_shader_path);
	const FileData fragment_shader_source = get_asset_data(fragment_shader_path);
	const GLuint program_object_id = build_program(
		vertex_shader_source.data, vertex_shader_source.data_length,
		fragment_shader_source.data, fragment_shader_source.data_length);

	release_asset_data(&vertex_shader_source);
	release_asset_data(&fragment_shader_source);

	return program_object_id;
}

This is the implementation for asset_utils.h. We’ll use load_png_asset_into_texture() to load a PNG file from the assets folder into an OpenGL texture, and we’ll use build_program_from_assets() to load in two shaders from the assets folder and compile and link them into an OpenGL shader program.

Updating game.c

We’ll need to update game.c to use all of the new code that we’ve added. Delete everything that’s there and replace it with the following start to our new code:

#include "game.h"
#include "asset_utils.h"
#include "buffer.h"
#include "image.h"
#include "platform_gl.h"
#include "platform_asset_utils.h"
#include "shader.h"
#include "texture.h"

static GLuint texture;
static GLuint buffer;
static GLuint program;

static GLint a_position_location;
static GLint a_texture_coordinates_location;
static GLint u_texture_unit_location;

// position X, Y, texture S, T
static const float rect[] = {-1.0f, -1.0f, 0.0f, 0.0f,
		                     -1.0f,  1.0f, 0.0f, 1.0f,
		                      1.0f, -1.0f, 1.0f, 0.0f,
		                      1.0f,  1.0f, 1.0f, 1.0f};

We’ve added our includes, a few local variables to hold the OpenGL objects and shader attribute and uniform locations, and an array of floats which contains a set of positions and texture coordinates for a rectangle that will completely fill the screen. We’ll use that to draw our texture onto the screen.

Let’s continue the code:

void on_surface_created() {
	glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
}

void on_surface_changed() {
	texture = load_png_asset_into_texture("textures/air_hockey_surface.png");
	buffer = create_vbo(sizeof(rect), rect, GL_STATIC_DRAW);
	program = build_program_from_assets("shaders/shader.vsh", "shaders/shader.fsh");

	a_position_location = glGetAttribLocation(program, "a_Position");
	a_texture_coordinates_location = 
		glGetAttribLocation(program, "a_TextureCoordinates");
	u_texture_unit_location = glGetUniformLocation(program, "u_TextureUnit");
}

In on_surface_created(), we set the clear color just as we did before. In on_surface_changed(), we load a texture from textures/air_hockey_surface.png, we create a VBO from the data stored in rect, and we build an OpenGL shader program from the shaders located at shaders/shader.vsh and shaders/shader.fsh. Once the program is loaded, we use it to grab the attribute and uniform locations from the shader.

We haven’t yet defined the code to load in the actual assets from the file system, since a good part of that is platform-specific. When we do, we’ll take care to set things up so that these relative paths “just work”.

Let’s complete game.c:

void on_draw_frame() {
	glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

	glUseProgram(program);

	glActiveTexture(GL_TEXTURE0);
	glBindTexture(GL_TEXTURE_2D, texture);
	glUniform1i(u_texture_unit_location, 0);

	glBindBuffer(GL_ARRAY_BUFFER, buffer);
	glVertexAttribPointer(a_position_location, 2, GL_FLOAT, GL_FALSE,
		4 * sizeof(GLfloat), BUFFER_OFFSET(0));
	glVertexAttribPointer(a_texture_coordinates_location, 2, GL_FLOAT, GL_FALSE,
		4 * sizeof(GLfloat), BUFFER_OFFSET(2 * sizeof(GLfloat)));
	glEnableVertexAttribArray(a_position_location);
	glEnableVertexAttribArray(a_texture_coordinates_location);
	glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);

	glBindBuffer(GL_ARRAY_BUFFER, 0);
}

In the draw loop, we clear the screen, select our shader program, bind the texture and VBO, set up the attributes with glVertexAttribPointer(), and then draw to the screen with glDrawArrays(). If you’ve looked at the Java tutorials before, you’ll notice that glVertexAttribPointer() is a bit easier to use from C than from Java: first, if we were using client-side arrays, we could just pass the array in without worrying about any ByteBuffers, and second, we can use the sizeof operator to get the size of a datatype in bytes, so there’s no need to hardcode it.

This wraps up everything for the core folder, so in the next few steps, we’re going to add in the necessary platform wrappers to get this working on Android.

Adding the common platform code

These new files should go in /airhockey/src/platform/common:

platform_file_utils.h

#pragma once
typedef struct {
	const long data_length;
	const void* data;
	const void* file_handle;
} FileData;

FileData get_file_data(const char* path);
void release_file_data(const FileData* file_data);

We’ll use this to read data from the file system on iOS and emscripten. We’ll also use FileData for our Android asset reading code. We won’t define the implementation of the functions for now since we won’t need them for Android.
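
Since it may help to see where this is headed, here’s a rough sketch of what the implementation could look like on platforms that read straight from the file system, assuming plain C standard I/O (this is only an illustration; the real iOS and emscripten versions come in the follow-up posts):

#include "platform_file_utils.h"
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

FileData get_file_data(const char* path) {
	assert(path != NULL);

	FILE* stream = fopen(path, "rb");
	assert(stream != NULL);

	// Find out how large the file is so we can read it in one go.
	fseek(stream, 0, SEEK_END);
	const long stream_size = ftell(stream);
	fseek(stream, 0, SEEK_SET);

	void* buffer = malloc(stream_size);
	fread(buffer, stream_size, 1, stream);
	assert(ferror(stream) == 0);
	fclose(stream);

	return (FileData) {stream_size, buffer, NULL};
}

void release_file_data(const FileData* file_data) {
	assert(file_data != NULL);
	assert(file_data->data != NULL);
	free((void*)file_data->data);
}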

platform_asset_utils.h

#include "platform_file_utils.h"

FileData get_asset_data(const char* relative_path);
void release_asset_data(const FileData* file_data);

We’ll use this to read in assets. For Android this will be specialized code since it will use the AssetManager class to read files straight from the APK file.

platform_log.h

#include "platform_macros.h"
#include "config.h"

void _debug_log_v(const char* tag, const char* text, ...) PRINTF_ATTRIBUTE(2, 3);
void _debug_log_d(const char* tag, const char* text, ...) PRINTF_ATTRIBUTE(2, 3);
void _debug_log_w(const char* tag, const char* text, ...) PRINTF_ATTRIBUTE(2, 3);
void _debug_log_e(const char* tag, const char* text, ...) PRINTF_ATTRIBUTE(2, 3);

#define DEBUG_LOG_PRINT_V(tag, fmt, ...) do { if (LOGGING_ON) _debug_log_v(tag, "%s:%d:%s(): " fmt, __FILE__, __LINE__, __func__, __VA_ARGS__); } while (0)
#define DEBUG_LOG_PRINT_D(tag, fmt, ...) do { if (LOGGING_ON) _debug_log_d(tag, "%s:%d:%s(): " fmt, __FILE__, __LINE__, __func__, __VA_ARGS__); } while (0)
#define DEBUG_LOG_PRINT_W(tag, fmt, ...) do { if (LOGGING_ON) _debug_log_w(tag, "%s:%d:%s(): " fmt, __FILE__, __LINE__, __func__, __VA_ARGS__); } while (0)
#define DEBUG_LOG_PRINT_E(tag, fmt, ...) do { if (LOGGING_ON) _debug_log_e(tag, "%s:%d:%s(): " fmt, __FILE__, __LINE__, __func__, __VA_ARGS__); } while (0)

#define DEBUG_LOG_WRITE_V(tag, text) DEBUG_LOG_PRINT_V(tag, "%s", text)
#define DEBUG_LOG_WRITE_D(tag, text) DEBUG_LOG_PRINT_D(tag, "%s", text)
#define DEBUG_LOG_WRITE_W(tag, text) DEBUG_LOG_PRINT_W(tag, "%s", text)
#define DEBUG_LOG_WRITE_E(tag, text) DEBUG_LOG_PRINT_E(tag, "%s", text)

#define CRASH(e) DEBUG_LOG_WRITE_E("Assert", #e); __builtin_trap()

This contains a bunch of macros to help us do logging from our core game code. CRASH() is a special macro that will log the message passed to it, then call __builtin_trap() to stop execution. We used this macro above when we were loading in the PNG file.
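
For example, a hypothetical call from our core code might look like this (the tag, function, and messages here are purely illustrative):

#include "platform_log.h"

static const char* TAG = "Game";  // illustrative tag

void logging_example(int bytes_loaded, int png_ok) {
	// The macro automatically prepends the file, line, and function name.
	DEBUG_LOG_PRINT_D(TAG, "Loaded %d bytes so far", bytes_loaded);

	if (!png_ok) {
		// Logs the message under the "Assert" tag, then halts via __builtin_trap().
		CRASH("Could not load the PNG file");
	}
}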

platform_macros.h

#if defined(__GNUC__)
#define PRINTF_ATTRIBUTE(format_pos, arg_pos) __attribute__((format(printf, format_pos, arg_pos)))
#else
#define PRINTF_ATTRIBUTE(format_pos, arg_pos)
#endif

This macro tells the compiler to check the format strings and arguments that we pass to our log functions, just as it does for printf().
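
For instance, with this attribute in place, a mismatched call such as the hypothetical one below gets flagged by gcc and clang at compile time instead of producing garbage output at runtime:

#include "platform_log.h"

void format_check_example(void) {
	// warning: format '%d' expects an int, but a string is passed
	DEBUG_LOG_PRINT_E("Game", "Expected %d bytes", "one hundred");
}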

Updating the Android code

For the Android target, we have a bit of cleanup to do first. Let’s open up the Android project in Eclipse, get rid of GameLibJNIWrapper.java and update RendererWrapper.java as follows:

package com.learnopengles.airhockey;

import javax.microedition.khronos.egl.EGLConfig;
import javax.microedition.khronos.opengles.GL10;

import android.content.Context;
import android.opengl.GLSurfaceView.Renderer;

import com.learnopengles.airhockey.platform.PlatformFileUtils;

public class RendererWrapper implements Renderer {	
	static {
		System.loadLibrary("game");		
	}
	
	private final Context context;	
	
	public RendererWrapper(Context context) {
		this.context = context;
	}
	
	@Override
	public void onSurfaceCreated(GL10 gl, EGLConfig config) {		
		PlatformFileUtils.init_asset_manager(context.getAssets());
		on_surface_created();
	}

	@Override
	public void onSurfaceChanged(GL10 gl, int width, int height) {
		on_surface_changed(width, height);
	}

	@Override
	public void onDrawFrame(GL10 gl) {
		on_draw_frame();
	}
	
	private static native void on_surface_created();

	private static native void on_surface_changed(int width, int height);

	private static native void on_draw_frame();
}

We’ve moved the native methods into RendererWrapper itself. The new RendererWrapper takes a Context in its constructor, so give it one by updating the constructor call in MainActivity.java as follows:

glSurfaceView.setRenderer(new RendererWrapper(this));

For Android, we’ll be using the AssetManager to read in assets that are compiled directly into the APK file. We’ll need a way to pass a reference to the AssetManager to our C code, so let’s create a new class called PlatformFileUtils in a new package called com.learnopengles.airhockey.platform, and add the following code:

package com.learnopengles.airhockey.platform;

import android.content.res.AssetManager;

public class PlatformFileUtils {
	public static native void init_asset_manager(AssetManager assetManager);	
}

We are calling init_asset_manager() from RendererWrapper.onSurfaceCreated(), which you can see just a few lines above.

Updating the JNI code

We’ll also need to add platform-specific JNI code to the jni folder in the android folder. Let’s start off with platform_asset_utils.c:

#include "platform_asset_utils.h"
#include "macros.h"
#include "platform_log.h"
#include <android/asset_manager_jni.h>
#include <assert.h>

static AAssetManager* asset_manager;

JNIEXPORT void JNICALL Java_com_learnopengles_airhockey_platform_PlatformFileUtils_init_1asset_1manager(
	JNIEnv * env, jclass jclazz, jobject java_asset_manager) {
	UNUSED(jclazz);
	asset_manager = AAssetManager_fromJava(env, java_asset_manager);
}

FileData get_asset_data(const char* relative_path) {
	assert(relative_path != NULL);
	AAsset* asset = 
		AAssetManager_open(asset_manager, relative_path, AASSET_MODE_STREAMING);
	assert(asset != NULL);

	return (FileData) { AAsset_getLength(asset), AAsset_getBuffer(asset), asset };
}

void release_asset_data(const FileData* file_data) {
	assert(file_data != NULL);
	assert(file_data->file_handle != NULL);
	AAsset_close((AAsset*)file_data->file_handle);
}

We use get_asset_data() to wrap Android’s native asset manager and return the data to the calling code, and we release the data when release_asset_data() is called. The advantage of doing things like this is that the asset manager can choose to optimize data loading by mapping the file into memory, and we can return that mapped data directly to the caller.

Let’s add the logging code:

platform_log.c

#include "platform_log.h"
#include <android/log.h>
#include <stdio.h>
#include <stdlib.h>

#define ANDROID_LOG_VPRINT(priority)	\
va_list arg_ptr; \
va_start(arg_ptr, fmt); \
__android_log_vprint(priority, tag, fmt, arg_ptr); \
va_end(arg_ptr);

void _debug_log_v(const char *tag, const char *fmt, ...) {
	ANDROID_LOG_VPRINT(ANDROID_LOG_VERBOSE);
}

void _debug_log_d(const char *tag, const char *fmt, ...) {
	ANDROID_LOG_VPRINT(ANDROID_LOG_DEBUG);
}

void _debug_log_w(const char *tag, const char *fmt, ...) {
	ANDROID_LOG_VPRINT(ANDROID_LOG_WARN);
}

void _debug_log_e(const char *tag, const char *fmt, ...) {
	ANDROID_LOG_VPRINT(ANDROID_LOG_ERROR);
}

This code wraps Android’s native logging facilities.

Finally, let’s rename jni.c to renderer_wrapper.c and update it to the following:

#include "game.h"
#include "macros.h"
#include <jni.h>

/* These functions are called from Java. */

JNIEXPORT void JNICALL Java_com_learnopengles_airhockey_RendererWrapper_on_1surface_1created(
	JNIEnv * env, jclass cls) {
	UNUSED(env);
	UNUSED(cls);
	on_surface_created();
}

JNIEXPORT void JNICALL Java_com_learnopengles_airhockey_RendererWrapper_on_1surface_1changed(
	JNIEnv * env, jclass cls, jint width, jint height) {
	UNUSED(env);
	UNUSED(cls);
	UNUSED(width);
	UNUSED(height);
	on_surface_changed();
}

JNIEXPORT void JNICALL Java_com_learnopengles_airhockey_RendererWrapper_on_1draw_1frame(
	JNIEnv* env, jclass cls) {
	UNUSED(env);
	UNUSED(cls);
	on_draw_frame();
}

Nothing has really changed here; we just use the UNUSED() macro (defined earlier in macros.h in the core folder) to suppress some unnecessary compiler warnings.

Updating the NDK build files

We’re almost ready to build and test; there are just a few things left to do. Download libpng 1.6.2 from http://www.libpng.org/pub/png/libpng.html and place it in /src/3rdparty/libpng. To configure libpng, copy pnglibconf.h.prebuilt from libpng/scripts/ to libpng/ and remove the .prebuilt extension.

To compile libpng with the NDK, let’s add a build script called Android.mk to the libpng folder, as follows:

LOCAL_PATH := $(call my-dir)

include $(CLEAR_VARS)

LOCAL_MODULE := libpng
LOCAL_SRC_FILES = png.c \
				  pngerror.c \
				  pngget.c \
				  pngmem.c \
				  pngpread.c \
				  pngread.c \
				  pngrio.c \
				  pngrtran.c \
				  pngrutil.c \
				  pngset.c \
				  pngtrans.c \
				  pngwio.c \
				  pngwrite.c \
				  pngwtran.c \
				  pngwutil.c
LOCAL_EXPORT_C_INCLUDES := $(LOCAL_PATH)
LOCAL_EXPORT_LDLIBS := -lz

include $(BUILD_STATIC_LIBRARY)

This build script will tell the NDK tools to build a static library called libpng that is linked against zlib, which is built into Android. It also sets up the right variables so that we can easily import this library into our own projects, and we won’t even have to do anything special because the right includes and libs are already exported.

Let’s also update the Android.mk file in our jni folder:

LOCAL_PATH := $(call my-dir)
PROJECT_ROOT_PATH := $(LOCAL_PATH)/../../../
CORE_RELATIVE_PATH := ../../../core/

include $(CLEAR_VARS)

LOCAL_MODULE    := game
LOCAL_CFLAGS    := -Wall -Wextra
LOCAL_SRC_FILES := platform_asset_utils.c \
                   platform_log.c \
                   renderer_wrapper.c \
                   $(CORE_RELATIVE_PATH)/asset_utils.c \
                   $(CORE_RELATIVE_PATH)/buffer.c \
                   $(CORE_RELATIVE_PATH)/game.c \
                   $(CORE_RELATIVE_PATH)/image.c \
                   $(CORE_RELATIVE_PATH)/shader.c \
                   $(CORE_RELATIVE_PATH)/texture.c

LOCAL_C_INCLUDES := $(PROJECT_ROOT_PATH)/platform/common/
LOCAL_C_INCLUDES += $(PROJECT_ROOT_PATH)/core/
LOCAL_STATIC_LIBRARIES := libpng
LOCAL_LDLIBS := -lGLESv2 -llog -landroid

include $(BUILD_SHARED_LIBRARY)

$(call import-add-path,$(PROJECT_ROOT_PATH)/3rdparty)
$(call import-module,libpng)

Our new build script links in the new files that we’ve created in core, and it also imports libpng from the 3rdparty folder and builds it as a static library that is then linked into our Android application.

Adding in the assets

The last step is to add the assets, which include the textures and the shaders, to /airhockey/assets. To do this, download the assets from https://github.com/learnopengles/airhockey/tree/article-2-loading-png-file/assets and place them in your airhockey folder. To have them automatically included in the APK, follow these steps:

  1. Delete the existing assets folder from the project.
  2. Right-click the project and select Properties. In the window that appears, select Resource->Linked Resources and click New….
  3. Enter ‘ASSETS_LOC’ as the name, and ‘${PROJECT_LOC}/../../../assets’ as the location. Once that’s done, click OK until the Properties window is closed.
  4. Right-click the project again and select New->Folder, enter ‘assets’ as the name, select Advanced, select Link to alternate location (Linked Folder), select Variables…, select ASSETS_LOC, and select OK, then Finish.

You should now have a new assets folder that is linked to the assets folder that we created in the airhockey root. More information can be found on Stack Overflow: How to link assets/www folder in Eclipse / Phonegap / Android project?

Running the app

We should be able to check out the new code now. If you run the app on your Android emulator or device, it should look similar to the following image:

Texture on the Nexus 7

The texture looks a bit stretched/squashed, because we are currently asking OpenGL to fill the screen with that texture. With a basic framework in place, we can start adding some more detail in future lessons and start turning this into an actual game.

Debugging NDK code

While developing this project, I had to hook up a debugger because something was going wrong in the PNG loading code, and I just wasn’t sure what. It turned out that I had confused a png_bytep* with a png_byte*: the ‘p’ in the first one means it’s already a pointer, so I didn’t need to add another star. I had some trouble getting the debugger working at first, so here are some tips that might help if you want to hook it up yourself:

  1. Your project absolutely cannot have any spaces in its path. Otherwise, the debugger will inexplicably fail to connect.
  2. The native code needs to be built with NDK_DEBUG=1; see “Debugging native applications” on this page: Using the NDK plugin.
  3. Android will not wait for gdb to connect before executing the code. Add SystemClock.sleep(10000); to RendererWrapper’s onSurfaceCreated() method to add a sufficient delay to hit your breakpoints.

Once that’s done, you can start debugging from Eclipse by right-clicking the project and selecting Debug As->Android Native Application.

Exploring further

The full source code for this lesson can be found at the GitHub project. For a “friendlier” introduction to OpenGL ES 2 that is focused on Java and Android, see Android Lesson One: Getting Started or OpenGL ES 2 for Android: A Quick-Start Guide.

What could we do to further streamline the code? If we were using C++, we could take advantage of destructors to create, for example, a FileData that cleans itself up when it goes out of scope. I’d also like to make the structs private somehow, as their internals don’t really need to be exposed to clients. What else would you do?

Further reading

In the next two posts, we’ll look at adding support for iOS and emscripten. Now that we’ve built up this base, it actually won’t take too much work!

Calling OpenGL from C on Android, Using the NDK

For this first post in the series Developing a Simple Game of Air Hockey Using C++ and OpenGL ES 2 for Android, iOS, and the Web, we’ll create a simple Android program that initializes OpenGL and then renders simple frames from native code.

Prerequisites

  • The Android SDK & NDK installed, along with a suitable IDE.
  • An emulator or a device supporting OpenGL ES 2.0.

We’ll be using Eclipse in this lesson.

To prepare and test the code for this article, I used revision 22.0.1 of the ADT plugin and SDK tools, and revision 17 of the platform and build tools, along with revision 8e of the NDK and Eclipse Juno Service Pack 2.

Getting started

The first thing to do is create a new Android project in Eclipse, with support for the NDK. You can follow along with all of the code at the GitHub project.

Before creating the new project, create a new folder called airhockey, and then create a new Git repository in that folder. Git is a version control system that will help you keep track of changes to the source and roll them back if anything goes wrong. To learn more about how to use Git, see the Git documentation.

To create a new project, select File->New->Android Application Project, and then create a new project called ‘AirHockey’, with the application name set to ‘Air Hockey’ and the package name set to ‘com.learnopengles.airhockey’. Leave the rest as defaults or fill them out as you prefer, and save this new project in a new folder called android, inside the airhockey folder that we created in the previous step.

Once the project has been created, right-click on the project in the Package Explorer, select Android Tools from the drop-down menu, then select Add Native Support…. When asked for the Library Name, enter ‘game’ and hit Finish, so that the library will be called libgame.so. This will create a new folder called jni in the project tree.

Initializing OpenGL

With our project created, we can now edit the default activity and configure it to initialize OpenGL. We’ll first add two member variables to the top of our activity class:

	private GLSurfaceView glSurfaceView;
	private boolean rendererSet;

Now we can set the body of onCreate() as follows:

	@Override
	protected void onCreate(Bundle savedInstanceState) {
		super.onCreate(savedInstanceState);

		ActivityManager activityManager
			= (ActivityManager) getSystemService(Context.ACTIVITY_SERVICE);
		ConfigurationInfo configurationInfo = activityManager.getDeviceConfigurationInfo();

		final boolean supportsEs2 =
			configurationInfo.reqGlEsVersion >= 0x20000 || isProbablyEmulator();

		if (supportsEs2) {
			glSurfaceView = new GLSurfaceView(this);

			if (isProbablyEmulator()) {
				// Avoids crashes on startup with some emulator images.
				glSurfaceView.setEGLConfigChooser(8, 8, 8, 8, 16, 0);
			}

			glSurfaceView.setEGLContextClientVersion(2);
			glSurfaceView.setRenderer(new RendererWrapper());
			rendererSet = true;
			setContentView(glSurfaceView);
		} else {
			// Should never be seen in production, since the manifest filters
			// unsupported devices.
			Toast.makeText(this, "This device does not support OpenGL ES 2.0.",
					Toast.LENGTH_LONG).show();
			return;
		}
	}

First we check if the device supports OpenGL ES 2.0, and then if it does, we initialize a new GLSurfaceView and configure it to use OpenGL ES 2.0.

The check for configurationInfo.reqGlEsVersion >= 0x20000 doesn’t work on the emulator, so we also call isProbablyEmulator() to see if we’re running on an emulator. Let’s define that method as follows:

	private boolean isProbablyEmulator() {
		return Build.VERSION.SDK_INT >= Build.VERSION_CODES.ICE_CREAM_SANDWICH_MR1
				&& (Build.FINGERPRINT.startsWith("generic")
						|| Build.FINGERPRINT.startsWith("unknown")
						|| Build.MODEL.contains("google_sdk")
						|| Build.MODEL.contains("Emulator")
						|| Build.MODEL.contains("Android SDK built for x86"));
	}

OpenGL ES 2.0 will only work in the emulator if it’s been configured to use the host GPU. For more info, read Android Emulator Now Supports Native OpenGL ES2.0!

Let’s complete the activity by adding the following methods:

	@Override
	protected void onPause() {
		super.onPause();

		if (rendererSet) {
			glSurfaceView.onPause();
		}
	}

	@Override
	protected void onResume() {
		super.onResume();

		if (rendererSet) {
			glSurfaceView.onResume();
		}
	}

We need to handle the Android lifecycle, so we also pause & resume the GLSurfaceView as needed. We only do this if we’ve also called glSurfaceView.setRenderer(); otherwise, calling these methods will cause the application to crash.

For a more detailed introduction to OpenGL ES 2, see Android Lesson One: Getting Started or OpenGL ES 2 for Android: A Quick-Start Guide.

Adding a default renderer

Create a new class called RendererWrapper, and add the following code:

public class RendererWrapper implements Renderer {
	@Override
	public void onSurfaceCreated(GL10 gl, EGLConfig config) {
		glClearColor(0.0f, 0.0f, 1.0f, 0.0f);
	}

	@Override
	public void onSurfaceChanged(GL10 gl, int width, int height) {
		// No-op
	}

	@Override
	public void onDrawFrame(GL10 gl) {
		glClear(GL_COLOR_BUFFER_BIT);
	}
}

This simple renderer will set the clear color to blue and clear the screen on every frame. Later on, we’ll change these methods to call into native code. To call methods like glClearColor() without prefixing them with GLES20, add import static android.opengl.GLES20.*; to the top of the class file, then select Source->Organize Imports.

If you have any issues in getting the code to compile, ensure that you’ve organized all imports, and that you’ve included the following imports in RendererWrapper:

import javax.microedition.khronos.egl.EGLConfig;
import javax.microedition.khronos.opengles.GL10;

import android.opengl.GLSurfaceView.Renderer;

Updating the manifest to exclude unsupported devices

We should also update the manifest to make sure that we exclude devices that don’t support OpenGL ES 2.0. Add the following somewhere inside AndroidManifest.xml:

    <uses-feature
        android:glEsVersion="0x00020000"
        android:required="true" />

Since OpenGL ES 2.0 is only fully supported from Android Gingerbread 2.3.3 (API 10), replace any existing <uses-sdk /> tag with the following:

    <uses-sdk
        android:minSdkVersion="10"
        android:targetSdkVersion="17" />

If we run the app now, we should see a blue screen as follows:

First pass

Adding native code

We’ve verified that things work from Java, but what we really want to do is to be using OpenGL from native code! In the next few steps, we’ll move the OpenGL code to a set of C files and setup an NDK build for these files.

We’ll be sharing this native code with our future projects for iOS and the web, so let’s create a folder called common located one level above the Android project. What this means is that in your airhockey folder, you should have one folder called android, containing the Android project, and a second folder called common which will contain the common code.

Linking a relative folder that lies outside of the project’s base folder is unfortunately not the easiest thing to do in Eclipse. To accomplish this, we’ll have to follow these steps:

  1. Right-click the project and select Properties. In the window that appears, select Resource->Linked Resources and click New….
  2. Enter ‘COMMON_SRC_LOC’ as the name, and ‘${PROJECT_LOC}\..\common’ as the location. Once that’s done, click OK until the Properties window is closed.
  3. Right-click the project again and select Build Path->Link Source…, select Variables…, select COMMON_SRC_LOC, and select OK. Enter ‘common’ as the folder name and select Finish, then close the Properties window.

You should now see a new folder in your project called common, linked to the folder that we created.

Let’s create two new files in the common folder, game.c and game.h. You can create these files by right-clicking on the folder and selecting New->File. Add the following to game.h:

void on_surface_created();
void on_surface_changed();
void on_draw_frame();

In C, a .h file is known as a header file and can be thought of as an interface to a given .c source file. This header declares three functions that we’ll be calling from Java.

Let’s add the following implementation to game.c:

#include "game.h"
#include "glwrapper.h"

void on_surface_created() {
	glClearColor(1.0f, 0.0f, 0.0f, 0.0f);
}

void on_surface_changed() {
	// No-op
}

void on_draw_frame() {
	glClear(GL_COLOR_BUFFER_BIT);
}

This code will set the clear color to red, and will clear the screen every time on_draw_frame() is called. We’ll use a special header file called glwrapper.h to wrap the platform-specific OpenGL libraries, as they are often located at a different place for each platform.
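
For Android, glwrapper.h will simply pull in the NDK’s GLES2 header, as we’ll see in the next section; each platform gets its own version pointing at its own headers. As a rough sketch of the idea (the non-Android include paths below are assumptions on my part, and the real versions appear in the later posts), it amounts to something like this:

// glwrapper.h -- conceptual sketch only. You could keep a separate copy of
// this header per platform folder (as we do for Android below), or use
// #ifdefs along these lines in a single shared header.
#if defined(__APPLE__)
	#include <OpenGLES/ES2/gl.h>   // assumed path for iOS
#elif defined(__EMSCRIPTEN__)
	#include <GLES2/gl2.h>         // emscripten provides GLES2-style headers
#else
	#include <GLES2/gl2.h>         // Android NDK
#endif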

Adding platform-specific code and JNI code

To use this code, we still need to add two things: a definition for glwrapper.h, and some JNI glue code so that we can call our C code from Java. JNI stands for Java Native Interface, and it’s how C and Java can talk to each other on Android.

Inside your project, create a new file called glwrapper.h in the jni folder, with the following contents:

#include <GLES2/gl2.h>

That wraps Android’s OpenGL headers. To create the JNI glue, we’ll first need to create a Java class that exposes the native interface that we want. To do this, let’s create a new class called GameLibJNIWrapper, with the following code:

public class GameLibJNIWrapper {
	static {
		System.loadLibrary("game");
	}

	public static native void on_surface_created();

	public static native void on_surface_changed(int width, int height);

	public static native void on_draw_frame();
}

This class will load the native library called libgame.so, which is the name we’ll give our native library later on when we create the build scripts for it. To create the matching C file for this class, build the project, open up a command prompt, change to the bin/classes folder of your project, and run the following command:

javah -o ../../jni/jni.c com.learnopengles.airhockey.GameLibJNIWrapper

The javah command should be located in your JDK’s bin directory. This command will create a jni.c file that will look very messy, with a bunch of stuff that we don’t need. Let’s simplify the file and replace it with the following contents:

#include "../../common/game.h"
#include <jni.h>

JNIEXPORT void JNICALL Java_com_learnopengles_airhockey_GameLibJNIWrapper_on_1surface_1created
	(JNIEnv * env, jclass cls) {
	on_surface_created();
}

JNIEXPORT void JNICALL Java_com_learnopengles_airhockey_GameLibJNIWrapper_on_1surface_1changed
	(JNIEnv * env, jclass cls, jint width, jint height) {
	on_surface_changed();
}

JNIEXPORT void JNICALL Java_com_learnopengles_airhockey_GameLibJNIWrapper_on_1draw_1frame
	(JNIEnv * env, jclass cls) {
	on_draw_frame();
}

We’ve simplified the file greatly, and we’ve also added a reference to game.h so that we can call our game methods. Here’s how it works:

  1. GameLibJNIWrapper declares the native methods that we want to be able to call from Java.
  2. To be able to call these C functions from Java, they have to be named in a very specific way, and each function also has to have at least two parameters: a pointer to a JNIEnv as the first parameter and a jclass as the second. To make life easier, we can use javah to create the appropriate function signatures for us in a file called jni.c (see the example after this list for how the names map).
  3. From jni.c, we call the functions that we declared in game.h and defined in game.c. That completes the connections and allows us to call our native functions from Java.
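
To make the naming convention concrete, here’s how one of our own methods maps from Java to C. The C name is built from "Java_", then the package and class with dots replaced by underscores, then the method name; any underscore that already appears in the Java name is escaped as "_1", which is why our function names contain those sequences:

#include <jni.h>

// Java side (in com.learnopengles.airhockey.GameLibJNIWrapper):
//     public static native void on_surface_created();
//
// Resulting C signature ('.' becomes '_', and each '_' in the method name
// becomes "_1"):
JNIEXPORT void JNICALL Java_com_learnopengles_airhockey_GameLibJNIWrapper_on_1surface_1created
	(JNIEnv* env, jclass cls);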

Compiling the native code

To compile and run the native code, we need to describe our native sources to the NDK build system. We’ll do this with two files that should go in the jni folder: Android.mk and Application.mk. When we added native support to our project, a file called game.cpp was automatically created in the jni folder. We won’t be needing this file, so you can go ahead and delete it.

Let’s set Android.mk to the following contents:

LOCAL_PATH := $(call my-dir)

include $(CLEAR_VARS)

LOCAL_MODULE    := game
LOCAL_CFLAGS    := -Wall -Wextra
LOCAL_SRC_FILES := ../../common/game.c jni.c
LOCAL_LDLIBS := -lGLESv2

include $(BUILD_SHARED_LIBRARY)

This file describes our sources, and tells the NDK that it should compile game.c and jni.c and build them into a shared library called libgame.so. This shared library will be dynamically linked with libGLESv2.so at runtime.

When specifying this file, be careful not to leave any trailing spaces after any of the commands, as this may cause the build to fail.

The next file, Application.mk, should have the following contents:

APP_PLATFORM := android-10
APP_ABI := armeabi-v7a

This tells the NDK build system to build for Android API 10, so that it doesn’t complain about us using features that aren’t present in earlier versions of Android. It also tells the build system to generate a library for the ARMv7-A architecture, which supports hardware floating point and which most newer Android devices use.

Updating RendererWrapper

Before we can see our new changes, we have to update RendererWrapper to call into our native code. We can do that by updating it as follows:

	@Override
	public void onSurfaceCreated(GL10 gl, EGLConfig config) {
		GameLibJNIWrapper.on_surface_created();
	}

	@Override
	public void onSurfaceChanged(GL10 gl, int width, int height) {
		GameLibJNIWrapper.on_surface_changed(width, height);
	}

	@Override
	public void onDrawFrame(GL10 gl) {
		GameLibJNIWrapper.on_draw_frame();
	}

The renderer now calls our GameLibJNIWrapper class, which calls the native functions in jni.c, which calls our game functions defined in game.c.

Building and running the application

You should now be able to build and run the application. When you build the application, a new shared library called libgame.so should be created in your project’s /libs/armeabi-v7a/ folder. When you run the application, it should look as follows:

Second pass

Since the clear color has changed from blue to red, we know that our native code is being called.

Exploring further

The full source code for this lesson can be found at the GitHub project. For a more detailed introduction to OpenGL ES 2, see Android Lesson One: Getting Started or OpenGL ES 2 for Android: A Quick-Start Guide.

In the next part of this series, we’ll create an iOS project and we’ll see how easy it is to reuse our code from the common folder and wrap it up in Objective-C. Please let me know if you have any questions or feedback!