A performance comparison redux: Java, C, and Renderscript on the Nexus 5

In my previous post on this topic, A performance comparison between Java and C on the Nexus 5, I compared the performance of an audio low-pass filter in Java and C. The results were clear: The C version outperformed, and by a significant amount. This result brought more attention to the post than I was expecting; some of you were also curious about RenderScript, and I’m pleased to say that Jean-Luc Brouillet, a member of Google’s RenderScript team, got in touch with me and generously volunteered an implementation of the DSP code in RenderScript.

With this new code, I refactored the code into a new benchmark with test audio data, so that I could compare the different implementations and verify their output. I’ll be sharing both the code and the results with you today.

Motivations and intentions

Some of you might be curious about why I am so interested in this subject. 🙂 I normally spend most of my development hours coding for Android, using Java; in fact, my first book, OpenGL ES 2 for Android: A Quick-Start Guide, is a beginner’s guide to OpenGL that focuses on Android and Java.

Normally, when I develop code, the most important questions on my mind are: “Is this easy to maintain?” “Is it correct?” “If I come back and revisit this code a month later, am I going to understand what the heck I was doing?” Since Java is the primary development language on Android, it just makes sense for me to do most of my development there.

So why the recent focus on native development? Here are two big reasons:

  • The performance of Java on Android isn’t suitable for everything. For critical performance paths, it can be a big competitive advantage to move that code over to native, so that it completes in less time and uses less battery.
  • I’m interested in branching out to other platforms down the road, probably starting with iOS, and I’m curious if it makes sense to share some code between iOS and Android using a common code base in C/C++. It’s important that this code runs without many abstractions in the way, so I’m not very interested in custom/proprietary solutions like Xamarin’s C# or an HTML5-based toolkit.

It’s starting to become clear to me that it can make sense to work with more than one language, and to choose these languages in situations where the benefits outweigh the cost. Trying to work with Android’s UI toolkit from C++ is painful; running a DSP filter from Java and watching it use more battery and take more time than it needs to is just as painful.

Our new test scenario

For this round of benchmarks, we’ll be comparing several different implementations of a low-pass IIR filter, implemented with coefficients generated with mkfilter. We’ll run a test audio file through each implementation, and record the best score for each.

How does the test work?

  1. First, we load a test audio file into memory.
  2. We then execute the DSP algorithm over the test audio, benchmarking the elapsed time. The data is processed in chunks to reflect the fact that this is similar to how we would process data coming off of the microphone.
  3. The results are written to a new audio file in the device’s default storage, under “PerformanceTest/”.

Here are our test implementations:

  1. Java. This is a straightforward implementation of the algorithm.
  2. Java (tuned). This is the same as 1, but with all of the functions manually inlined.
  3. C. This uses the Java Native Interface (JNI) to pass the data back and forth.
  4. RenderScript. A big thank you to Mr. Brouillet from the RenderScript team for taking the time to contribute this!

The tests were run on a Nexus 5 device running Android 4.4.3. Here are the results:

Results

Implementation Execution environment Compiler Shorts/second Relative run time
(lower is better)
C Dalvik JNI gcc 4.6 17,639,081 1.00
C Dalvik JNI gcc 4.8 16,516,757 1.07
RenderScript Dalvik RenderScript (API 19) 15,206,495 1.16
RenderScript Dalvik RenderScript (API 18) 13,234,397 1.33
C Dalvik JNI clang 3.4 13,208,408 1.34
Java (tuned) Art (Proguard) 7,235,607 2.44
Java (tuned) Art 7,097,363 2.49
Java (tuned) Dalvik 5,678,990 3.11
Java (tuned) Dalvik (Proguard) 5,365,814 3.29
Java Art (Proguard) 3,512,426 5.02
Java Art 3,049,986 5.78
Java Dalvik (Proguard) 1,220,600 14.45
Java Dalvik 1,083,015 16.29

 

For this test, the C implementation is the king of the hill, with gcc 4.6 giving the best performance. The gcc compiler is followed by RenderScript and clang 3.4, and the two Java implementations are at the back of the pack, with Dalvik giving the worst performance.

C

The C implementation compiled with gcc gave the best performance out of the entire group. All tests were done with -ffast-math and -O3, using the NDK r9d. Switching between Dalvik and ART had no impact on the C run times.

I’m not sure why there is still a large gap between clang and gcc; would everything on iOS run that much faster if Apple was using gcc?  Clang will likely continue to improve and I hope to see this gap closed in the future. I’m also curious about why gcc 4.6 seems to generate better code than 4.8. Perhaps someone familiar with ARM assembly and the compilers would be able to weigh in why?

Even though I’m a newbie at C and I learned about JNI in part by doing these benchmarks, I didn’t find the code overly difficult to write. There’s enough documentation out there that I was able to figure things out, and the algorithm output matches that of the other implementations; however, since C is an unsafe language, I’m not entirely convinced that I haven’t stumbled into undefined behaviour or otherwise done something insane. 🙂

RenderScript

In the previous post, someone asked about RenderScript, so I started working on an implementation. Unfortunately, I had zero experience with RenderScript at the time so I wasn’t able to get it working. Luckily, Jean-Luc Brouillet from the RenderScript team also saw the post and ported over the algorithm for me!

As you can see, the results are very promising: RenderScript offers better performance than clang and almost the same performance as gcc, without requiring use of the NDK or of JNI glue! As with C, switching between Dalvik and ART had no impact on the run times.

RenderScript also offers the possibility to easily parallelize the code and/or run it on the GPU which can potentially give a huge speedup, though unfortunately we weren’t able to take advantage of that here since this particular DSP algorithm is not trivially parallelizable. However, for other algorithms like a simple gain, RenderScript can give a significant boost with small changes to the code, and without having to worry about threading or other such headaches.

In my humble view, the RenderScript implementation does need some more polishing and the documentation needs to be significantly improved, as I doubt I would have gotten it working on my own without help. Here are some of the issues that I ran into with the RenderScript port:

  • Not all functions are documented. For example, the algorithm uses rsSetElementAt_short() which I can’t find anywhere except for some obscure C files in the Android source code.
  • The allocation functions are missing a way to copy data into an offset of an array. To work around this, I use a scratch buffer and System.arraycopy() to move the data around, and to keep things fair, I changed the other implementations to work in the same way. While this slows them down slightly, I don’t believe it’s an unfair advantage for RenderScript, because in real-world usage, I would expect to process the data coming off the microphone and write that directly into a file, not into an offset of some array.
  • The fastest RenderScript implementation only works on Android 4.4 KitKat devices. Going down one version to Android 4.3 changes the RenderScript API which requires me to change the code slightly, slowing things down for both 4.3 and 4.4. RenderScript does offer a “support” mode via the support API which should enable backwards compatibility, but I wasn’t able to get this to work for me for APIs older than 18 (Android 4.3).

So while there are some issues with RenderScript as implemented today, these are all issues that can hopefully be fixed. RenderScript also has the significant advantage of running code on the CPU and GPU in parallel, and doesn’t require JNI glue code. This makes it a serious contender to C, especially if portability to older devices or other platforms is not a big concern.

Java

As with last time, Java fills out the bottom of the pack. The performance is especially terrible with the default Dalvik implementation; in fact, it would be even worse if I hadn’t manually replaced the modulo operator with a bit mask, which I was hoping the compiler could do with the static information available to it, but it doesn’t.

Some people asked about Proguard, so I tried it out with the following config (full details in the test project):

-optimizationpasses 5
-allowaccessmodification
-dontpreverify

-dontusemixedcaseclassnames
-dontskipnonpubliclibraryclasses
-verbose

The results were mixed. Switching between Dalvik and ART made much more of a difference, as did manually inlining all of the functions together. The best result with Dalvik was without Proguard, and was 3.11x slower than the best C implementation. The best result with ART was with Proguard, and was 2.44x slower than the best C implementation. If we compare the normal Java version to the best C result, we get a 5.02x slowdown with ART and a 14.45x slowdown with Dalvik.

It does look like the performance of Java will be getting a lot better once ART becomes widely deployed; instead of huge slowdowns, we’ll be seeing between 3x and 5x, which does make a difference. I can already see the improvements when sorting and displaying ListViews in UI code, so this isn’t just something that affects niche code like audio DSP filters.

Desktop results (just for fun)

Just like last time, again, here are some desktop results, just for fun. 🙂 These tests were run on a 2.6 GHz Intel Core i7 running OS X 10.9.3.

Implementation Execution environment Compiler Shorts/second Relative speed
(higher is better)
C Java SE 1.6u65 JNI gcc 4.9 129,909,662 7.36
C Java SE 1.6u65 JNI clang 3.4 96,022,644 5.44
Java Java SE 1.8u5 (+XX:AggressiveOpts) 82,988,332 4.70
Java (tuned) Java SE 1.8u5 (+XX:AggressiveOpts) 79,288,025 4.50
Java Java SE 1.8u5 64,964,399 3.68
Java (tuned) Java SE 1.8u5 64,748,201 3.67
Java (tuned) Java SE 1.6u65 63,965,575 3.63
Java Java SE 1.6u65 53,245,864 3.02

 

As on the Nexus 5, the C implementation compiled with gcc dominates; however, I’m very impressed with where Java ended up!

C

I used the following compilers with optimization flags -march=native -ffast-math -O3:

  • Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
  • gcc version 4.9.0 20140416 (prerelease) (MacPorts gcc49 4.9-20140416_2)

As on the Nexus 5, gcc’s generated code is much faster than clang’s; perhaps this will change in the future but for now, gcc is still the king. I also find it interesting that the gap between the best run time here and the best run time on the Nexus 5 is similar to the gap between C and ART on the Nexus 5. Not so far apart, they are!

Java

I’m also impressed with the latest Java for OS X. While manually inlining all of the functions together was required for an improvement on Java 1.6, the manually-inlined version was actually slower on Java 1.8. This shows that not only is this sort of code abuse no longer required on the latest Java, but also that the compiler is smarter than we are at optimizing the code.

Adding +XX:AggressiveOpts to Java 1.8 sped things up even more, almost closing the gap with clang! That is very impressive in my eyes, since Java has an old reputation of being a slow language, but in some cases and situations, it can be almost as fast as C if not faster.

The worst Java performance is 2.43x slower than the best C performance, which is about the same relative difference as the best Java performance on Android with ART. Performance differences aren’t always just about language choice; they can also be very dependent on the quality of implementation. At this time, the Google team has made different trade-offs which place ART at around the same relative level of performance, for this specific test case, as Java 1.6. The improved performance of Java 1.8 on the desktop shows that it’s clearly possible to close up the gap on Android in the future.

Explore the code!

The project can be downloaded at GitHub: https://github.com/learnopengles/dsp-perf-test. To compile the code, download or clone the repository and import the projects into Eclipse with File->Import->Existing Projects Into Workspace. If the Android project is missing references, go to its properties, Java Build Path, Projects, and add “JavaPerformanceTest”.

The results are written to “PerformanceTest/” on the device’s default storage, so please double-check that you don’t have anything there before running the tests.

So, what do you think? Does it make sense to drop down into native code? Or are native languages a relic of the past, and there’s no reason to use anything other than modern, safe languages? I would love to hear your feedback.

A performance comparison between Java and C on the Nexus 5

Android phones have been growing ever more powerful with time, with the Nexus 5 sporting a quad-core 2.3 GHz Krait 400; this is a very powerful CPU for a mobile phone. With most Android apps being written in Java, does Java allow us to access all of that power? Or, put another way, is Java efficient enough, allowing tasks to complete more quickly and allowing the CPU to idle more, saving precious battery life?

(Note: An updated version of this comparison is available at A Performance Comparison Redux: Java, C, and Renderscript on the Nexus 5, along with source code).

In this post, I will take a look at a DSP filter adapted from coefficients generated with mkfilter, and compare three different implementations: one in C, one in Java, and one in Java with some manual optimizations. The source for these tests can be downloaded at the end of this post.

To compare the results, I ran the filter over an array of random data on the Nexus 5, and the compared the results to the fastest implementation. In the following table, a lower runtime is better, with the fastest implementation getting a relative runtime of 1.

Execution environment Options Relative runtime (lower is better)
gcc 4.8 1.00
gcc 4.8 (LOCAL_ARM_NEON := true) -ffast-math -O3 1.02
gcc 4.8 -ffast-math -O3 1.05
clang 3.4 (LOCAL_ARM_NEON := true) -ffast-math -O3 1.27
clang 3.4 -ffast-math -O3 1.42
clang 3.4 1.43
ART (manually-optimized) 2.22
Dalvik (manually-optimized) 2.87
ART (normal code) 7.99
Dalvik (normal code) 17.78

The statically-compiled C code gave the best execution times, followed by ART and then by Dalvik. The C code uses JNI via GetShortArrayRegion and SetShortArrayRegion to marshal the data from Java to C, and then back from C to Java once processing has completed.

The best performance came courtesy of GCC 4.8, with little variation between the different additional optimization options. Clang’s ARM builds are not quite as optimized as GCC’s; toggling LOCAL_ARM_NEON := true in the NDK makefile also makes a clear difference in performance.

Even the slowest native build using clang is not more than 43% slower than the best native build using gcc. Once we switch to Java, the variance starts to increase significantly, with the best runtime about 2.2x slower than native code, and the worst runtime a staggering 17.8x slower.

What explains the large difference? For one, it appears that both ART and Dalvik are limited in the amount of static optimizations that they are capable of. This is understandable in the case of Dalvik, since it uses a JIT and it’s also much older, but it is disappointing in the case of ART, since it uses ahead-of-time compilation.

Is there a way to speed up the Java code? I decided to try it out, by applying the same static optimizations I would have expected the compiler to do, like converting modulo to bit masks and inlining function calls. These changes resulted in one massive and hard to read function, but they also dramatically improved the runtime performance, with Dalvik speeding up from a 17.8x penalty to 2.9x, and ART speeding up from an 8.0x penalty to 2.2x.

The downside of this is that the code has to be abused to get this additional performance, and it still doesn’t come close to matching the ahead-of-time code generated by gcc and clang, which can surpass that performance without similar abuse of the code. The NDK is still a viable option for those looking for improved performance and more efficient code which consumes less battery over time.

Just for fun, I decided to try things out on a laptop with a 2.6 GHz Intel Core i7. For this table, the relative results are in the other direction, with 1x corresponding to the best time on the Nexus 5, 2x being twice as fast, and so on. The table starts with the best results first, as before.

Execution environment Options Relative speed (higher is better)
clang 3.4 -O3 -ffast-math -flto 8.38x
clang 3.4 -O3 -ffast-math 6.09x
Java SE 1.7u51 (manually-optimized) -XX:+AggressiveOpts 5.25x
Java SE 1.6u65 (manually-optimized) 3.85x
Java SE 1.6 (normal code) 2.44x

As on the Nexus 5, the C code runs faster, but to Java’s credit, the gap between the best & worst result is less than 4x, which is much less variance than we see with Dalvik or ART. Java 1.6 and 1.7 are very close to each other, unless “-XX:+AggressiveOpts” is used; with that option enabled, 1.7 is able to pull ahead.

There is still an unfortunate gap between the “normal” code and the manually-optimized code, which really should be closable with static analysis and inlining.

The other interesting result is that the gap between mobile and PC is closing over time, and even more so if you take power consumption into account. It’s quite impressive to see that as far as single-core performance goes, the PC and smartphone are closer than ever.

Conclusion

Recent Android devices are getting very powerful, and with the new ART runtime, common Java code can be executed quickly enough to keep user interfaces responsive and users happy.

Sometimes, though, we need to go further, and write demanding code that needs to run quickly and efficiently. With the latest Android devices, these algorithms may be able to run quickly enough in the Dalvik VM or with ART, but then we have to ask ourselves: is the benefit of using a single language worth the cost of lower performance? This isn’t just an academic question: lower performance means that we need to ask our users to give us more CPU cycles, which shortens their device’s battery life, heats up their phones, and makes them wait longer for results, and all because we didn’t want to write the code in another language.

For these reasons, writing some of our code in C/C++, FORTRAN, or another native language can still make a lot of sense.

For more reading on this topic, check out How Powerful is Your Nexus 7?

Source

dsp.c
#include "dsp.h"
#include <algorithm>
#include <cstdint>
#include <limits>

static constexpr int int16_min = std::numeric_limits<int16_t>::min();
static constexpr int int16_max = std::numeric_limits<int16_t>::max();

static inline int16_t clamp(int input)
{
     return std::max(int16_min, std::min(int16_max, input));
}

static inline int get_offset(const FilterState& filter_state, int relative_offset)
{
     return (filter_state.current + relative_offset) % filter_state.size;
}

static inline void push_sample(FilterState& filter_state, int16_t sample)
{
     filter_state.input[get_offset(filter_state, 0)] = sample;
     ++filter_state.current;
}

static inline int16_t get_output_sample(const FilterState& filter_state)
{
     return clamp(filter_state.output[get_offset(filter_state, 0)]);
}

static inline void apply_lowpass(FilterState& filter_state)
{
     double* x = filter_state.input;
     double* y = filter_state.output;

     y[get_offset(filter_state, 0)] =
       (  1.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state, -10)] + x[get_offset(filter_state,  -0)]))
     + ( 10.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -9)] + x[get_offset(filter_state,  -1)]))
     + ( 45.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -8)] + x[get_offset(filter_state,  -2)]))
     + (120.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -7)] + x[get_offset(filter_state,  -3)]))
     + (210.0 * (1.0 / 6.928330802e+06) * (x[get_offset(filter_state,  -6)] + x[get_offset(filter_state,  -4)]))
     + (252.0 * (1.0 / 6.928330802e+06) *  x[get_offset(filter_state,  -5)])

     + (  -0.4441854896 * y[get_offset(filter_state, -10)])
     + (   4.2144719035 * y[get_offset(filter_state,  -9)])
     + ( -18.5365677633 * y[get_offset(filter_state,  -8)])
     + (  49.7394321983 * y[get_offset(filter_state,  -7)])
     + ( -90.1491003509 * y[get_offset(filter_state,  -6)])
     + ( 115.3235358151 * y[get_offset(filter_state,  -5)])
     + (-105.4969191433 * y[get_offset(filter_state,  -4)])
     + (  68.1964705422 * y[get_offset(filter_state,  -3)])
     + ( -29.8484881821 * y[get_offset(filter_state,  -2)])
     + (   8.0012026712 * y[get_offset(filter_state,  -1)]);
}

void apply_lowpass(FilterState& filter_state, const int16_t* input, int16_t* output, int length)
{
     for (int i = 0; i < length; ++i) {
          push_sample(filter_state, input[i]);
          apply_lowpass(filter_state);
          output[i] = get_output_sample(filter_state);
     }
}
dsp.h
#include <cstdint>

struct FilterState {
	static constexpr int size = 16;

    double input[size];
    double output[size];
	unsigned int current;

	FilterState() : input{}, output{}, current{} {}
};

void apply_lowpass(FilterState& filter_state, const int16_t* input, int16_t* output, int length);

Here is the Java adaptation of the C code:

package com.example.perftest;

import com.example.perftest.DspJavaManuallyOptimized.FilterState;

public class DspJava {
	public static class FilterState {
		static final int size = 16;

		final double input[] = new double[size];
		final double output[] = new double[size];

		int current;
	}

	static short clamp(short input) {
		return (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, input));
	}

	static int getOffset(FilterState filterState, int relativeOffset) {
		return ((filterState.current + relativeOffset) % FilterState.size + FilterState.size) % FilterState.size;
	}

	static void pushSample(FilterState filterState, short sample) {
		filterState.input[getOffset(filterState, 0)] = sample;
		++filterState.current;
	}

	static short getOutputSample(FilterState filterState) {
		return clamp((short) filterState.output[getOffset(filterState, 0)]);
	}
	
	static void applyLowpass(FilterState filterState) {
		final double[] x = filterState.input;
		final double[] y = filterState.output;

		y[getOffset(filterState, 0)] =
		   (  1.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState, -10)] + x[getOffset(filterState,  -0)]))
		 + ( 10.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -9)] + x[getOffset(filterState,  -1)]))
		 + ( 45.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -8)] + x[getOffset(filterState,  -2)]))
		 + (120.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -7)] + x[getOffset(filterState,  -3)]))
		 + (210.0 * (1.0 / 6.928330802e+06) * (x[getOffset(filterState,  -6)] + x[getOffset(filterState,  -4)]))
		 + (252.0 * (1.0 / 6.928330802e+06) *  x[getOffset(filterState,  -5)])

		 + (  -0.4441854896 * y[getOffset(filterState, -10)])
		 + (   4.2144719035 * y[getOffset(filterState,  -9)])
		 + ( -18.5365677633 * y[getOffset(filterState,  -8)])
		 + (  49.7394321983 * y[getOffset(filterState,  -7)])
		 + ( -90.1491003509 * y[getOffset(filterState,  -6)])
		 + ( 115.3235358151 * y[getOffset(filterState,  -5)])
		 + (-105.4969191433 * y[getOffset(filterState,  -4)])
		 + (  68.1964705422 * y[getOffset(filterState,  -3)])
		 + ( -29.8484881821 * y[getOffset(filterState,  -2)])
		 + (   8.0012026712 * y[getOffset(filterState,  -1)]);
	}

	public static void applyLowpass(FilterState filterState, short[] input, short[] output, int length) {
		for (int i = 0; i < length; ++i) {
			pushSample(filterState, input[i]);
			applyLowpass(filterState);
			output[i] = getOutputSample(filterState);
		}
	}
}

Since all of the Java runtimes tested don’t exploit static optimization opportunities as well as it seems that they could, here is an optimized version that has been inlined and has the modulo replaced with a bit mask:

package com.example.perftest;

public class DspJavaManuallyOptimized {
	public static class FilterState {
		static final int size = 16;

		final double input[] = new double[size];
		final double output[] = new double[size];

		int current;
	}

	public static void applyLowpass(FilterState filterState, short[] input, short[] output, int length) {
		for (int i = 0; i < length; ++i) {
			filterState.input[(filterState.current + 0) & (FilterState.size - 1)] = input[i];
			++filterState.current;
			final double[] x = filterState.input;
			final double[] y = filterState.output;

			y[(filterState.current + 0) & (FilterState.size - 1)] =
			   (  1.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -10) & (FilterState.size - 1)] + x[(filterState.current + -0) & (FilterState.size - 1)]))
			 + ( 10.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -9) & (FilterState.size - 1)] + x[(filterState.current + -1) & (FilterState.size - 1)]))
			 + ( 45.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -8) & (FilterState.size - 1)] + x[(filterState.current + -2) & (FilterState.size - 1)]))
			 + (120.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -7) & (FilterState.size - 1)] + x[(filterState.current + -3) & (FilterState.size - 1)]))
			 + (210.0 * (1.0 / 6.928330802e+06) * (x[(filterState.current + -6) & (FilterState.size - 1)] + x[(filterState.current + -4) & (FilterState.size - 1)]))
			 + (252.0 * (1.0 / 6.928330802e+06) *  x[(filterState.current + -5) & (FilterState.size - 1)])

			 + (  -0.4441854896 * y[(filterState.current + -10) & (FilterState.size - 1)])
			 + (   4.2144719035 * y[(filterState.current + -9) & (FilterState.size - 1)])
			 + ( -18.5365677633 * y[(filterState.current + -8) & (FilterState.size - 1)])
			 + (  49.7394321983 * y[(filterState.current + -7) & (FilterState.size - 1)])
			 + ( -90.1491003509 * y[(filterState.current + -6) & (FilterState.size - 1)])
			 + ( 115.3235358151 * y[(filterState.current + -5) & (FilterState.size - 1)])
			 + (-105.4969191433 * y[(filterState.current + -4) & (FilterState.size - 1)])
			 + (  68.1964705422 * y[(filterState.current + -3) & (FilterState.size - 1)])
			 + ( -29.8484881821 * y[(filterState.current + -2) & (FilterState.size - 1)])
			 + (   8.0012026712 * y[(filterState.current + -1) & (FilterState.size - 1)]);
			output[i] = (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, (short) filterState.output[(filterState.current + 0) & (FilterState.size - 1)]));
		}
	}
}