Wednesday, August 12, 2009

NEON is fashionable

... as long as we aren't talking about your wardrobe..

so, my college Daniel brought to my attention a gstreamer use-case that was in need of some performance optimization. When decoding DVD content, the audio samples from the AC-3 decoder are in float-32bit format. These need to be converted to 16bit integer format to play through alsasink. But the overhead of audioconvert on the ARM was quite high. Which seemed like a good enough excuse to learn NEON. So last weekend I broke out oprofile and had a look at where the cycles went, and what needed to be optimized. The pipeline I used for testing was:

gst-launch filesrc use-mmap=true num-buffers=12000 location=/my-clip.vob ! dvddemux name=d d.current_audio ! a52dec ! audioconvert ! audio/x-raw-int,width=16,depth=16! fakesink

Initially, this pipeline took roughly 20.7sec. By comparision, the same pipeline without audioconvert:

gst-launch filesrc use-mmap=true num-buffers=12000 location=/my-clip.vob ! dvddemux name=d d.current_audio ! a52dec ! fakesink

took roughly 11.1s. A couple functions immediately stood out: gst_audio_quantize_quantize_signed_tpdf_none() and audio_convert_unpack_float_le(). The former was particularly bad just due to the random number generation for dithering, which contained a divide by 0xffffffff. As Barbie (and processors everywhere) say, division is hard. Many embedded processors will emulate division (read: expensive), and even processors that have divide instructions, the cycle count is high. Just changing this to 32bit right shift (ie. divide by (1LLU<<32)) made a big improvement. Results might not be exactly the same, but it is close enough that you couldn't hear the difference. In the end, I left gst_fast_random_int32_range() untouched, but replacement vectorized code kept the >>32 approach.

The next step was to come up with a way to plug in accelerated versions of various audioconvert functions. I changed them all to use __attribute__((weak)) which is a neat trick of the GNU compiler (and ARM ltd. compiler) which lets you provide default versions of some symbol which can be overridden at link time. (Unfortunately this doesn't seem to be supported by all compilers that gstreamer supports, so I need to come up with a more portable approach.) This let me add an armv7 specific file to re-implement these two functions. The algorithms themselves stay basically the same, but are processing four elements at a time, except for a few instructions which use 64bit math and those are processing two at a time. In the end, the pipeline with audioconvert dropped to ~12sec, roughly 10x improvement for audioconvert (through a combination of processing 4x samples at a time, faster psuedo-random number generation, and the fact that NEON provides a saturating addition instruction). That is with gcc NEON intrinsics. I guess a few cycles could be saved by writing it all in assembly, which was my original plan, but at this point these two functions only show up a couple pages down in oprofile output.. so time would be better spent looking into liba52 (audio decoder). The video part of a playback pipeline is no issue, the DSP on OMAP3 can decode this without breaking a sweat!

Here is the patch. In the end, probably the long-term solution for audioconvert would be to use orc, for cross-platform vector acceleration. Although currently orc doesn't support floating point, and doesn't have a free NEON back-end. I may clean up my patch (to address the portability issues, and make the build system figure out whether or not to build my new armv7.c file depending on the target architecture). In the end, it depends on how much of a short-term solution that would be..

update: I thought I'd add a few links that I found useful
  • some Cortex-A8 / NEON info from TI wiki
  • NEON and VFP Programming section in ARM compiler tools assembler guide, gives a good reference on all the NEON/VFP instructions
  • ARM NEON Intrinsics section from GCC manual (but not really any more info compared to what you can find by just looking at arm_neon.h).. but once you know the instruction you want (from previous link) this is useful to find the matching intrinsic name.
  • Using NEON Support appendix from ARM compiler reference guide.. the GCC intrinsics pretty much match ARM's, and the ARM doc has more useful info

Monday, August 10, 2009

Sending GIT patches via email from behind oppressive proxies

I was wanting to setup a wait to email patches via my gmail account.. I found this post from my friend Nishanth: Nishanth' tech rambles: Setting up Email forwarding System for GIT, which was quite helpful.

On my macosx laptop, sendmail worked out of the box.. but it finds the SMTP server from DNS. So at work it was using an internal SMTP server. And at home, it wasn't finding any SMTP server. And for some reason using the internal SMTP server seemed to only work for sending patches to internal addresses. I'm sure the built-in SMTP client could be configured to work as I needed, but I really had no desire to figure out how.

Following the instructions on Nishanth's post, I setup msmtp. (go macports!) And it works great. Somehow it automagically works properly both behind the proxy at work, and from home, depending on the systemwide proxy settings. (The power of Steve Jobs compels it!)

To send patches, I use this alias:

alias gsend='git send-email --from "<addr>" --envelope-sender "<addr>" --smtp-server /opt/local/bin/msmtp'

(replace <addr> with your email address)

Sunday, August 9, 2009

silly programming games

catching up on some old news... since that's easier than making up new news..

A while back, my buddy Mike decided it would be fun to write a quine. It was clever. But one good quine deserves another (or something like that). But I couldn't think of any approach for a 'C' quine that hadn't already been done in some form or another. And the empty file <insert-your-favorite-script-language-here> quine was too easy. So I wrote a self-disassembler in ARM asm: quine-1.S. Technically, it doesn't do any file I/O (if you ignore the fact that the program loader mmap's executable files / shared libraries), so I don't think it is cheating. If you are curious, the original with comments (slightly easier to follow) is here: quine-0.S. If you happen to have some sort of ARM platform running linux, the comments in the second file should explain how to compile it. If you don't, go out and buy one of these.

First post..

Well, after resisting for so long, I decided to join the 21st century and start a blog... lets see how long this lasts..