Wednesday, November 16, 2016

a quick note for users/distros

At this point, I haven't pushed a new release tag for xf86-video-freedreno to update to latest xserver ABI.  I'm inclined not to.  If you are using a modern xserver you probably want to be using xf86-video-modesetting + glamor.  It has more features (dri3, xv, etc) and better performance.  And GL support on a3xx/a4xx is pretty solid.  So distros with a modern xserver might as well drop the xf86-video-freedreno package.

The one case where xf86-video-freedreno is still useful is bringing up a new generation of adreno, since it can do dri2 with pure-sw fallbacks for all the EXA ops.  But if that is what you are doing, I guess you know how to git clone and build.

The possible alternative is to push a patch that makes xf86-video-freedreno still build, but only probe (with latest xserver ABI) if some "ForceLoad" type option is given in xorg.conf, otherwise fallback to modesetting/glamor.  I can't think of a good reason to do this at the moment.  But as always, questions/comments/suggestions welcome.

Saturday, July 30, 2016

dirty tricks for moar fps!

This weekend I landed a patchset in mesa to add support for resource shadowing and batch re-ordering in freedreno.  What this is, will take a bit of explaining, but the tl;dr: is a nice fps boost in many games/apps.

But first, a bit of background about tiling gpu's:  the basic idea of a tiler is to render N draw calls a tile at a time, with a tile's worth of the "framebuffer state" (ie. each of the MRT color bufs + depth/stencil) resident in an internal tile buffer.  The idea is that most of your memory traffic is to/from your color and z/s buffers.  So rather than rendering each of your draw calls in it's entirety, you split the screen up into tiles and repeat each of the N draws for each tile to fast internal/on-chip memory.  This avoids going back to main memory for each of the color and z/s buffer accesses, and enables a tiler to do more with less memory bandwidth.  But it means there is never a single point in the sequence of draws.. ie. draw #1 for tile #2 could happen after draw #2 for tile #1.  (Also, that is why GL_TIMESTAMP queries are bonkers for tilers.)

For purpose of discussion (and also how things are named in the code, if you look), I will define a tile-pass, ie. rendering of N draws for each tile in succession (or even if multiple tiles are rendered in parallel) as a "batch".

Unfortunately, many games/apps are not written with tilers in mind.  There are a handful of common anti-patterns which force a driver for a tiling gpu to flush the current batch.  Examples are unnecessary FBO switches, and texture or UBO uploads mid-batch.

For example, with a 1920x1080 r8g8b8a8 render target, with z24s8 depth/stencil buffer, an unnecessary batch flush costs you 16MB of write memory bandwidth, plus another 16MB of read when we later need to pull the data back into the tile buffer.  That number can easily get much bigger with games using float16 or float32 (rather than 8 bits per component) intermediate buffers, and/or multiple render targets.  Ie. two MRT's with float16 internal-format plus z24s8 z/s would be 40MB write + 40MB read per extra flush.

So, take the example of a UBO update, at a point where you are not otherwise needing to flush the batch (ie. swapbuffers or FBO switch).  A straightforward gl driver for a tiler would need to flush the current batch, so each of the draws before the UBO update would see the old state, and each of the draws after the UBO update would see the new state.

Enter resource shadowing and batch reordering.  Two reasonably big (ie. touches a lot of the code) changes in the driver which combine to avoid these extra batch flushes, as much as possible.

Resource shadowing is allocating a new backing GEM buffer object (BO) for the resource (texture/UBO/VBO/etc), and if necessary copying parts of the BO contents to the new buffer (back-blit).

So for the example of the UBO update, rather than taking the 16MB+16MB (or more) hit of a tile flush, why not just create two versions of the UBO.  It might involve copying a few KB's of UBO (ie. whatever was not overwritten by the game), but that is a lot less than 32MB?

But of course, it is not that simple.  Was the buffer or texture level mapped with GL_MAP_INVALIDATE_BUFFER_BIT or GL_MAP_INVALIDATE_RANGE_BIT?  (Or GL API that implies the equivalent, although fortunately as a gallium driver we don't have to care so much about all the various different GL paths that amount to the same thing for the hw.)  For a texture with mipmap levels, we unfortunately don't know at the time where we need to create the new shadow BO whether the next GL calls will glGenerateMipmap() or upload the remaining mipmap levels.  So there is a bit of complexity in handling all the cases properly.  There may be a few more cases we could handle without falling back to flushing the current batch, but for now we handle all the common cases.

The batch re-ordering component of this allows any potential back-blits from the shadow'd BO to the new BO (when resource shadowing kicks in), to be split out into a separate batch.  The resource/dependency tracking between batches and resources (ie. if various batches need to read from a given resource, we need to know that so they can be executed before something writes to the resource) lets us know which order to flush various in-flight batches to achieve correct results.  Note that this is partly because we use util_blitter, which turns any internally generated resource-shadowing back-blits into normal draw calls (since we don't have a dedicated blit pipe).. but this approach also handles the unnecessary FBO switch case for free.

Unfortunately, the batch re-ordering required a bit of an overhaul about how cmdstream buffers are handled, which required changes in all layers of the stack (mesa + libdrm + kernel).  The kernel changes are in drm-next for 4.8 and libdrm parts are in the latest libdrm release.  And while things will continue to work with a new userspace and old kernel, all these new optimizations will be disabled.

(And, while there is a growing number of snapdragon/adreno SBC's and phones/tablets getting upstream attention, if you are stuck on a downstream 3.10 kernel, look here.)

And for now, even with a new enough kernel, for the time being reorder support is not enabled by default.  There are a couple more piglit tests remaining to investigate, but I'll probably flip it to be enabled by default (if you have a new enough kernel) before the next mesa release branch.  Until then, use FD_MESA_DEBUG=reorder (and once the default is switched, that would be FD_MESA_DEBUG=noreorder to disable).

I'll cover the implementation and tricks to keep the CPU overhead of all this extra bookkeeping small later (probably at XDC2016), since this post is already getting rather long.  But the juicy bits: ~30% gain in supertuxkart (new render engine) and ~20% gain in manhattan are the big winners.  In general at least a few percent gain in most things I looked at, generally in the 5-10% range.





Wednesday, May 4, 2016

Freedreno (not so) periodic update

Since I seem to be not so good at finding time for blog updates recently, this update probably covers a greater timespan than it should, and some of this is already old news ;-)

Already quite some time ago, but in case you didn't already notice: with the mesa 11.1 release, freedreno now supports up to (desktop) gl3.1 on both a3xx and a4xx (in addition to gles3).  Which is high enough to show up on the front page at glxinfo.  (Which, btw, is a useful tool to see exactly which gl/gles extensions are supported by which version of mesa on various different hw.)

A couple months back, I spent a bit of time starting to look at performance.  On master now (so will be in 11.3), we have timestamp and time-elapsed query support for a4xx, and I may expose a few more performance counters (mostly for the benefit of gallium HUD).  I still need to add support for a3xx, but already this is useful to help profile.  In addition, I've cobbled together a simple fdperf cmdline tool:



I also got around to (finally) implementing hw binning support for a4xx, which for *some* games can have a pretty big perf boost:
  • glmark2 'refract' bench (an extreme example): 31fps -> 124fps
  • xonotic (med): 44.4fps -> 50.3fps
  • supertuxkart (new render engine): 15fps -> 19fps
More recently I've started to run the dEQP gles3 tests against freedreno.  Initially the results where not too good, but since then I've fixed a couple thousand test cases.. fortunately it was just a few bugs and a couple missing workaround for hw bug/limitations (depending on how you look at it) which counted for the bulk of the fails.  Now we are at 98.9% pass (or 99.5% if you don't count the 'skips' against the pass ratio).  These fixes have also helped piglit, where we are now up to 98.3% pass.  These figures are a4xx, but most of the fixes apply to a3xx as well.

I've also made some improvements in ir3 (shader compiler for a3xx and later) so the code it generates is starting to be pretty decent.  The immediate->const lowering that I just pushed helps reduce register pressure in a lot of cases.  We still need support for spilling, but at least now shadertoy (which is some sort of cruel joke against shader compiler writers) isn't a complete horror show:



In other cool news, in case you had not already seen: Rob Herring and John Stultz from linaro have been doing some cool work, with Rob getting android running on an upstream kernel plus mesa running on db410c and qemu (with freedreno and virtgl), and John taking all that, and getting it all running on a nexus7 tablet.  (And more recently, getting wifi working as well.)  I had the opportunity to see this in person when I was at Linaro Connect in March.  It might not seem impressive if you are unfamiliar with the extent to which android device kernels diverge from upstream, but to see an upstream kernel running on an actual device with only ~50patches is quite a feat:



The UI was actually reasonably fast, despite not yet using overlays to bypass GPU for composition.  But as ongoing work in drm/kms for explicit fencing, and mesa EGL_ANDROID_native_fence_sync land, we should be able to get hw composition working.