Sunday, August 27, 2017

About shader compilers, IR's, and where the time is spent

Occasionally the question comes up about why we convert between various IR's (intermediate representations), like glsl to NIR, in the process of compiling a shader.  Wouldn't it be faster if we just skipped a step and went straight from glsl to "the final thing", which would be ir3 (freedreno), codegen (nouveau), or LLVM (radeonsi/radv).  It is a reasonable question, since most people haven't worked on compilers and we probably haven't done a good job at explaining all the various passes involved in compiling a shader or presenting a breakdown of where the time is spent.

So I spent a bit of time this morning with perf to profile a shader-db run (or rather a subset of a full run to keep the perf.data size manageable, see notes at end).
A flamegraph from the shader-db run, since every blog post needs a catchy picture.

Breakdown:

  • parser, into glsl: 9.98%
  • glsl to nir: 1.3%
  • nir opt/lowering passes: 21.4%
    • CSE: 6.9%
    • opt algebraic: 3.5%
    • conversion to SSA: 2.1%
    • DCE: 2.0%
    • copy propagation: 1.3%
    • other lowering passes: 5.6%
  • nir to ir3: 1.5%
  • ir3 passes:  21.5%
    • register allocation: 5.1%
    • sched: 14.3%
    • other: 2.1%
  • assembly (ir3->binary): 0.66%
This is ignoring some of the fixed overheads of shader-db runner, and also doesn't capture individually a bunch of NIR lowering passes.  NIR has ~40 lowering passes, some that are gl related like nir_lower_draw_pixels and nir_lower_wpos_ytransform (because for hysterical reasons textures and therefore FBO's are upside down in gl).  For gallium drivers using NIR, these gl specific passes are called from mesa state-tracker.

The other lowering passes are not gl specific but tend to be specific to general GPU shader features (ie. things that you wouldn't find in a C compiler for a cpu) and things that are needed by multiple different drivers.  Such as, nir_lower_tex which handles sampling from YUV textures, ie. inserting the instructions to do YUV->RGB conversion (since GLES and android strongly assume this is a thing that hardware can always do), lowering RECT textures, or clamping texture coords.  These lowering passes are called from the driver backend so the driver is in control of what lowering pass are needed, including configuration about individual features in passes which handle multiple things, based on what the hardware does not support directly.

These lowering passes are mostly O(n), and lost in the noise.

Also note that freedreno, along with the other drivers that can consume NIR directly, disable a bunch of opt passes that were originally done in glsl, but that NIR (or LLVM) can do more efficiently.  For freedreno, disabling the glsl opt passes shaved ~30% runtime off of a shader-db run, so spending 1.3% to convert into NIR is way more than offset.

For other drivers, the breakdown may be different.  I expect radeonsi/radv skips some of the general opt passes in NIR which have a counterpart in LLVM, but re-uses other lowering passes which do not have a counterpart in LLVM.


Is it still a gallium driver?


This is a related question that comes up sometimes, is it a gallium driver if it doesn't use TGSI?  Yes.

The drivers that can consume NIR and implement the gallium pipe driver interface, freedreno a3xx+, vc4, vc5, and radeonsi (optionally), are gallium drivers.  They still have to accept TGSI for state trackers which do not support NIR, and various built-in shaders (blits, mipmap generation, etc).  Most use the shared tgsi_to_nir pass for TGSI shaders.  Note that currently tgsi_to_nir does not support all the TGSI features, but just features needed by internal shaders, and what is needed for gl3/gles3 (ie. basically what freedreno and vc4 needed before mesa state-tracker grew support for glsl_to_nir).


Notes:


Collected from shader-db run (glamor + supertuxkart + 0ad shaders) with a debug mesa build (to have debug syms and prevent inlining) but with NIR_VALIDATE=0 (otherwise results with debug builds are highly skewed).  A subset of all shader-db shaders was used to keep the perf.data size manageable.