Full sub-counter decomposition of compileInner: named leaf timers plus (self) residuals. A (self) residual is the part of a parent's time not covered by any named child; for example, the autodiff transform is counted under linkAndOptimizeIR (self). The topmost band traces compileInner; hover over a band to see its phase.
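The (self) residual is plain arithmetic: a parent's total minus the sum of its named children. The sketch below is only an illustration of that bookkeeping. The `Timer` class, the child names other than linkAndOptimizeIR, and every millisecond value are made up for the example and are not taken from this report.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Timer:
    """Hypothetical timer node: a name, its total time, and named children."""
    name: str
    total_ms: float
    children: List["Timer"] = field(default_factory=list)


def self_ms(t: Timer) -> float:
    """Time in `t` not covered by any named child (the '(self)' residual)."""
    return t.total_ms - sum(c.total_ms for c in t.children)


def flatten(t: Timer) -> Iterator[Tuple[str, float]]:
    """Yield (label, ms) bands: named leaves plus one '(self)' band per parent."""
    if not t.children:
        yield t.name, t.total_ms
        return
    yield f"{t.name} (self)", self_ms(t)
    for child in t.children:
        yield from flatten(child)


# Made-up numbers, chosen only so the arithmetic is easy to follow.
root = Timer("compileInner", 100.0, [
    Timer("linkAndOptimizeIR", 60.0, [
        Timer("hypotheticalChildA", 25.0),
        Timer("hypotheticalChildB", 15.0),
    ]),
    Timer("hypotheticalFrontEnd", 30.0),
])
bands = dict(flatten(root))
```

With these numbers, `bands["linkAndOptimizeIR (self)"]` is 20.0, and the bands sum back to the compileInner total, which is the property that makes the decomposition "full".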
Compiled Slang source
Exact compiled source (N = 300). Long files show only the first 40 lines, the 40 lines on either side of computeMain, and the last 40 lines; gaps are elided.
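The elision rule described above can be sketched as follows. This is a guess at the behaviour, not the report's actual code: the function name `excerpt`, the gap marker text, and the choice to window only the first line containing the anchor are all assumptions.

```python
from typing import List


def excerpt(lines: List[str], anchor: str = "computeMain",
            head: int = 40, tail: int = 40, around: int = 40,
            gap: str = "[... elided ...]") -> List[str]:
    """Keep the first `head` lines, the last `tail` lines, and `around`
    lines on each side of the first line containing `anchor`."""
    n = len(lines)
    keep = set(range(min(head, n))) | set(range(max(n - tail, 0), n))
    for i, line in enumerate(lines):
        if anchor in line:
            keep |= set(range(max(i - around, 0), min(i + around + 1, n)))
            break  # assumption: only the first match gets a window

    out: List[str] = []
    prev = -1
    for i in sorted(keep):
        if i != prev + 1:
            out.append(gap)  # mark each run of skipped lines once
        out.append(lines[i])
        prev = i
    return out


# A 300-line dummy file with the entry point at index 150.
long_file = [f"line {i}" for i in range(300)]
long_file[150] = "void computeMain(uint3 tid : SV_DispatchThreadID)"
shown = excerpt(long_file)

# A short file fits inside the head and tail windows, so nothing is elided.
short_file = [f"line {i}" for i in range(11)]
shown_short = excerpt(short_file)
```

An 11-line listing such as loop_unroll.slang falls entirely inside the windows, which is why it appears here in full.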
loop_unroll.slang
// AUTO-GENERATED by perf-suite/workloads.py — do not edit by hand.
RWStructuredBuffer<float> outBuf;
[shader("compute")]
[numthreads(1,1,1)]
void computeMain(uint3 tid : SV_DispatchThreadID)
{
    float acc = outBuf[tid.x];
    [ForceUnroll] for (int i = 0; i < 300; ++i)
        acc = acc * 1.0009 + sin(acc + float(i)) * 0.5 - cos(acc * 0.5);
    outBuf[0] = acc;
}
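Since the header says the file is generated by perf-suite/workloads.py, and the caption gives N = 300, the loop bound is presumably a generator parameter. The real generator is not shown here; the sketch below is a hypothetical reconstruction (the function name `loop_unroll_source` and the `@N@` placeholder are invented) that emits the same shader body for a given trip count and omits the header comment.

```python
# Template mirrors the listing above; @N@ is a made-up placeholder token
# used so the Slang braces need no escaping.
_TEMPLATE = """\
RWStructuredBuffer<float> outBuf;
[shader("compute")]
[numthreads(1,1,1)]
void computeMain(uint3 tid : SV_DispatchThreadID)
{
    float acc = outBuf[tid.x];
    [ForceUnroll] for (int i = 0; i < @N@; ++i)
        acc = acc * 1.0009 + sin(acc + float(i)) * 0.5 - cos(acc * 0.5);
    outBuf[0] = acc;
}
"""


def loop_unroll_source(n: int) -> str:
    """Return the loop_unroll workload text with trip count `n`."""
    if n < 1:
        raise ValueError("trip count must be positive")
    return _TEMPLATE.replace("@N@", str(n))


source_300 = loop_unroll_source(300)
```

Varying `n` in a sketch like this is one way to scale the amount of work the [ForceUnroll] loop hands to the compiler while keeping the rest of the shader fixed.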