The compiler can't safely hoist the index computation that calls
Stride(), because it can't tell that Stride() is effectively constant.
We know it won't change, so we can make a slice pointing at that part
of the array instead.
CPU time for updateData goes from 26.35% to 18.65% in my test case.
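A minimal sketch of the idea, not the actual updateData code. The
vertexBuf type, its fields, and both setPositions functions are
hypothetical names made up for illustration; only Stride() comes from
the change described above.

```go
package main

import "fmt"

// vertexBuf is a hypothetical flat vertex buffer: stride floats per vertex.
type vertexBuf struct {
	data   []float32
	stride int
}

// Stride reports how many floats one vertex occupies.
func (b *vertexBuf) Stride() int { return b.stride }

// setPositionsNaive recomputes the offset, including a Stride() call,
// for every attribute it writes.
func setPositionsNaive(b *vertexBuf, xs, ys []float32) {
	for i := range xs {
		b.data[i*b.Stride()+0] = xs[i]
		b.data[i*b.Stride()+1] = ys[i]
	}
}

// setPositionsSliced reads Stride() once and takes one sub-slice per
// vertex, so the writes inside the loop use small constant indices.
func setPositionsSliced(b *vertexBuf, xs, ys []float32) {
	stride := b.Stride()
	for i := range xs {
		v := b.data[i*stride : (i+1)*stride]
		v[0] = xs[i]
		v[1] = ys[i]
	}
}

// newBuf allocates a zeroed buffer for n vertices.
func newBuf(n, stride int) *vertexBuf {
	return &vertexBuf{data: make([]float32, n*stride), stride: stride}
}

func main() {
	xs, ys := []float32{1, 2, 3}, []float32{4, 5, 6}
	a, b := newBuf(3, 4), newBuf(3, 4)
	setPositionsNaive(a, xs, ys)
	setPositionsSliced(b, xs, ys)
	fmt.Println(a.data)
	fmt.Println(b.data)
}
```

Both functions write the same values; the second just makes the
"stride doesn't change" assumption explicit in the code.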
A slice of points means copying every point into the slice, then
copying every point's data from the slice to TrianglesData. An
array of indices lets the compiler make better choices.
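Roughly what the difference looks like, as a sketch. The point and
vertex types and both fan functions are hypothetical stand-ins, and the
triangle-fan fill is only an assumed example of where this pattern
shows up.

```go
package main

import "fmt"

type point struct{ X, Y float64 }
type vertex struct{ Pos point }

// fanWithSlice builds a temporary slice of points for every triangle,
// then copies from that slice into the output.
func fanWithSlice(pts []point, out []vertex) {
	for i, j := 1, 0; i+1 < len(pts); i, j = i+1, j+3 {
		tri := []point{pts[0], pts[i], pts[i+1]}
		for k, p := range tri {
			out[j+k].Pos = p
		}
	}
}

// fanWithIndices keeps only three indices per triangle and copies each
// point straight from the source slice into the output.
func fanWithIndices(pts []point, out []vertex) {
	for i, j := 1, 0; i+1 < len(pts); i, j = i+1, j+3 {
		for k, p := range [...]int{0, i, i + 1} {
			out[j+k].Pos = pts[p]
		}
	}
}

func main() {
	quad := []point{{0, 0}, {1, 0}, {1, 1}, {0, 1}}
	out := make([]vertex, 3*(len(quad)-2))
	fanWithIndices(quad, out)
	fmt.Println(out)
}
```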
For polyline, don't compute each normal twice; when we're going through a line,
the "next" normal for segment N is always the "previous" normal for segment
N+1, and we can compute fewer of them.
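A sketch of the reuse, assuming a simplified polyline where each
interior point needs the normals of the segments on either side of it.
The vec and joint types and the function names are hypothetical; the
call counters exist only to show the saving.

```go
package main

import "fmt"

type vec struct{ X, Y float64 }

// segNormal returns the (unnormalized) left-hand normal of segment a->b.
func segNormal(a, b vec) vec { return vec{-(b.Y - a.Y), b.X - a.X} }

// joint holds the normals of the segments before and after a point.
type joint struct{ prev, next vec }

// jointsNaive computes both normals from scratch at every interior point.
func jointsNaive(pts []vec) (out []joint, calls int) {
	for i := 1; i+1 < len(pts); i++ {
		out = append(out, joint{
			prev: segNormal(pts[i-1], pts[i]),
			next: segNormal(pts[i], pts[i+1]),
		})
		calls += 2
	}
	return out, calls
}

// jointsReuse carries the "next" normal of one joint forward as the
// "previous" normal of the following joint.
func jointsReuse(pts []vec) (out []joint, calls int) {
	if len(pts) < 3 {
		return nil, 0
	}
	prev := segNormal(pts[0], pts[1])
	calls = 1
	for i := 1; i+1 < len(pts); i++ {
		next := segNormal(pts[i], pts[i+1])
		calls++
		out = append(out, joint{prev: prev, next: next})
		prev = next
	}
	return out, calls
}

func main() {
	line := []vec{{0, 0}, {1, 0}, {2, 1}, {3, 1}, {4, 0}}
	_, naive := jointsNaive(line)
	_, reuse := jointsReuse(line)
	fmt.Println("normal computations:", naive, "vs", reuse)
}
```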
For internal operations (anything using getAndClearPoints), there's a
pretty good chance that the operation will repeatedly invoke something
like fillPolygon(), meaning that it needs to push "a few" points
and then invoke something that uses those points.
So we add a slice that holds spare slices of points. On the way out of
each such function, we push the current imd.points (as used inside that
function) onto that stack and set imd.points to [0:0] of the slice the
function was called with.
Performance goes from 11-13fps to 17-18fps on my test case.
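A sketch of the spare-slice stack under stated assumptions. The drawer
type stands in for the real drawing type, and restorePoints is a
hypothetical name for the "on the way out" step; only
getAndClearPoints and the points field are named in the description
above.

```go
package main

import "fmt"

type point struct{ X, Y float64 }

// drawer is a hypothetical stand-in for the immediate-mode drawer.
type drawer struct {
	points []point   // points pushed so far
	spare  [][]point // stack of spare point slices kept for reuse
}

func (d *drawer) push(p point) { d.points = append(d.points, p) }

// getAndClearPoints hands the pushed points to the caller and swaps in
// a spare slice (if there is one) for any nested pushes.
func (d *drawer) getAndClearPoints() []point {
	pts := d.points
	if n := len(d.spare); n > 0 {
		d.points = d.spare[n-1][:0]
		d.spare = d.spare[:n-1]
	} else {
		d.points = nil
	}
	return pts
}

// restorePoints (hypothetical name) stashes the slice used inside the
// function on the stack and resets points to [0:0] of the slice the
// function was called with, keeping both backing arrays alive.
func (d *drawer) restorePoints(pts []point) {
	d.spare = append(d.spare, d.points[:0])
	d.points = pts[:0]
}

func main() {
	d := &drawer{}
	for round := 0; round < 2; round++ {
		d.push(point{0, 0})
		d.push(point{1, 0})
		d.push(point{1, 1})
		pts := d.getAndClearPoints()
		d.push(pts[0]) // nested work pushes "a few" points
		d.restorePoints(pts)
		fmt.Println("spare slices:", len(d.spare), "cap(points):", cap(d.points))
	}
}
```

After the first round, both the outer and the nested slice keep their
capacity, so later rounds can append without reallocating.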
It turns out that affine matrices are much simpler than the 3x3 matrices
they imply, and we can use this to dramatically streamline some code.
For a test program, the applyMatrixAndMask calls in imdraw were calling
matrix.Project() many times, so simplifying matrix.Project alone gave a
nearly 50% frame rate boost.
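To illustrate why the affine form is cheaper, here is a sketch that
projects a point both ways. The affine and mat3 types and the
column-major layout are assumptions for this example, not necessarily
the library's actual definitions.

```go
package main

import "fmt"

// mat3 is a hypothetical column-major 3x3 matrix.
type mat3 [9]float64

// affine is a hypothetical column-major 3x2 matrix: the bottom row of
// the equivalent 3x3 matrix is implicitly (0, 0, 1).
type affine [6]float64

// project on the full 3x3 matrix: nine multiplies plus the divide by w.
func (m mat3) project(x, y float64) (float64, float64) {
	px := m[0]*x + m[3]*y + m[6]
	py := m[1]*x + m[4]*y + m[7]
	w := m[2]*x + m[5]*y + m[8]
	return px / w, py / w
}

// project on the affine matrix: w is always 1, so only the top two
// rows are evaluated and the translation is a plain add.
func (m affine) project(x, y float64) (float64, float64) {
	return m[0]*x + m[2]*y + m[4], m[1]*x + m[3]*y + m[5]
}

func main() {
	full := mat3{2, 0, 0, 0, 3, 0, 10, 20, 1} // scale (2,3), move (10,20)
	aff := affine{2, 0, 0, 3, 10, 20}
	fx, fy := full.project(1, 1)
	ax, ay := aff.project(1, 1)
	fmt.Println(fx, fy, ax, ay)
}
```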
Also modify pixelgl's SetMatrix to copy the six values of a 3x2
Affine into the corresponding locations of a 3x3 matrix.
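The copy might look something like this sketch. The affine and mat3
types, the toMat3 name, and the column-major layout are assumptions for
illustration; this is not the actual SetMatrix body.

```go
package main

import "fmt"

// affine is a hypothetical column-major 3x2 matrix.
type affine [6]float64

// mat3 is a hypothetical column-major 3x3 float32 matrix, as a GPU
// uniform would typically expect.
type mat3 [9]float32

// toMat3 copies the six affine values into the top two rows of each
// column and leaves the bottom row at (0, 0, 1).
func toMat3(m affine) mat3 {
	out := mat3{8: 1}
	for i, j := range [...]int{0, 1, 3, 4, 6, 7} {
		out[j] = float32(m[i])
	}
	return out
}

func main() {
	fmt.Println(toMat3(affine{1, 2, 3, 4, 5, 6}))
}
```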
Removing the call to Alpha(1) and replacing it with an inline definition
produces measurable improvements. Replacing each instance of ZV with
Vec{} further improves things. We keep an inline RGBA because there
are circumstances (mostly when using pictures) where we don't want to
have to set colors to get default behavior.
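For reference, a sketch of the two spellings. The types here are
simplified stand-ins, and Alpha returning RGBA{a, a, a, a} is an
assumption about its behavior; the vertex struct and both function
names are hypothetical.

```go
package main

import "fmt"

type Vec struct{ X, Y float64 }
type RGBA struct{ R, G, B, A float64 }

// ZV is a package-level zero vector, kept here only for comparison.
var ZV = Vec{0, 0}

// Alpha is assumed to return white scaled by the alpha value a.
func Alpha(a float64) RGBA { return RGBA{a, a, a, a} }

// vertex is a hypothetical per-vertex record.
type vertex struct {
	Position Vec
	Color    RGBA
	Picture  Vec
}

// defaultViaCalls builds the default vertex through the function call
// and the package-level variable.
func defaultViaCalls() vertex {
	return vertex{Position: ZV, Color: Alpha(1), Picture: ZV}
}

// defaultInline spells the same value out with composite literals, so
// no call or global load is needed to produce it.
func defaultInline() vertex {
	return vertex{Position: Vec{}, Color: RGBA{1, 1, 1, 1}, Picture: Vec{}}
}

func main() {
	fmt.Println(defaultViaCalls() == defaultInline())
}
```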
For a fairly triangle-heavy test case, this reduces time spent in SetLen
from something over 10% of execution time to around 2.5% of execution
time.