updateData()'s loops checking gt.Len() turns out to have been costing
significant computation, not least because each call then in turn
called gt.vs.Stride().
The computation including a call to Stride() can't be optimized away
safely because the compiler can't tell that Stride() is effectively
constant, but we know it won't change so we can make a slice pointing
at that part of the array.
CPU time for updateData goes from 26.35% to 18.65% in my test case.