On a dual issue processor GCC generates code that saves a couple of
clock cycles per loop if we rearrange things slightly. Checking for
p != end saves a SLTU per loop, moving the increment to the middle can
let it dual issue on multi-issue processors.