The reason hand-written assembly beats the gcc compiler on the older hardware is that you can use the vector operations on the VFP and pipeline them better. Also, if you are doing a lot of pointer walks you can preload the cache with the pld instruction, which the gcc compiler never uses. So you can gain a lot of performance this way, at the cost of extra debugging time. The code base of Space Tripper was written by Miles Visman of Pompom games and he's no slouch at writing very good C/C++ code; it's just that there is so much going on that a 412 MHz CPU clock is not quick enough, and assembly is what makes things possible with this code base. ARM has very good signal processing instructions with true SIMD. You can get big gains here by going into assembly, and can process up to 8 bytes at a time.
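
To make the cache preload idea for pointer walks concrete, here is a minimal sketch in C rather than assembly. It assumes GCC's `__builtin_prefetch` builtin, which on ARM targets that support it is normally lowered to a `pld` instruction. The `struct node` type and the `sum_list` function are hypothetical names made up for illustration; they are not from the Space Tripper code base, and the hand-written assembly described above would look different.

```c
#include <stddef.h>

/* Hypothetical list node, used only for illustration. */
struct node {
    struct node *next;
    int value;
};

/* Walk a linked list and sum the values. Before working on the
 * current node, hint to the cache that the next node will be
 * read soon, so its load can overlap with the current work. */
long sum_list(const struct node *head)
{
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        if (n->next != NULL) {
            /* Arguments: address, 0 = read access, 1 = low temporal locality. */
            __builtin_prefetch(n->next, 0, 1);
        }
        total += n->value;
    }
    return total;
}
```

With so little work per node the hint will not buy much. The gain is likely to show up when each node needs enough processing to hide the memory latency, or when you can prefetch several nodes ahead.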