Today I built the decoder with profiling and debugging information disabled to compare its speed with that of the reference implementation. When building the reference implementation I disabled MMX, so I am comparing C code with C code. Eventually my code can also be sped up using SIMD code.
To measure the speed, I used `time`. The first video I tried is a short clip of just a few seconds. Using the reference implementation:
When using my decoder:
The second video is a longer one. Using the reference implementation:
Using my decoder:
I had a look at what makes the difference. It appears that I save a lot of time because I cache the half-pel interpolated reference frames. The reference implementation does not do this: it recalculates the interpolated frame every time.
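To illustrate the idea, here is a minimal sketch of such a cache in C. It is not the actual code from my decoder: the `RefFrame` struct, the function names and the `interpolations` counter are made up for this example, and plain bilinear averaging stands in for whatever interpolation filter the codec really specifies. The point is only that the upsampled plane is computed once per reference frame and reused afterwards.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reference frame with a lazily built half-pel plane. */
typedef struct {
    int width, height;      /* full-pel dimensions */
    const uint8_t *pixels;  /* width * height full-pel samples */
    uint8_t *halfpel;       /* (2*width) * (2*height) samples, NULL until needed */
    int interpolations;     /* how often the plane was computed (illustration only) */
} RefFrame;

/* Fetch a full-pel sample, clamping coordinates to the frame edges. */
static uint8_t sample(const RefFrame *f, int x, int y)
{
    if (x < 0) x = 0;
    if (y < 0) y = 0;
    if (x >= f->width) x = f->width - 1;
    if (y >= f->height) y = f->height - 1;
    return f->pixels[y * f->width + x];
}

/* Build the upsampled plane. Bilinear averaging is an assumption here,
 * used as a stand-in for the codec's real interpolation filter. */
uint8_t *interpolate_halfpel(const RefFrame *f)
{
    int w2 = 2 * f->width, h2 = 2 * f->height;
    uint8_t *out = malloc((size_t)w2 * (size_t)h2);
    if (out == NULL)
        return NULL;
    for (int y = 0; y < h2; y++) {
        for (int x = 0; x < w2; x++) {
            int x0 = x / 2, y0 = y / 2;
            int x1 = x0 + (x & 1), y1 = y0 + (y & 1);
            int sum = sample(f, x0, y0) + sample(f, x1, y0)
                    + sample(f, x0, y1) + sample(f, x1, y1);
            out[y * w2 + x] = (uint8_t)((sum + 2) / 4);
        }
    }
    return out;
}

/* Return the cached plane, computing it only on first use. */
const uint8_t *get_halfpel(RefFrame *f)
{
    assert(f != NULL && f->pixels != NULL);
    if (f->halfpel == NULL) {
        f->halfpel = interpolate_halfpel(f);
        if (f->halfpel != NULL)
            f->interpolations++;
    }
    return f->halfpel;
}

/* Drop the cached plane once the frame is no longer used as a reference. */
void release_halfpel(RefFrame *f)
{
    free(f->halfpel);
    f->halfpel = NULL;
}
```

With this layout, motion compensation would call `get_halfpel()` for every block it predicts, but the expensive interpolation runs only the first time for each reference frame. The cost is memory: the cached plane is four times the size of the full-pel frame.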