I do have some PPro mods, and they appear to help performance on
average. The PPro is a really weird creature (like the K6). The
darned processor does so much optimization, it appears to be
insensitive to code mods. There are areas of reasonable payoffs,
and lots of "obvious" optimizations that end up being neutral.
I was working on optimizing assembly code for the PPro a year ago.
My experience was that modifying code seldom mattered, except for
alignment. Making the tight loops start on 16-byte boundaries roughly
doubled the speed.
No other approach made a significant difference, AFAIR (I just
supplied ideas and had another programmer implement them).
All pairing happened automatically, and touching the cache to make it
pre-fetch etc. didn't help at all.
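
For anyone wondering what "hitting a 16-byte boundary" means in terms of
arithmetic, here is a minimal sketch in C. It only computes how many
filler bytes would have to go in front of a loop top to push it onto the
next 16-byte boundary. The function names padding_for and align_up are
made up for illustration and are not from the code discussed above. In
practice you would let the assembler do this with its alignment directive
(e.g. `.p2align 4` in GNU as, or `ALIGN 16` in MASM/NASM).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of padding bytes needed to move addr up
   to the next multiple of alignment. alignment must be a power of two
   (16 for the case discussed here). Returns 0 if already aligned. */
uintptr_t padding_for(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    uintptr_t mask = alignment - 1;
    return (alignment - (addr & mask)) & mask;
}

/* Hypothetical helper: the address the loop top ends up at once the
   padding has been inserted in front of it. */
uintptr_t align_up(uintptr_t addr, uintptr_t alignment)
{
    return addr + padding_for(addr, alignment);
}
```

So a loop that would otherwise start at 0x1003 needs 13 bytes of filler
to start at 0x1010, while one already at 0x1000 needs none.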