atom feed30 messages in org.python.python-dev[Python-Dev] String hash function mul...
FromSent OnAttachments
Raymond HettingerApr 13, 2004 8:00 pm 
Jeff EplerApr 13, 2004 9:10 pm 
Bob IppolitoApr 13, 2004 9:26 pm 
Jeff EplerApr 13, 2004 10:04 pm 
Raymond HettingerApr 13, 2004 10:17 pm 
Jeff EplerApr 13, 2004 11:10 pm 
Guido van RossumApr 13, 2004 11:26 pm 
Tim PetersApr 13, 2004 11:56 pm 
Jeff EplerApr 14, 2004 9:08 am 
Raymond HettingerApr 14, 2004 12:06 pm 
Andrew MacIntyreApr 14, 2004 3:23 pm 
Jeff EplerApr 14, 2004 3:35 pm 
Mike PallApr 14, 2004 5:50 pm 
Tim PetersApr 14, 2004 11:14 pm 
Michael HudsonApr 15, 2004 7:05 am 
Mike PallApr 15, 2004 9:36 am 
Guido van RossumApr 15, 2004 10:27 am 
Jeremy HyltonApr 15, 2004 10:38 am 
Guido van RossumApr 15, 2004 10:42 am 
Mike PallApr 15, 2004 11:56 am 
Mike PallApr 15, 2004 11:56 am 
Skip MontanaroApr 15, 2004 11:59 am 
Michael HudsonApr 15, 2004 1:27 pm 
Raymond HettingerApr 15, 2004 2:22 pm 
Thomas HellerApr 15, 2004 2:31 pm 
"Martin v. Löwis"Apr 15, 2004 3:07 pm 
Jeremy HyltonApr 15, 2004 11:26 pm 
Tim PetersApr 16, 2004 12:18 am 
"Martin v. Löwis"Apr 16, 2004 2:00 am 
Andrew MacIntyreApr 16, 2004 9:14 pm 
Subject:[Python-Dev] String hash function multiplier
From:Tim Peters (
Date:Apr 14, 2004 11:14:05 pm


On the P4, the documented latency went up from 4 cycles to 14 cycles while shifts and adds went down to 0.5 cycles and 1 cycle respectively. Timings confirm the result.

What makes the current stepping of some flavor of P4 overwhelmingly important <0.3 wink>? Tradeoffs at this level change with every chip generation, and within a generation vary across vendors. In particular, they can make int multiply as fast as they want just by throwing more highly regular silicon at it (which is why its relative performance varies so much across chip generations, and is usually worst in the earliest iterations of a new generation).

My overwhelming concern isn't micro-efficiency of the generated code for the string-hash loop, it's that the resulting hash have high quality. I don't know that the "little prime" is vulnerable to bad cases in non-contrived real life, but we have years of evidence suggesting that the "big prime" isn't. If we can develop evidence that the little prime doesn't either, great, then I lose all objections to using it.

It looks like the best bet is to try to speedup the code without changing the multiplier. Intel's software optimization cookbook recommends a partial unrolling and elimination of data dependencies so that a second multiply can start 4 cycles after the previous one started. If practice bears out the theory, the timings could show a three or fourfold speedup without changing the multiplier.

Or leave it alone. Hand unrolling a loop all but guarantees that the next generation of compiler optimizer won't be able to reverse-engineer the original "natural" loop structure, and so won't be able to do the code transformation that works best for the next generation of hardware tricks. IOW, once you start down the path of hand-optimizing for a specific compiler + HW combo, there are two outcomes: (1) someone makes this their job forever after; or, (2) the performance of the code actually gets worse over time (compared to what it would have been if the original code had been left dirt simple).

I happen to have an app (spambayes) that makes very heavy use of string-keyed dicts, and where interning is impractical (so the string hashes can't be avoided). Still, speed up string hashing by a factor of 10, and it won't make enough difference in that app's throughput to mention. I also have a hyper-threaded box today, and the imul stalls in this P4 give great opportunities for other processes to get work done <0.9 wink>.