Raymond Hettinger | Apr 13, 2004 8:00 pm
Jeff Epler | Apr 13, 2004 9:10 pm
Bob Ippolito | Apr 13, 2004 9:26 pm
Jeff Epler | Apr 13, 2004 10:04 pm
Raymond Hettinger | Apr 13, 2004 10:17 pm
Jeff Epler | Apr 13, 2004 11:10 pm
Guido van Rossum | Apr 13, 2004 11:26 pm
Tim Peters | Apr 13, 2004 11:56 pm
Jeff Epler | Apr 14, 2004 9:08 am
Raymond Hettinger | Apr 14, 2004 12:06 pm
Andrew MacIntyre | Apr 14, 2004 3:23 pm
Jeff Epler | Apr 14, 2004 3:35 pm
Mike Pall | Apr 14, 2004 5:50 pm
Tim Peters | Apr 14, 2004 11:14 pm
Michael Hudson | Apr 15, 2004 7:05 am
Mike Pall | Apr 15, 2004 9:36 am
Guido van Rossum | Apr 15, 2004 10:27 am
Jeremy Hylton | Apr 15, 2004 10:38 am
Guido van Rossum | Apr 15, 2004 10:42 am
Mike Pall | Apr 15, 2004 11:56 am
Skip Montanaro | Apr 15, 2004 11:59 am
Michael Hudson | Apr 15, 2004 1:27 pm
Raymond Hettinger | Apr 15, 2004 2:22 pm
Thomas Heller | Apr 15, 2004 2:31 pm
Martin v. Löwis | Apr 15, 2004 3:07 pm
Jeremy Hylton | Apr 15, 2004 11:26 pm
Tim Peters | Apr 16, 2004 12:18 am
Martin v. Löwis | Apr 16, 2004 2:00 am
Andrew MacIntyre | Apr 16, 2004 9:14 pm
Subject: [Python-Dev] String hash function multiplier
From: Tim Peters (tim_...@email.msn.com)
Date: Apr 14, 2004 11:14:05 pm
> On the P4, the documented latency went up from 4 cycles to 14 cycles while shifts and adds went down to 0.5 cycles and 1 cycle respectively. Timings confirm the result.
What makes the current stepping of some flavor of P4 overwhelmingly important <0.3 wink>? Tradeoffs at this level change with every chip generation, and within a generation vary across vendors. In particular, they can make int multiply as fast as they want just by throwing more highly regular silicon at it (which is why its relative performance varies so much across chip generations, and is usually worst in the earliest iterations of a new generation).
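[The shift/add tradeoff above can be made concrete. The appeal of a small multiplier is that a compiler can replace the `imul` with a short chain of shifts and adds. This excerpt doesn't name the candidate primes, so the sketch below uses 65599, a classic small hash multiplier, purely as an illustration:]

```python
# Illustration only: 65599 is a well-known small hash multiplier
# (65599 == 2**16 + 2**6 - 1); the thread's actual "little prime"
# candidate isn't named in this excerpt.
x = 0x12345678
m = 65599

# A compiler can turn x*65599 into two shifts, an add, and a subtract,
# each far cheaper than an imul on the P4 stepping being discussed.
assert x * m == (x << 16) + (x << 6) - x
```

[A large multiplier like 1000003 has no comparably short shift/add decomposition, so it costs a real multiply per character; that is the whole micro-efficiency argument in a nutshell.]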
My overwhelming concern isn't micro-efficiency of the generated code for the string-hash loop, it's that the resulting hash have high quality. I don't know that the "little prime" is vulnerable to bad cases in non-contrived real life, but we have years of evidence suggesting that the "big prime" isn't. If we can develop evidence that the little prime isn't either, great, then I lose all objections to using it.
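[For readers without the C source at hand, here is a sketch of the string-hash loop under discussion, in Python rather than the C of CPython's `string_hash`. The multiplier 1000003 is the "big prime" CPython used at the time; the 32-bit masking is an assumption to mimic C unsigned arithmetic, and details such as the reserved -1 error value are simplified:]

```python
def string_hash(s, multiplier=1000003, mask=(1 << 32) - 1):
    """Sketch of CPython 2.x-era string hashing with a pluggable
    multiplier (simplified; 32-bit wraparound simulated via mask)."""
    if not s:
        return 0
    x = (ord(s[0]) << 7) & mask          # seed from the first character
    for ch in s:
        x = ((x * multiplier) ^ ord(ch)) & mask
    return x ^ len(s)                    # mix in the length at the end

# The multiplier is the only knob being debated: swapping in a small
# prime changes every hash value, so its collision behavior on real
# key sets is what needs evidence.
print(string_hash("python"))
print(string_hash("python", multiplier=65599))
```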
> It looks like the best bet is to try to speed up the code without changing the multiplier. Intel's software optimization cookbook recommends a partial unrolling and elimination of data dependencies so that a second multiply can start 4 cycles after the previous one started. If practice bears out the theory, the timings could show a three- or fourfold speedup without changing the multiplier.
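[The dependency-breaking idea can be illustrated, with one caveat: CPython's hash combines with XOR, which does not factor through multiplication, so an exact two-characters-at-a-time unroll only works out algebraically with an additive combiner. The sketch below is my own construction, not code from the thread; it shows the algebra that lets two multiplies proceed with no data dependency between them:]

```python
M = 1000003
MASK = (1 << 64) - 1  # simulate 64-bit wraparound

def hash_naive(data):
    # One long dependency chain: each multiply waits on the previous one.
    x = 0
    for byte in data:
        x = (x * M + byte) & MASK
    return x

def hash_unrolled(data):
    # (x*M + c0)*M + c1 == x*M**2 + c0*M + c1, so the multiplies
    # x*M**2 and c0*M are independent and can overlap in a pipelined
    # CPU (in the C version; Python itself won't see the benefit).
    m2 = (M * M) & MASK
    x, i = 0, 0
    while i + 2 <= len(data):
        x = (x * m2 + data[i] * M + data[i + 1]) & MASK
        i += 2
    while i < len(data):          # leftover byte for odd lengths
        x = (x * M + data[i]) & MASK
        i += 1
    return x

assert hash_naive(b"string hash multiplier") == hash_unrolled(b"string hash multiplier")
```

[The transformation preserves the hash value exactly for the additive combiner; with CPython's XOR combiner the unrolled variant would compute a different hash, which is part of why the change was contentious.]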
Or leave it alone. Hand unrolling a loop all but guarantees that the next generation of compiler optimizer won't be able to reverse-engineer the original "natural" loop structure, and so won't be able to do the code transformation that works best for the next generation of hardware tricks. IOW, once you start down the path of hand-optimizing for a specific compiler + HW combo, there are two outcomes: (1) someone makes this their job forever after; or, (2) the performance of the code actually gets worse over time (compared to what it would have been if the original code had been left dirt simple).
I happen to have an app (spambayes) that makes very heavy use of string-keyed dicts, and where interning is impractical (so the string hashes can't be avoided). Still, speed up string hashing by a factor of 10, and it won't make enough difference in that app's throughput to mention. I also have a hyper-threaded box today, and the imul stalls in this P4 give great opportunities for other processes to get work done <0.9 wink>.