|Raymond Hettinger||Apr 13, 2004 8:00 pm|
|Jeff Epler||Apr 13, 2004 9:10 pm|
|Bob Ippolito||Apr 13, 2004 9:26 pm|
|Jeff Epler||Apr 13, 2004 10:04 pm|
|Raymond Hettinger||Apr 13, 2004 10:17 pm|
|Jeff Epler||Apr 13, 2004 11:10 pm|
|Guido van Rossum||Apr 13, 2004 11:26 pm|
|Tim Peters||Apr 13, 2004 11:56 pm|
|Jeff Epler||Apr 14, 2004 9:08 am|
|Raymond Hettinger||Apr 14, 2004 12:06 pm|
|Andrew MacIntyre||Apr 14, 2004 3:23 pm|
|Jeff Epler||Apr 14, 2004 3:35 pm|
|Mike Pall||Apr 14, 2004 5:50 pm|
|Tim Peters||Apr 14, 2004 11:14 pm|
|Michael Hudson||Apr 15, 2004 7:05 am|
|Mike Pall||Apr 15, 2004 9:36 am|
|Guido van Rossum||Apr 15, 2004 10:27 am|
|Jeremy Hylton||Apr 15, 2004 10:38 am|
|Guido van Rossum||Apr 15, 2004 10:42 am|
|Mike Pall||Apr 15, 2004 11:56 am|
|Mike Pall||Apr 15, 2004 11:56 am|
|Skip Montanaro||Apr 15, 2004 11:59 am|
|Michael Hudson||Apr 15, 2004 1:27 pm|
|Raymond Hettinger||Apr 15, 2004 2:22 pm|
|Thomas Heller||Apr 15, 2004 2:31 pm|
|"Martin v. Löwis"||Apr 15, 2004 3:07 pm|
|Jeremy Hylton||Apr 15, 2004 11:26 pm|
|Tim Peters||Apr 16, 2004 12:18 am|
|"Martin v. Löwis"||Apr 16, 2004 2:00 am|
|Andrew MacIntyre||Apr 16, 2004 9:14 pm|
|Subject:||[Python-Dev] Optimization targets (was: String hash function multiplier)|
|From:||Mike Pall (mike...@mike.de)|
|Date:||Apr 14, 2004 5:50:24 pm|
It looks like the best bet is to try to speedup the code without changing the multiplier.
Indeed. And while you are at it, there are other optimizations, that seem more promising:
I compiled a recent CVS Python with profiling and here is a list of the top CPU hogs (on a Pentium III, your mileage may vary):
CPU% Function Name ---------------------------- 55.44 eval_frame 7.30 lookdict_string 4.34 PyFrame_New 3.73 frame_dealloc 1.73 vgetargs1 1.65 PyDict_SetItem 1.42 string_richcompare 1.15 PyObject_GC_UnTrack 1.11 PyObject_RichCompare 1.08 PyInt_FromLong 1.08 tupledealloc 1.04 insertdict
CPU% Function Name ---------------------------- 23.65 eval_frame 8.68 l_divmod 4.43 lookdict_string 2.95 k_mul 2.27 PyType_IsSubtype 2.23 PyObject_Malloc 2.09 x_add 2.05 PyObject_Free 2.05 tupledealloc
Arguably parrotbench is a bit unrepresentative here. And beware: due to automatic inlining of static functions the real offender may be hidden (x_divmod is the hog, not l_divmod).
Anyway, this just confirms that the most important optimization targets are: 1. eval_frame 2. string keyed dictionaries 3. frame handling
I think 3. needs an algorithmic approach (there was some discussion about it a few weeks ago), while 1. and 2. look like a code generation issue.
So I took a look at the machine code that GCC generates on x86 with -O3 for these: uh oh, not a pretty sight. The main problems are bad branch predictions and lack of inlining.
About branch predictions:
The worst offender is the main code path for eval_frame: it gets split across half a dozen segments. The code path for the non-argument bytecodes is in fact the slowest.
Similarly the code for lookdict_string branches around like crazy. The most likely code path is not the fastest path.
One solution is to use likely()/unlikely() macros (using __builtin_expect). This is a good solution for small, isolated and performance critical code. It does not work well for eval_frame, though (I tried).
But GCC has more to offer: read the man page entries for -fprofile-arcs and -fbranch-probabilities. Here is a short recipe:
Go to your Python source directory and do this:
$ mkdir build_profile $ cd build_profile $ ../configure # add any options you may need $ make $ mv python python_orig
Edit the Makefile and add -fprofile-arcs to OPT.
$ rm Python/ceval.o $ make $ mv python python_profile
Run your favourite benchmark(s), but only invoke python *once*:
$ ./python_profile -c 'import test.pystone; test.pystone.main(loops=100000)'
Forget about the performance numbers the benchmark reports. Never use an executable compiled with profiling for comparison. But ... you should now have a new (binary) file called Python/ceval.da that contains the profiled branch probabilities.
Edit the Makefile and replace -fprofile-arcs with -fbranch-probabilities
$ rm Python/ceval.o $ make $ mv python python_opt
Then compare the benchmarks:
$ ./python_orig -c 'import test.pystone; test.pystone.main(loops=100000)' $ ./python_opt -c 'import test.pystone; test.pystone.main(loops=100000)'
On my machine I get a 10-15% speedup. But we only optimized ceval.c ...
So repeat the above steps, but now delete Python/ceval.o and Objects/*.o each time. Now I get about 20-25% speedup for pystone and 10% speedup for parrotbench! Not bad, huh?
Now the bad news: I don't know how to integrate this into the regular build process. So this is not an option for everyone, but ambitious packagers might want to take the trouble to do this by hand.
Oh and of course neither pystone nor parrotbench are representative of the branch probabilities of any particular application. But for predicting eval_frame and lookdict_string, they are probably good enough.
On a related note: GCC uses random branch probabilities with -O, when no probability information is present (no __builtin_expect or *.da files). Use -fno-guess-branch-probability if you want predictable timings on recompiles.
- The opcode decoding with NEXTOP/HAS_ARG/NEXTARG does not compile well. GCC has trouble inferring the lifetimes for opcode and oparg, too. Ideas: - Use a special inline assembly variant for x86. OR - Move NEXTARG to each opcode that needs it and make oparg a local block variable. - Use NEXTOP directly for the switch and refetch opcode into a local block variable where necessary.
- The Py_Ticker check and the tracing support should be moved out of the core loop. The Py_Ticker check may be put into a subroutine and called from selected opcodes only (haven't checked if that works). I have no (good) idea about how to get rid of the tracing support, though.
- _PyString_Eq() is a good candidate to inline for lookdict_string(). I.e. a new macro in stringobject.h. But wait ...
- GCC generates pretty slow code for _PyString_Eq() (at least on x86). This is mainly due to a bad branch prediction (we should use likely()) but also due to the penalty for 8-bit partial register load/compare generated by the ob_sval dereferences. The latter is completely useless because memcmp() is inlined (with repz cmpsb) and duplicates that comparison. Omitting the *ob_sval comparison is faster.
- Create a macro that inlines the first part of PyObject_Hash(). Useful for PyDict_SetItem() and others.
I bet there are more, but I'm running out of time right now -- sorry.