Subject: [Python-Dev] Optimization targets
From: Mike Pall
Date: Apr 15, 2004 9:36:32 am


mwh wrote:

> (x_divmod is the hog, not l_divmod).

> Probably a fine candidate function for rewriting in assembly too...

As a data point: I once had the dubious pleasure of writing a long-integer library for cryptography. Hand-crafted x86 assembler outperforms plain (but carefully optimized) C code by a factor of 2 to 3.

But Python's long-int code is a lot slower than e.g. gmp (by a factor of 15-25 for mul/div and a factor of 100 for modular exponentiation).

I assume the difference between C and assembler is less pronounced with other processors.
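A rough way to take this kind of measurement for CPython's own long integers is sketched below. It only times the interpreter's built-in arithmetic; it does not involve gmp, and the function name `time_bigint_ops`, the operand size and the repeat counts are arbitrary choices made for illustration.

```python
import random
import timeit


def time_bigint_ops(bits=2048, repeats=200):
    """Return seconds per operation for mul, divmod and modular pow."""
    rng = random.Random(12345)  # fixed seed so that runs are comparable
    top = 1 << (bits - 1)       # forces every operand to the full bit length
    a = rng.getrandbits(bits) | top
    b = rng.getrandbits(bits) | top
    m = rng.getrandbits(bits) | top | 1  # odd modulus, as is usual in crypto
    product = a * b

    ops = {
        "mul": lambda: a * b,
        "divmod": lambda: divmod(product, b),
        "modexp": lambda: pow(a, b, m),
    }
    results = {}
    for name, fn in ops.items():
        # Modular exponentiation is far slower, so run it fewer times.
        n = max(1, repeats // 20) if name == "modexp" else repeats
        results[name] = timeit.timeit(fn, number=n) / n
    return results


if __name__ == "__main__":
    for name, seconds in time_bigint_ops().items():
        print(f"{name:8s} {seconds * 1e6:10.2f} us per call")
```

Running the same operand sizes through another library would give the ratio; the absolute numbers depend heavily on the machine and the Python version.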

The register pressure issue may soon be a moot point with x86-64, though. It has been shown that 64-bit pointers slow things down a bit, but compilers just love the extra registers (R8-R15).

> > But GCC has more to offer: read the man page entries for -fprofile-arcs and -fbranch-probabilities. Here is a short recipe:

> I tried this on the ibook and I found that it made a small difference *on the program you ran to generate the profile data* (e.g. pystone), but made naff all difference for something else. I can well believe that it makes more difference on a P4 or G5.

For x86 even profiling python -c 'pass' makes a major difference. And the speed-ups are applicable to almost any program, since the branch predictions for eval_frame and lookdict_string affect all Python programs.
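To check a claim like this on a given machine, one could time the same small workload under two interpreter builds. The helper below is only a sketch: the name `time_workload` is made up for illustration, and any paths to a plain and a profile-trained build would have to be supplied by the reader. By default it times the running interpreter on `pass`.

```python
import subprocess
import sys
import time


def time_workload(python=sys.executable, code="pass", runs=10):
    """Return the best wall-clock time in seconds for `python -c code`."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the interpreter exits with an error.
        subprocess.run([python, "-c", code], check=True)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    print(f"{sys.executable}: {time_workload() * 1000:.1f} ms")
```

Taking the minimum over several runs reduces noise from the rest of the system; for a workload this small, process start-up dominates, so a longer `code` string is needed to see differences in the interpreter loop itself.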

I'm currently engaged in a private e-mail conversation with Raymond on how to convince GCC to generate good code on x86 without the help of profiling.

> I wrote a rant about improving Python's performance, which I've finally got around to uploading:

> Tell me what you think!

About GC: yes, refcounting is the silent killer. But there's a lot to optimize even without discarding refcounting. E.g. the code generated for Py_DECREF is awful (spread across >3500 locations) and PyObject_GC_UnTrack needs some work, too.
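For readers who want to see how often reference counts move, the following sketch observes them from Python with `sys.getrefcount`. It is specific to CPython, and the function name `refcount_changes` is chosen here for illustration. Each increase shown has a matching decrement later, and each of those decrements is a Py_DECREF at the C level.

```python
import sys


def refcount_changes():
    """Return refcount deltas for a fresh object on CPython."""
    obj = object()
    base = sys.getrefcount(obj)      # includes a temporary for the argument

    alias = obj                      # a second name adds one reference
    with_alias = sys.getrefcount(obj) - base

    container = [obj, obj]           # each list slot adds one more
    with_container = sys.getrefcount(obj) - base

    del alias                        # dropping the name removes one reference
    container.clear()                # emptying the list removes two
    final = sys.getrefcount(obj) - base

    return with_alias, with_container, final


if __name__ == "__main__":
    print(refcount_changes())
```

Even this trivial function causes several increments and decrements on a single object, which gives a feel for why the code emitted for each decrement matters.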

About psyco: I think it's wonderful. But, you are right, nobody is using it. Why? Simple: It's not 'on' by default.

About type inference: maybe the right way to go with Python is lazy and pure runtime type inference? Close to what psyco does.
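As a toy illustration of the idea (and not a description of how psyco works internally), the sketch below records the argument types seen at call time and picks an implementation on first use. All names here (`specialize`, `add_generic`, `add_ints`) are hypothetical, and the "fast path" is an ordinary Python function standing in for specialized code.

```python
def specialize(generic, fast_paths):
    """Wrap `generic`, choosing an implementation from observed argument types."""
    seen = {}  # maps a tuple of argument types to the chosen implementation

    def wrapper(*args):
        key = tuple(type(a) for a in args)
        impl = seen.get(key)
        if impl is None:
            # First call with these types: use a specialized version if one
            # was registered, otherwise fall back to the generic code.
            impl = fast_paths.get(key, generic)
            seen[key] = impl
        return impl(*args)

    wrapper.seen = seen  # exposed so the recorded types can be inspected
    return wrapper


def add_generic(a, b):
    return a + b


def add_ints(a, b):
    # Stand-in for code that may assume both arguments are ints.
    return int.__add__(a, b)


add = specialize(add_generic, {(int, int): add_ints})

if __name__ == "__main__":
    print(add(2, 3), add("a", "b"))
    print(sorted(t[0].__name__ for t in add.seen))
```

Nothing is inferred ahead of time: the types are learned lazily from real calls, so the generic path remains available for anything not yet seen.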

About type declarations: this is too contrary to the Pythonic way of thinking. And before we start to implement this, we should make sure that it's a lot faster than a pure dynamic type inferencing approach.

About PyPy: very interesting, will take a closer look. But there's still a long road ahead ...

Bye, Mike