US Trip Report: POPL, Microsoft, IBM
Some notes from my recent trip (from the 23rd of January to the 17th of February) to the US, where I presented PyPy at various scientifically oriented places. In summary, there seems to be quite a bit of interest in PyPy within the research community; details below.
PEPM/POPL/STOP
From the 24th to the 29th of January I was in Austin, Texas at the POPL conference, where I gave a talk at one of the workshops, PEPM (Partial Evaluation and Program Manipulation). The title of our paper is "Allocation Removal by Partial Evaluation in a Tracing JIT", the abstract is:
The performance of many dynamic language implementations suffers from high allocation rates and runtime type checks. This makes dynamic languages less applicable to purely algorithmic problems, despite their growing popularity. In this paper we present a simple compiler optimization based on online partial evaluation to remove object allocations and runtime type checks in the context of a tracing JIT. We evaluate the optimization using a Python VM and find that it gives good results for all our (real-life) benchmarks.
The talk (slides) seemed to be well-received and there was a good discussion afterwards. PEPM in general was a very enjoyable workshop with many interesting talks on partial evaluation (which I am very interested in) and a great keynote by Olivier Danvy about "A Walk in the Semantic Park".
POPL itself was a bit outside of the area I am most knowledgeable in, most of the talks being on formal topics. Some of the talks that stuck in my mind:
- "The Design of Kodu: A Tiny Visual Programming Language for Children on the Xbox 360", the keynote by Matthew MacLaurin from Microsoft Research. I didn't know about Kodu before, and was very impressed by it.
- "Automating String Processing in Spreadsheets using Input-Output Examples" (paper) by Sumit Gulwani (also from MS Research) describes a plugin to Excel that can automate many common string processing tasks by giving a couple of examples, which are then abstracted into a generic string manipulation. Very cool.
- "Dynamic Inference of Static Types for Ruby" (paper) by Michael Furr, Jong-hoon (David) An, Jeffrey S. Foster and Michael Hicks describes an approach to type inference that works by observing the actual types seen during unit-testing. Similar things have been done a few times before, however, the paper actually gives a correctness result.
- "The Essence of Compiling with Traces" (paper) by Shu-Yu Guo and Jens Palsberg describes a formalization of a simple imperative language and proves that executing it using trace compilation will do exactly the same thing than using an interpreter. It also looks at what conditions an optimization on traces must fulfill to still produce valid results.
After the main conference, I took part in the STOP (Scripts to Programs) workshop. It had a great keynote "Scripting in a Concurrent World" by John Field about the Thorn language and a few interesting other talks.
Microsoft Research
After POPL I went to Redmond to visit Microsoft Research for a week, specifically the RiSE group. This is the group that did the SPUR project, a meta-tracing JIT for C# applied to a JavaScript interpreter in C#. I compared PyPy to SPUR last year. I am very grateful to Microsoft for inviting me there.
At Microsoft I gave a talk about "PyPy's Approach to Implementing Dynamic Languages Using a Tracing JIT Compiler", the slides of which can be found here. The talk was filmed and is online. People seemed to be impressed with the "product qualities" of PyPy, e.g. the buildbot infrastructure and speed tracking website.
The rest of the time I discussed with various researchers in the RiSE group, particularly with Nikolai Tillmann. We talked a lot about similarities and differences between SPUR and PyPy and tried to understand our respective projects better. SPUR is a really great project and I learned a lot in the discussions, for example about the optimizations and heuristics their trace compiler uses.
Another very cool project done by the RiSE group that I learned more about is PEX. PEX is a unit test generator for C# that tries to produce unit tests for so-far untested execution paths within methods. There is an online puzzle version of it, if you want to get an impression of the technology (including a very impressive C# IDE in the browser).
IBM
For the last part of the trip I stayed in New York City for two weeks, mostly as a vacation. However, I also visited IBM Watson Research Center for two days, to which I had been invited by David Edelsohn.
The first day I gave the same presentation I had given at Microsoft (with some improvements to the slides), again it was quite well received. The rest of the time I spent in (very fruitful) discussions with various people and teams, among them the Liquid Metal team and the Thorn team.
The second day I met with members of the FIORANO group, who are working on dynamic compilation for dynamic languages and Java. They explored various ways to speed up Python, both by improving the CPython interpreter as well as with JIT compilation techniques.
Another of their projects is to add a trace compiler to IBM's J9 JVM, about which the paper "A Trace-based Java JIT Compiler Retrofitted from a Method-based Compiler" is going to appear at CGO. I discussed tracing JITs with Peng Wu, one of the authors of that paper. Peng tries to systematically look at the various heuristics found in the different VMs that use tracing JITs. This is a very different perspective from the one I usually have, focusing on how to improve PyPy's specific heuristics. Therefore that discussion helped me think about the issues more generally.
Another goal of the group is to try to find benchmarks that are representative for typical Python workloads, which is something that has been done very carefully for Java e.g. when developing the DaCapo benchmark suite. The benchmarks that the Python community uses have not been selected in such a careful and measured way, so I think that trying to be more systematic there is a very worthwhile endeavour.
PyPy Winter Sprint Report
A few weeks ago I had the great fortune to attend the PyPy winter sprint in Leysin, Switzerland. I've wanted to contribute to PyPy for a long time and I thought diving into a sprint might be a good way to get familiar with some of the code. What I wasn't expecting was to be using RPython to implement new methods on built-in Python objects on the first day. The main thing I took away from the sprint was just how easy it is to get involved in developing PyPy (well, some bits of it at least; being surrounded by core developers helps). I wrote up a very short description of how to get started here, but I'll do a longer blog post with examples on my own blog soon(ish).
The sprint was kicked off by Armin merging the "fast-forward" branch of PyPy onto trunk. "fast-forward" brings PyPy from Python 2.5 compatibility to Python 2.7. Along with this it brought a large number of test failures, as the sterling work done by Benjamin Peterson and Amaury Forgeot d'Arc was not complete. This immediately set the primary sprint goal to reduce the number of test failures.
We made a great deal of progress on this front, and you can see how close PyPy is now from the buildbots.
Jacob Hallén and I started working through the list of tests with failures alphabetically. We made short work of test_asyncore and moved onto test_bytes where I was stuck for the rest of the sprint. I spent much of the remaining days working with Laura Creighton on the pypy bytearray implementation to make it more compatible with Python 2.7. This meant adding new methods, changing some of the Python protocol method implementations and even changing the way that bytearray is constructed. All in all great fun and a great introduction to working with RPython.
A big part of the compatibility with Python 2.7 work was done by Laura and Armin who basically rewrote the math module from scratch. This was needed to incorporate all the improvements made (mostly by Mark Dickinson) in CPython in 2.7. That involved a lot of head-scratching about such subtleties as whether -0.0 should be considered almost equal to 0.0 and other fun problems.
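To give a flavour of the subtlety, here is a tiny illustration of mine (plain Python, nothing PyPy-specific):

import math

print 0.0 == -0.0                 # True: the two zeros compare equal...
print math.copysign(1.0, -0.0)    # -1.0: ...but the sign of -0.0 is observable
print math.atan2(-0.0, -1.0)      # -pi, while atan2(0.0, -1.0) gives +pi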
The first meal together, before everyone had arrived
View of the mountains from the sprint
Working on 2.7 compatibility wasn't the only work that happened during the sprint. Other activities included:
- Antonio Cuni worked on the "jittypes" branch. This is a reimplementation of the core of the PyPy ctypes code to make it jittable. The goal is that for common cases the jit should be able to turn ctypes calls from Python into direct C-level calls. This work was not completed but is very close, and is great for the future of integrating C libraries with PyPy. As ctypes is also available in CPython and IronPython, and hopefully will be available in Jython soon, integrating C code with Python through ctypes is the most "implementation portable" technique (a small illustration of such a call follows this list).
- David Schneider continued his work on the JIT backend for ARM. PyPy has been cross-compilable to ARM for a long time, but bringing the JIT to ARM will provide a *fast* PyPy for ARM, which includes platforms like Android. Again David didn't complete this work but did complete the float support.
- Håkan Ardo was present for two days and continued his crazy-clever work on JIT optimisations, some of which are described in the Loop invariant code motion blog entry.
- Holger Krekel worked on updating the PyPy test suite to the latest version of py.test and also worked with me on the interminable bytearray changes for part of the sprint.
- No one was sure what Maciej Fijałkowski worked on but he seemed to be quite busy.
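As an illustration of the kind of call the "jittypes" work aims to speed up, here is a plain ctypes call (a minimal sketch of mine; the library lookup is platform-specific and nothing here is jittypes-specific API):

from ctypes import CDLL, c_double
from ctypes.util import find_library

# load the C math library; find_library may return None on some platforms
libm = CDLL(find_library("m"))
libm.sqrt.restype = c_double
libm.sqrt.argtypes = [c_double]

# today this goes through generic ctypes dispatch; a jittable ctypes core
# would let the JIT turn it into a direct C-level call
print libm.sqrt(2.0)   # about 1.4142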
"There was also a great deal of healthy discussion about the future of PyPy."
World domination?
> world domination?
why yes of course! the ouroboros is their symbol; PyPy is, evidently, backed by the templars
> world domination?
Mongol General: Pypy devs! What is best in life?
Pypy dev: To crush your enemies, see them driven before you, and to hear the lamentation of their women.
Mongol General: That is good! That is good.
@Anonymous: Let's not get too far off-track. Also, I don't really like being ascribed a rather violent quote by (supposedly) Genghis Khan, so stop that please.
@Carl, it wasn't Genghis Khan.
It was Conan the Barbarian, impersonated by the former California governor.
Not to be taken too seriously... :-)
The PyPy San Francisco Bay Area Tour 2011
PyPy is coming to the San Francisco Bay Area in the beginning of March with a series of talks and a mini sprint.
- Wednesday March 2, 4:15 p.m.: Armin Rigo gives a talk at Stanford. Open to the public.
- Thursday March 3, 6:00 p.m.: General talk at Yelp, 706 Mission St 9th Floor, San Francisco CA 94103. Open to the public.
- Saturday and Sunday March 5 and 6: PyPy mini sprint at Noisebridge, 2169 Mission Street between 17th and 18th in San Francisco. Open to the public.
- Monday March 7th, 11:30 a.m.: Google Tech talk in Mountain View at the Googleplex. Not open to the public (but the video should be available later).
- Monday March 7th, 2:30 p.m.: Talk at Mozilla in Mountain View. Not open to the public (but Mozilla developers can videoconference).
From the PyPy project team we will have Armin Rigo, Maciej Fijałkowski (from 6th March), Laura Creighton and Jacob Hallén and possibly Christian Tismer attending.
Most of the talks will focus on (some of) the highlights and the status of PyPy:
- most Python benchmarks run much faster than with CPython or Psyco
- the real-world PyPy compiler toolchain itself (200 KLocs) runs twice as fast
- supports 32- and 64-bit x86 and is in the process of supporting ARM
- full compatibility with CPython (more than Jython/IronPython)
- full (and JIT-ed) ctypes support to call C libraries from Python
- supports Stackless Python (in-progress)
- new "cpyext" layer which integrates existing CPython C extensions
- an experimental super-fast JIT-compilation of calls to C++ libraries
As is usual for us, there is vastly more material that is available for us to cover than time, especially when it comes to possible future directions for PyPy. We want to reserve a certain amount of time at each talk purely to discuss things that are of interest to audience members. However, if you already know what you wish we would discuss, and are attending a talk (or even if you aren't), please let us know. You can either reply to this blog post, or mail Laura directly at lac at openend.se .
Apart from getting more technical and project insight, our travel is also a good possibility for companies in the SF area to talk to us regarding contracting. In September 2011 our current "Eurostars" research project ends and some of us are looking for ways to continue working on PyPy through consulting, subcontracting or hiring. The two companies, Open End and merlinux, have successfully done a number of such contracts and projects in the past. If you want to talk business or get together for lunch or dinner, let us know! If you would like us to come to your company and make a presentation, let us know! If you have any ideas about what we should discuss in a presentation so that you could use it to convince the powers-that-be at your place of employment that investing time and money in PyPy would be a good idea, let us know!
On Tuesday March 8th we will be heading for Atlanta for the Python VM and Language Summits before attending PyCon. Maciej Fijałkowski and Alex Gaynor will be giving a talk entitled Why is Python slow and how can PyPy help? Maciej will also be giving the talk Running ultra large telescopes in Python which is partially about his experiences using PyPy in the Square Kilometer Array project in South Africa. There will be a PyPy Sprint March 14-17. All are welcome.
I wanted to let everyone know, there is a PSF-sponsored code sprint in Portland, Oregon on February 26th starting at 9am. If you're going to be in the area, it promises to be a great time. We've got a great plan for the day, which can be seen in this google doc. I hope to see some of you there!
--Dan
We'll be giving a talk at Dropbox in San Francisco at 16:00 on Friday March 4th.
PyPy faster than C on a carefully crafted example
Good day everyone.
The recent round of optimizations, especially loop invariant code motion, has been very good for small to medium examples. There is ongoing work to make them scale to larger ones, but here is an example worth showing how well they perform. The following example, besides benefiting from loop invariants, also shows a difference between static and dynamic compilation. In fact, after applying all the optimizations C does, only a JIT can use the extra bit of runtime information to run even faster.
The example is as follows. First Python. I create two files, x.py:
def add(a, b):
    return a + b
And y.py:
from x import add

def main():
    i = 0
    a = 0.0
    while i < 1000000000:
        a += 1.0
        add(a, a)
        i += 1

main()
For C, x.c:
double add(double a, double b)
{
    return a + b;
}
and y.c:
double add(double a, double b);

int main()
{
    int i = 0;
    double a = 0;
    while (i < 1000000000) {
        a += 1.0;
        add(a, a);
        i++;
    }
}
Results?
- 1.97s - PyPy
- 3.07s - C
- PyPy trunk (386ed41eae0c), running pypy-c y.py
- C - gcc -O3 (GCC 4.4.5 shipped with Ubuntu Maverick)
Hence, PyPy is 50% faster than C on this carefully crafted example. The reason is obvious: a static compiler can't inline across file boundaries. In C you can somehow circumvent that, but it wouldn't work with shared libraries anyway. In Python, however, even though the whole import system is completely dynamic, the JIT can dynamically find out what can be inlined. That example would work equally well for Java and other decent JITs; it's however good to see we work in the same space :-)
Cheers,
fijal
EDIT: Updated GCC version
> The reason is obvious - static compiler can't inline across file boundaries.
That's what link-time optimizations are for, which were added to GCC in 2009; however, your point concerning shared libraries is valid...
I added a printf("%f\n",a) to the end of the file so the compiler wouldn't optimize the whole thing away. On my Core 2 Duo 2.33GHz, I got for gcc -O3:
1000000000.000000
real 0m4.396s
user 0m4.386s
sys 0m0.007s
and for gcc -O3 -flto -fwhole-program:
1000000000.000000
real 0m1.312s
user 0m1.308s
sys 0m0.003s
Great work!
Now you just have to identify and remove dead code in your jit. Then you could remove the call to 'add' altogether.
In this strange example, in our JIT, the call to 'add' is indeed removed because of inlining, and then the addition that occurs in there is removed because of dead code elimination.
@Zeev yes, but the C equivalent of a Python import is indeed shared libraries, where -fwhole-program no longer works.
@Armin note that even when the result is accumulated (addition is not removed, although the call is still inlined), PyPy is still faster. Not as much though: 2.5s vs 3.0s
For completeness's sake, what's the output of `gcc --version` in your example?
Not to mention specialization: python's (and pypy's) add() can add pretty much anything - strings if you will.
The JIT will inline a specialized version particular to the call site, whereas C can only apply generalized optimizations.
There's another simple case where pypy could (in principle) do very much better than standard C: turn pow(x, i) into sqrt(x*x*x) if i == 3/2, and other reductions. In practice, if you don't know what i is at compile time you often bundle the simplifications into a function (at the cost of some ifs), but a JIT could do a very nice job on this automagically whenever i is fixed, which it usually is.
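Spelling out the reduction described above as plain Python (pow_reduced is a hypothetical helper of mine, not an existing API):

import math

def pow_reduced(x, i):
    # special-case exponents with cheaper equivalents; a JIT could apply
    # such rewrites automatically once i turns out to be a runtime
    # constant at a given call site
    if i == 1.5:
        return math.sqrt(x * x * x)   # pow(x, 3/2) == sqrt(x**3)
    if i == 2.0:
        return x * x
    return math.pow(x, i)

print pow_reduced(4.0, 1.5)   # 8.0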
You wrote: "PyPy 50% faster than C on this carefully crafted example".
The truth is: PyPy is 35% faster than the C code (using C as the baseline), because it completes in 65% of the time required by the C version.
The C code takes 50% more time to execute (is slower by 50%, 1.5x slower) than the PyPy code (using PyPy as the baseline).
Test with gcc (Debian 20110126-0ubuntu1) 4.6.0 20110126 (experimental) [trunk revision 169285]: "/usr/lib/gcc-snapshot/bin/gcc [OPTIONS] x.c y.c -o x && time ./x". OPTIONS=-O0: 10.1s; OPTIONS=-O3: 9.1s; OPTIONS=-O3 -flto: 0.002s. Woops, 0.002 second? I checked: the result is correct :-) LTO rocks!
@haypo print the result so the loop doesn't get removed as dead code. Besides, the problem is really that -flto is unfair, since Python imports more resemble shared libraries than statically-compiled files.
In general, if you want to compare the performance of languages, you're actually supposed to try to write the *fastest* implementation in each language. Not just some arbitrary one.
In this example, the program has no output, so both implementations are crap and could be made a lot faster.
Come up with a program that has testable output, and see if someone can't comment with a C program that's faster than your python.
Pypy isn't faster than C, even on this example for multiple reasons:
First it's conceptual: C is almost as optimized as assembly (it's often referred to as a super assembler), so even if Pypy ends up generating some assembly code, it first has to evaluate the runtime environment to figure out the types of variables and then emit assembly code, and all this process is not free... so Pypy can only asymptotically reach the same level as C and assembly.
Second, the test is flawed: I did a slight modification that shouldn't change the results: I've inlined the add() in both python and C. Oh! surprise: Pypy keeps the same time whereas C is 4x faster than before (without inlining).
So to make it fair, we need to use the best capabilities of both languages:
- python: I'm sure the author provided the best python implementation (and the fact that inlining add() doesn't change the results kinda proves this)
- C: when you inline the function you get:
[code]
#include <stdio.h>

#define N 1000000000

static inline double add_double(double a, double b) {
    return a + b;
}

int main()
{
    unsigned int i;
    double a = 0.0;
    for (i = 0; i < N; i++) {
        a += 1.0;
        add_double(a, a);
    }
    printf("%f\n", a);
    return 0;
}
[/code]
Results:
C inlined: 1.10s
C: 3.98s
Pypy inlined: 3.30s
Pypy: 3.28s
Conclusion:
- When using the right C code, on the same example C is 3 times faster than Pypy.
- As demonstrated, the statement that Pypy is faster than C is simply biased by non-optimized C code.
@Eric This post is not trying to argue that Python is "better" or even faster than C. It is just pointing out that certain classes of optimizations (i.e. whole program optimizations) come naturally to the PyPy JIT.
This is, of course, only one small facet of why a program runs fast. The author admits that it is a contrived example to illustrate the point.
Taking the point to an extreme, one could see a PyPy program run faster than a C program if the C program made many calls to simple shared libraries. For example, if one dynamically links a C stdlib into their program, and uses it heavily, the equivalent python code may conceivably run faster.
Please read the title of this article again: "PyPy faster than C on a carefully crafted example"
Based on a specific example or not, it doesn't matter. I'm simply not comfortable reading strong statements like this that are obviously false to any serious computer scientist and misleading to beginners. It's false because it's the conclusion of a test which is biased.
The root of benchmarking is to get rid of any bias.
In this case the obvious bias is that Pypy is optimized and C isn't (as demonstrated above with inline functions).
You can't transpose only what you want to real life and not the rest: your argument that in real life the C version could use external libraries and hence be slower is valid, but then you have to compare with real-life Python scripts, which can't be optimized by Pypy as much as this crafted example. So in real life you get C code that may be slowed down a bit by dynamic linking, and Python scripts that are much slower because Pypy isn't ready to match C's speed for everything (yet).
If you want to use a crafted Python example, you have to compare it to a crafted C example, so that you can compare apples with apples.
All that is methodology, that said JIT is quite powerful and it's impressive in itself to beat CPython by a large margin.
Eric: Your comments about "real life" are irrelevant - the post is about a specific, contrived example. I don't think anyone would argue that a high-level, garbage-collected language like python could ever beat out C in general - it's simply a demonstration that, in a very specific instance, equivalent code in python and C can run faster in python because of the JIT making optimizations that can't occur at compile time.
You're assuming that python is faster even on this crafted example, but keep in mind that this comparison is biased because the C version isn't optimal.
point taken, but do update the article to take into account my remark: both the title and the conclusion of the "demonstration" are false, even on a contrived example, as you can hardly find any C code that would be slower than the code generated by your JIT, for the simple reason that C is really too close to assembly and a JIT adds overhead.
Hey Eric.
Your argument is incredibly flawed. You can always write a faster version in assembler (or is C the fastest assembler ever?) if you try hard enough. Why not?
Please don't digress, what I say is simple:
The article states that Pypy generates code faster than C on a crafted example.
I demonstrated there is more optimized C code than the author's, hence the whole article is wrong... end of the story.
No, it's a reasonable piece of C. You don't inline your printf code, do you? dynamic linking is a thing that people use.
You're right, people very often use dynamic linking. However the following is not a reasonable piece of Python code:
def add(a, b): return a + b
People rarely use that and more importantly they don't write a loop that calls it 1 billion times.
The point is that the reasoning spans two levels (hence is flawed/biased):
- in Python, the author took a crafted piece of Python that is not meaningful in real life, because it has the property of doing what he wants at the Pypy level
- in C, the author uses a very common mechanism that isn't fully optimized (not as much as the Python/PyPy side is optimized).
I know you will not agree since you're all proud that "Pypy is faster than C" (lol it's nonsense even on a "crafted example") but you have to compare apples with apples.
@Eric what you don't understand is the point of the article. The actual point is to demonstrate a nice property of PyPy JIT, which is able to generate fast code when it can. Comparing to C in this manner proves that PyPy's generated machine code is relevant with regard to speed.
Of course this example is fragile because it relies on suboptimal C code, but this serves only to prove the point about PyPy.
A JIT Backend for ARM Processors
ARM processors are very widely used, being deployed in servers, some netbooks and mainly mobile devices such as phones and tablets. One of our goals is to be able to run PyPy on phones, especially on Android. Currently it is not yet possible to translate and compile PyPy for Android automatically, but there has been some work on using Android's NDK to compile PyPy's generated C code.
The JIT Backend targets the application profile of the ARMv7 instruction set architecture which is found for example in the Cortex-A8 processors used in many Android powered devices and in Apple's A4 processors built into the latest iOS devices. To develop and test the backend we are using a BeagleBoard-xM which has a 1 GHz ARM Cortex-A8 and 512 MB of RAM running the ARM port of Ubuntu 10.10.
Currently on Linux it is possible to translate and cross-compile PyPy's Python interpreter as well as other interpreters with the ARM JIT backend enabled using Scratchbox 2 to provide a build environment and the GNU ARM cross compilation toolchain. So far the backend only supports the Boehm garbage collector which does not produce the best results combined with the JIT, but we plan to add support for the other GCs in the future, doing so should increase the performance of PyPy on ARM.
While still debugging the last issues with the backend we already can run some simple benchmarks on Pyrolog, a prolog interpreter written in RPython. Even using Boehm as the GC the results look very promising. In the benchmarks we compare Pyrolog to SWI-Prolog, a prolog interpreter written in C, which is available from the package repositories for Ubuntu's ARM port.
The benchmarks can be found in the pyrolog-bench repository.
| Benchmark | SWI-Prolog in ms. | Pyrolog in ms. | Speedup |
|---|---|---|---|
| iterate | 60.0 | 6.0 | 10.0 |
| iterate_assert | 130.0 | 6.0 | 21.67 |
| iterate_call | 3310.0 | 5.0 | 662.0 |
| iterate_cut | 60.0 | 359.0 | 0.16713 |
| iterate_exception | 4950.0 | 346.0 | 14.306 |
| iterate_failure | 400.0 | 127.0 | 3.1496 |
| iterate_findall | 740.0 | No res. | |
| iterate_if | 140.0 | 6.0 | 23.333 |
For simple benchmarks running on PyPy's Python interpreter we see some speedups over CPython, but we still need to debug the backend a bit more before we can show numbers on more complex benchmarks. So, stay tuned.
Awesome stuff. I have a panda board and another xm that's usually not doing much if you want to borrow some cycles :-)
When you support floats will you be aiming for hard float? It's the way of the future, I hear...
I am curious if you had any use for ThumbEE (or Jazelle RCT) to speed up?
@mwhudson: thanks it would be great to be able to test on more hardware.
For the float support we still need to investigate a bit, but if possible I would like to target hard floats.
@dbrodie: currently we are targeting the arm state, so not at the moment.
One would imagine conserving memory would be an important factor on mobile devices. Even though mobile devices have a growing amount of memory available, it will still be less than desktops for the forseeable future. Memory pressure can create real slowdowns.
A JIT normally takes more memory, but on the other hand PyPy offers features to reduce usage of memory. Could you share some of your thinking on this?
Martijn: you are describing the situation as well as we (at least I) know it so far: while PyPy has in many cases a lower non-JIT memory usage, the JIT adds some overhead. But it seems to be within ~200MB on "pypy translate.py", which is kind of the extreme example in hugeness. So already on today's high-end boards with 1GB of RAM, it should easily fit. Moreover it can be tweaked, e.g. it's probably better on these systems to increase the threshold at which JITting starts (which also reduces the number of JITted code paths). So I think that the possibility is real.
Showing speedups over repetitive instructions (which caching & JIT are really good at) is irrelevant.
What happens when people use real benchmarks, like constraint-based solvers and non-iterative stuff (maybe take a look at the other benchmarks) ...
Prolog is a declarative language, not a sysadmin scripting language.
Also, the SWI implementation adds so many functionalities, it's like making a «Extract chars from an RDBMS vs Text files» benchmark.
@Dan
Why are you so defensive? This benchmark is clearly not about how fast Pyrolog is, but how the ARM JIT backend performs, using trivial Prolog microbenchmarks, with SWI to give a number to compare against.
Pyrolog is a minimal Prolog implementation that is (at least so far) mostly an experiment to see how well PyPy's JIT technology can do on an non-imperative language. This paper contains more interesting benchmarks:
https://portal.acm.org/citation.cfm?id=1836102
Hi,
Is there a way to cross compile on a host machine (but not with scratch box) where I have tool chain and file system for the target?
Any instructions for building with arm back-end?
Cheers
@jamu: scratchbox 2 is currently the only option to cross-translate pypy for ARM. You can find some documentation about the cross translation at https://foss.heptapod.net/pypy/pypy/-/tree/branch/arm-backend-2/pypy/doc/arm.rst
PyPy wants you!
If you ever considered contributing to PyPy, but never did so far, this is a good moment to start! :-)
Recently, we merged the fast-forward branch which brings Python 2.7 compatibility, with the plan of releasing a new version of PyPy as soon as all tests pass.
However, at the moment there are still quite a few of failing tests because of new 2.7 features that have not been implemented yet: many of them are easy to fix, and doing it represents a good way to get confidence with the code base, for those who are interested in it. Michael Foord wrote a little howto explaining the workflow for running lib-python tests.
Thus, if you are willing to join us in the effort of having a PyPy compatible with Python 2.7, probably the most sensible option is to come to the #PyPy IRC channel on Freenode, so we can coordinate with each other and not fix the same test twice.
Moreover, if you are a student and are considering participating in the next Google Summer of Code this is a good time to get into pypy. You have the opportunity to get a good understanding of pypy for when you decide what you would like to work on over the summer.
Would you mind giving us a hint of what skills programmers would need to be actually useful? I know you don't want to scare anybody off, but PyPy is kind of the ultimate evolution of what you can do with the language, and I get the sense (perhaps wrongly!) that it goes places where desktop-and-web-app guys like me are a bit out of our depth and actually might waste time more than anything else.
I'm asking this here because I'm pretty sure that others are going to be thinking the same thing.
Seems a lot of volunteers applied - buildbot.pypy.org renders a 502 Proxy Error
Nofrak: you ask good questions. I'd say you need to know your way around Python programming in general which you most certainly do if you have done desktop or Web apps in Python.
Secondly, it's important to know a bit about the basic structure of an Python interpreter. Reading some docs, among them Chapter 1 of https://codespeak.net/pypy/trunk/pypy/doc/coding-guide.html#overview-and-motivation should help.
Thirdly, methodology: PyPy is written in a test-driven way, and for the Python interpreter there are several places for tests: one is the (sometimes slightly modified) standard CPython tests in the lib-python/(modified-)2.7.0 directory, another is pypy/objspace/std/test. The implementation of the interpreter mainly is written down in pypy/objspace/std/*.py.
Hope that helps a bit. IRC is a good place to ask for further directions, of course.
And then what do we do after fixing a failing test case? For each patch, create a new bug in the bug tracker and attach it?
@Anonymous: creating a new issue in the bug tracker is not necessary: you can just come on IRC or write to pypy-dev attaching your patch, or you can e.g. fork the project on bitbucket and send a pull request, or you can send us the mercurial bundle, etc. etc.
There isn't really any bureaucracy for this :)
Loop invariant code motion
Recently, the jit-unroll-loops branch was merged. It implements the idea described in Using Escape Analysis Across Loop Boundaries for Specialization. That post only talks about virtuals, but the idea turned out to be more far-reaching. After the metainterpreter produces a trace, several optimizations are applied to it before it is turned into binary code. Removing allocations is only one of them. There are also, for instance:
- Heap optimizations that remove memory accesses by reusing results previously read from or written to the same location.
- Reuse of the results of pure operations if the same pure operation is executed twice.
- Removal of redundant guards.
- ...
This is achieved by unrolling the trace into two iterations, and letting the optimizer work on this two-iteration trace. The optimizer will now be able to optimize the second iteration more than the first, since it can reuse results from the first iteration. The optimized version of the first iteration we call the preamble, and the optimized version of the second iteration we call the loop. The preamble will end with a jump to the loop, while the loop will end with a jump to itself. This means that the preamble will be executed once for the first iteration, while the loop will be executed for all following iterations.
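As a small illustration of what this buys (my own example, not taken from the post's traces), consider a loop in which part of the work does not change between iterations:

def scale_sum(values, factor):
    # the type checks and unboxing of factor, and the reading of
    # len(values), are loop-invariant: after unrolling they can stay in
    # the preamble, while the loop proper keeps only the add, multiply
    # and index arithmetic
    i = 0
    total = 0.0
    n = len(values)
    while i < n:
        total += values[i] * factor
        i += 1
    return total

print scale_sum([1.0, 2.0, 3.0], 2.0)   # 12.0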
Sqrt example
Here is an example of a Python implementation of sqrt using a fairly simple algorithm
def sqrt(y, n=10000):
x = y / 2
while n > 0:
n -= 1
x = (x + y/x) / 2
return x
If it is called with sqrt(1234.0), a fairly long trace is produced. From this trace the optimizer creates the following preamble (Loop 1) and loop (Loop 0)
Looking at the preamble, it starts by making sure that it is not currently being profiled (the guard on i5) and that the function object has not been changed since the trace was made (the guard on p3). Somewhat intermixed with that, the integer variable n is unboxed, by making sure p11 points to an integer object and reading out the integer value from that object. These operations are not needed in the loop (and have been removed from it), as emitting the same guards again would be redundant and n becomes a virtual before the end of the preamble:
guard_value(i5, 0, descr=<Guard6>)
guard_nonnull_class(p11, ConstClass(W_IntObject), descr=<Guard7>)
guard_value(p3, ConstPtr(ptr15), descr=<Guard8>)
i16 = getfield_gc_pure(p11, descr=<W_IntObject.inst_intval>)

Next comes a test and a guard implementing the while statement, followed by the decrementing of n. These operations appear both in the preamble and in the loop:
i18 = int_gt(i16, 0)
guard_true(i18, descr=<Guard9>)
i20 = int_sub(i16, 1)

After that the two floating point variables x and y are unboxed. Again this is only needed in the preamble. Note how the unboxed value of y, called f23, is passed unchanged from the preamble to the loop in the arguments of the jump, to allow it to be reused. It will not become a virtual since it is never changed within the loop:
guard_nonnull_class(p12, 17652552, descr=<Guard10>)
guard_nonnull_class(p10, 17652552, descr=<Guard11>)
f23 = getfield_gc_pure(p10, descr=<W_FloatObject.inst_floatval>)
f24 = getfield_gc_pure(p12, descr=<W_FloatObject.inst_floatval>)

Following that are the actual calculations performed in the loop, in the form of floating point operations (since the function was called with a float argument). These appear in both the loop and the preamble:
i26 = float_eq(f24, 0.000000)
guard_false(i26, descr=<Guard12>)
f27 = float_truediv(f23, f24)
f28 = float_add(f24, f27)
f30 = float_truediv(f28, 2.000000)

Finally there are some tests checking if a signal was received (such as when the user presses ctrl-C) and thus some signal handler should be executed, or if we need to hand over to another thread. This is implemented with a counter that is decreased once every iteration. It will go below zero after some specific number of iterations, tunable by sys.setcheckinterval. The counter is read from and written to some global location where it also can be made negative by a C-level signal handler:
i32 = getfield_raw(32479328, descr=<pypysig_long_struct.c_value>)
i34 = int_sub(i32, 2)
setfield_raw(32479328, i34, descr=<pypysig_long_struct.c_value>)
i36 = int_lt(i34, 0)
guard_false(i36, descr=<Guard13>)
jump(p0, p1, p2, p4, p10, i20, f30, f23, descr=<Loop0>)
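A rough Python-level sketch of what these last operations implement (the names are illustrative, not PyPy internals):

import sys

def handle_pending():
    pass  # stands in for running signal handlers or switching threads

def countdown_step(countdown):
    # mirrors the int_sub / setfield_raw / int_lt / guard_false sequence
    countdown -= 2
    if countdown < 0:
        handle_pending()
        countdown = sys.getcheckinterval()
    return countdown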
Bridges
When a guard fails often enough, the meta-interpreter is started again to produce a new trace starting at the failing guard. The tracing is continued until a previously compiled loop is entered. This could either be the same loop that contains the failing guard or some completely different loop. If it is the same loop, executing the preamble again may be unnecessary. It is preferable to end the bridge with a jump directly to the loop. To achieve this the optimizer tries to produce short preambles that are inlined at the end of bridges, allowing them to jump directly to the loop. Inlining is better than jumping to a common preamble because most of the inlined short preamble can typically be removed again by the optimizer. Creating such a short preamble is however not always possible. Bridges jumping to loops for which no short preamble can be generated have to end with a jump to the full preamble instead.

The short preamble is created by comparing the operations in the preamble with the operations in the loop. The operations that are in the preamble but not in the loop are moved to the short preamble whenever it is safe to move them to the front of the operations remaining. In other words, the full preamble is equivalent to the short preamble followed by one iteration of the loop.
This much has currently been implemented. To give the full picture here, there are two more features that hopefully will be implemented in the near future. The first is to replace the full preamble, used by the interpreter when it reaches a compiled loop, with the short preamble. This is currently not done and is probably not as straightforward as it might first seem. The problem is where to resume interpreting on a guard failure. However, implementing that should save some memory. Not only because the preamble will become smaller, but mainly because the guards will appear either in the loop or in the preamble, but not in both (as they do now). That means there will only be a single bridge and not potentially two copies once the guards are traced.
The sqrt example above would with a short preamble result in a trace like this
If it is executed long enough, the last guard will be traced to form a bridge. The trace will inherit the virtuals from its parent. This can be used to optimize away the part of the inlined short preamble that deals with virtuals. The resulting bridge should look something like
[p0, p1, p2, p3, p4, f5, i6]
i7 = force_token()
setfield_gc(p1, i7, descr=<PyFrame.vable_token>)
call_may_force(ConstClass(action_dispatcher), p0, p1, descr=<VoidCallDescr>)
guard_not_forced(, descr=<Guard19>)
guard_no_exception(, descr=<Guard20>)

guard_nonnull_class(p4, 17674024, descr=<Guard21>)
f52 = getfield_gc_pure(p4, descr=<W_FloatObject.inst_floatval>)
jump(p1, p0, p2, p3, p4, i38, f53, f52, descr=<Loop0>)

Here the first block comes from the traced bridge and the second is what remains of the short preamble after optimization. The box p4 is not a virtual (it contains a pointer to y, which is never changed), and it is only virtuals that the bridge inherits from its parents. This is why the last two operations currently cannot be removed.
Each time the short preamble is inlined, a new copy of each of the guards in it is generated. Typically the short preamble is inlined in several places and thus there will be several copies of each of those guards. If they fail often enough bridges from them will be traced (as with all guards). But since there typically are several copies of each guard the same bridge will be generated in several places. To prevent this, mini-bridges from the inlined guards are produced already during the inlining. These mini-bridges contain nothing but a jump to the preamble.
The mini-bridges need the arguments of the preamble to be able to jump to it. These arguments contain, among other things, boxed versions of the variables x and y. Those variables are virtuals in the loop, and have to be allocated. Currently those allocations are placed in front of the inlined guard. Moving those allocations into the mini-bridges is the second feature that hopefully will be implemented in the near future. After this feature is implemented, the result should look something like
Multiple specialized versions
Floating point operations were generated in the trace above because sqrt was called with a float argument. If it is instead called with an int argument, integer operations will be generated. The somewhat more complex situation is when both ints and floats are used as arguments. Then the JIT needs to generate multiple versions of the same loop, specialized in different ways. The details, given below, on how this is achieved are somewhat involved. For the casual reader it would make perfect sense to skip to the next section here.

Consider the case when sqrt is first called with a float argument (but with n small enough not to generate the bridge). Then the trace shown above will be generated. If sqrt is now called with an int argument, the guard in the preamble testing that the type of the input object is float will fail:
guard_nonnull_class(p12, 17652552, descr=<Guard10>)

It will fail every iteration, so soon enough a bridge will be generated from this guard in the preamble. This guard will end with a jump to the same loop, and the optimizer will try to inline the short preamble at the end of it. This will however fail since now there are two guards on p12: one that makes sure it is an int and one that makes sure it is a float. The optimizer will detect that the second guard will always fail and mark the bridge as invalid. Invalid loops are not passed on to the backend for compilation.
If a loop is detected to be invalid while inlining the short preamble, the metainterpreter will continue to trace for yet another iteration of the loop. This new trace can be compiled as above and will produce a new loop with a new preamble that is now specialized for int arguments instead of float arguments. The bridge that previously became invalid will now be tried again, this time inlining the short preamble of the new loop instead. This will produce a set of traces connected like this
(click for some hairy details)
The height of the boxes in this figure represents how many instructions they contain (presuming the missing features from the previous section are implemented). Loop 0 is specialized for floats and its preamble has been split into two boxes at the failing guard. Loop 2 is specialized for ints and is larger than Loop 0. This is mainly because the integer division in Python does not map to the integer division of the machine, but has to be implemented with several instructions (integer division in Python truncates its result towards minus infinity, while the machine integer division truncates towards 0). Also, the height of the bridge is about the same as the height of Loop 2. This is because it contains a full iteration of the loop.
A More Advanced Example
Let's conclude with an example that is a bit more advanced, where this unrolling approach actually outperforms the previous approach. Consider making a fixed-point implementation of the square root using 16 bits of decimals. This can be done using the same implementation of sqrt, but calling it with an object of a class representing such fixed-point real numbers:
class Fix16(object):
def __init__(self, val, scale=True):
if isinstance(val, Fix16):
self.val = val.val
else:
if scale:
self.val = int(val * 2**16)
else:
self.val = val
def __add__(self, other):
return Fix16(self.val + Fix16(other).val, False)
def __sub__(self, other):
return Fix16(self.val - Fix16(other).val, False)
def __mul__(self, other):
return Fix16((self.val >> 8) * (Fix16(other).val >> 8), False)
def __div__(self, other):
return Fix16((self.val << 16) / Fix16(other).val, False)
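The same sqrt function can then be called unchanged with any of the three argument types, for example (a small usage sketch; the conversion at the end assumes the 16.16 layout of the class above):

print sqrt(1234.0)             # float version of the loop
print sqrt(1234)               # integer version
r = sqrt(Fix16(1234.0))        # fixed-point version
print r.val / float(2 ** 16)   # back from the 16.16 representation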
Below is a table comparing the runtime of the sqrt function above with different argument types on different Python interpreters. PyPy 1.4.1 was released before the optimizations described in this post were in place, while they are in place in the nightly build from January 5, denoted pypy in the table. There are also the running times for the same algorithm implemented in C and compiled with "gcc -O3 -march=native". Tests were executed on a 2.53GHz Intel Core2 processor with n=100000000 iterations. Comparing the integer versions with C may be considered a bit unfair because of the more advanced integer division operator in Python. The left part of this table shows runtimes of sqrt in a program containing a single call to sqrt (i.e. only a single specialized version of the loop is needed). The right part shows the runtime of sqrt when it has been called with a different type of argument before.
| | First call: float | First call: int | First call: Fix16 | Second call: float | Second call: int | Second call: Fix16 |
|---|---|---|---|---|---|---|
| cpython | 28.18 s | 22.13 s | 779.04 s | 28.07 s | 22.21 s | 767.03 s |
| pypy 1.4.1 | 1.20 s | 6.49 s | 11.31 s | 1.20 s | 6.54 s | 11.23 s |
| pypy | 1.20 s | 6.44 s | 6.78 s | 1.19 s | 6.26 s | 6.79 s |
| gcc | 1.15 s | 1.82 s | 1.89 s | 1.15 s | 1.82 s | 1.89 s |
For this to work in the last case, when Fix16 is the argument type in the second call, the trace_limit had to be increased from its default value to prevent the metainterpreter from aborting while tracing the second version of the loop. Also, sys.setcheckinterval(1000000) was used to prevent the bridge from being generated. With the bridge, the performance of the last case is significantly worse. Maybe because the optimizer currently fails to generate a short preamble for it. But the slowdown seems too big for that to be the only explanation. Below are the runtime numbers with checkinterval set to its default value of 100:
| | First call: float | First call: int | First call: Fix16 | Second call: float | Second call: int | Second call: Fix16 |
|---|---|---|---|---|---|---|
| cpython | 28.71 s | 22.09 s | 781.86 s | 28.28 s | 21.92 s | 761.59 s |
| pypy 1.4.1 | 1.21 s | 6.48 s | 11.22 s | 1.72 s | 7.58 s | 12.18 s |
| pypy | 1.21 s | 6.27 s | 7.22 s | 1.20 s | 6.29 s | 90.47 s |
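For reference, here is roughly how those two knobs can be set from inside the benchmark script when running on PyPy (a sketch: the pypyjit module only exists on PyPy, and the trace_limit value shown is illustrative, since the post does not give the exact number used):

import sys
import pypyjit   # only importable when running on PyPy

pypyjit.set_param("trace_limit=20000")   # raise the limit so the Fix16 trace is not aborted
sys.setcheckinterval(1000000)            # as in the measurements above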
Conclusions
Even though we are seeing speedups in a variety of different small benchmarks, more complicated examples are not affected much by these optimizations. It might partly be because larger examples have longer and more complicated loops, and thus allowing optimizations to operate across loop boundaries will have a smaller relative effect. Another problem is that with more complicated examples there will be more bridges, and bridges are currently not handled very well (most of the time all virtuals are forced at the end of the bridge, as explained above). But moving those forcings into the mini-bridges should fix that.

Do you think you could fix the pictures?
I only see black images with exclamation marks.
thanks
PyPy 1.4.1
Here is PyPy 1.4.1 :-)
Update: Win32 binaries available.
Enjoy!
Release announcement
We're pleased to announce the 1.4.1 release of PyPy. This release consolidates all the bug fixes that occurred since the previous release. To everyone that took the trouble to report them, we want to say thank you.
What is PyPy
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython. Note that it still only emulates Python 2.5 by default; the fast-forward branch with Python 2.7 support is slowly getting ready but will only be integrated in the next release.
In two words, the advantage of trying out PyPy instead of CPython (the default implementation of Python) is, for now, the performance. Not all programs are faster in PyPy, but we are confident that any CPU-intensive task will be much faster, at least if it runs for long enough (the JIT has a slow warm-up phase, which can take several seconds or even one minute on the largest programs).
Note again that we do support compiling and using C extension modules from CPython (pypy setup.py install). However, this is still an alpha feature, and the most complex modules typically fail for various reasons; others work (e.g. PIL) but take a serious performance hit. Also, for Mac OS X see below.
Please note also that PyPy's performance was optimized almost exclusively on Linux. It seems from some reports that on Windows as well as Mac OS X (probably for different reasons) the performance might be lower. We did not investigate much so far.
More highlights
- We migrated to Mercurial (thanks to Ronny Pfannschmidt and Antonio Cuni for the effort) and moved to bitbucket. The new command to check out a copy of PyPy is:
  hg clone https://bitbucket.org/pypy/pypy
- In long-running processes, the assembler generated by old JIT-compilations is now freed. There should be no more leak, however long the process runs.
- Improved a lot the performance of the binascii module, and of hashlib.md5 and hashlib.sha.
- Made sys.setrecursionlimit() a no-op. Instead, we rely purely on the built-in stack overflow detection mechanism, which also gives you a RuntimeError -- just not at some exact recursion level.
- Fixed argument processing (now e.g. pypy -OScpass works like it does on CPython --- if you have a clue what it does there :-) )
- cpyext on Mac OS X: it still does not seem to work. I get systematically a segfault in dlopen(). Contributions welcome.
- Fixed two corner cases in the GC (one in minimark, one in asmgcc+JIT). This notably prevented pypy translate.py -Ojit from working on Windows, leading to crashes.
- Fixed a corner case in the JIT's optimizer, leading to Fatal RPython error: AssertionError.
- Added some missing built-in functions into the 'os' module.
- Fixed ctypes (it was not propagating keepalive information from c_void_p).
Wow, and I thought 1.4.1 would come out after the january sprint!
A christmas present :->
What would be the focus of the january sprint then?
There are still a number of branches that have not been merged into trunk yet: at least fast-forward (Python 2.7), jit-unroll-loops (better JITting of arithmetic and short loops), arm-backend (JIT support on ARM) and jitypes2 (turn ctypes calls into real assembler-level calls with the JIT). There is also the stackless+JIT integration pending. Finally the sprint will also be a place to try out and run some applications. So it's not like we are out of work :-)
I'm interested in the performance improvement in hashlib.sha. I haven't seen that one before on https://speed.pypy.org . Could you give me more details?
Regards,
Zooko
Actually, hashlib.sha was not the same as sha.sha: the former used to be a ctypes call to the OpenSSL lib, whereas the latter uses our built-in sha implementation. So hashlib.sha was faster in theory, but killed by the overhead of using ctypes. Now, at least in a default version of pypy, the hashlib.md5 and .sha are redirected to the built-in md5.md5 and sha.sha.
Another issue was that with the built-in md5.md5 and sha.sha, on 64-bit, there was a 1.5x speed impact due to the C compiler not recognizing an expression that was meant to be a 32-bit integer rotation.
I guess that https://speed.pypy.org doesn't show this because they use md5.md5 or sha.sha directly, and are on 32-bit.
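For the curious, the expression in question is the standard 32-bit rotation idiom; here is a generic Python sketch of it (an illustration of mine, not the actual PyPy source):

def rotate_left_32(x, n):
    # a C compiler that recognizes this pattern emits a single rotate
    # instruction; when it does not, md5/sha pay roughly the 1.5x
    # penalty mentioned above
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

print rotate_left_32(0x80000001, 1)   # 3: the top bit wraps around to the bottom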
Thanks for PyPy 1.4.1. I reported two issues concerning buildout with PyPy 1.4, and they all got fixed!
So PyPy 1.4.1 is now compatible with buildout, which is really convenient as it makes it easy for me to test other projects.
I compiled 1.4.1 on Win32 using Visual C++ 2010.
Do you want to add it to the download page?
To whom shall I send it?
Happy new year.
Hello,
sorry, I'm a bit new here - is it possible that PyPy makes Python run in a browser? Somehow "translating" all the Python into Javascript?
I'm wondering because I saw you run, for example, CLI, so perhaps PyPy may somehow enable Python in a browser?
Andrei: not directly. We played at some point with translating RPython code to Javascript, but it didn't give enough benefits (because it's not full Python that we can translate, just "RPython"). The alternative would be to translate the whole PyPy interpreter to Javascript, but that would give a result that is both huge (in term of download size) and horribly slow (100x slower than Javascript maybe).
PyPy migrates to Mercurial
The assiduous readers of this blog surely remember that during the last Düsseldorf sprint in October, we started the process for migrating our main development repository from Subversion to Mercurial. Today, after more than two months, the process has finally been completed :-).
The new official PyPy repository is hosted on BitBucket.
The migration has been painful because the SVN history of PyPy was a mess and none of the existing conversion tools could handle it correctly. This was partly because PyPy started when Subversion was still at version 0.9, when some best practices were yet to be established, and partly because we probably managed to invent all the possible ways to do branches (and even some of the impossible ones: there is at least one commit which you cannot do with the plain SVN client, but have to speak to the server yourself :-)).
The actual conversion was possible thanks to the enormous work done by Ronny Pfannschmidt and his hackbeil tool. I would like to personally thank Ronny for his patience to handle all the various requests we asked for.
We hope that PyPy development becomes even more approachable now, at least from a version control point of view.
Awesome! Besides simplifying life for potential new contributors, it's very nice to be able to follow progress using the shortlog on bitbucket.org.
@Владимир: 9000? I count 459 on my local repo, which is still a lot, but not so much :-)
Anyway, most of them are closed, it's just that bitbucket displays also those. And I think that the huge number of branches is another evidence of the "we are not heroes" thing :-)
https://morepypy.blogspot.com/2010/12/we-are-not-heroes-just-very-patient.html
"PyPy is faster than CPython, again" should be the title. Faster at migrating to mercurial
:)
Great work! Now PyPy could be even more self-hosting if it ran hg on top of itself, once it becomes faster than CPython and stable enough to do so.
Oh, and btw: PyPy gets funding through "Eurostars"
There is a supporting reason why we made so many advances in the last year: funding through Eurostars, a European research funding program. The title of our proposal (accepted in 2009) is: "PYJIT - a fast and flexible toolkit for dynamic programming languages based on PyPy". And the participants are Open End AB, the Heinrich-Heine-Universität Düsseldorf (HHU), and merlinux GmbH.
It's not hard to guess what PYJIT is actually about, is it? Quoting: "The PYJIT project will deliver a fast and flexible Just-In-Time Compiler toolkit based on PyPy to the market of dynamic languages. Our main aim is to showcase our project's results for the Open Source language Python, providing unprecedented levels of flexibility and with speed hitherto only available using statically typed languages." (Details in German or in Swedish :-)
A subgoal is to improve our development and testing infrastructure, mainly showcased by Holger's recent py.test releases, the testing tool used by PyPy for its 16K tests and the speed.pypy.org infrastructure (web app programmed by Miquel Torres on his own time).
The overall scope of this project is smaller than that of the previous EU project from 2004 to 2007. The persons that are (or were) getting money to work on PyPy are Samuele Pedroni (at Open End), Maciej Fijalkowski (as a subcontractor), Carl Friedrich Bolz, Armin Rigo, Antonio Cuni (all at HHU), and Holger Krekel (at merlinux) as well as Ronny Pfannschmidt (as a subcontractor).
The Eurostars funding lasts until August 2011. What comes afterwards? Well, for one, many of the currently funded people have done work without getting funding in previous years. This will probably continue. We also have non-funded people in the core group right now and we'll hope to enlarge it further. But of course there are still large tasks ahead which may greatly benefit from funding. We have setup a donation infrastructure and maybe we can win one or more larger organisations to provide higher or regular sums of money to fund future development work. Another possibility for companies is to pay PyPy developers to help and improve PyPy for their particular use cases.
And finally, your help, donations and suggestions are always welcome and overall we hope to convince more and more people it's worthwhile to invest into PyPy's future.
Thanks for the interesting overview of your travels and research interactions! I agree that getting better and more systematic benchmarks for Python would be worthwhile.
I find this project fascinating.
I wonder what's the theoretical limit of this approach for improving the performance of python (or any other language implemented in pypy)?
Do you have any rough estimation of how far you can go? Have you reached a limit, or are you just scratching the possibilities?
For example, do you think you can compete with javascript v8 or luajit?
Hi Ivan.
In general I don't think there are limits to this approach, other than, say, time and money. Python is a complex language.
Can you come up with an example where PyPy is actually slower than V8 *other* than computer language shootout? Programs on computer language shootout are just not nicely optimized for PyPy.
Hi Fijall,
I'm afraid I don't know about benchmarks and comparison between these languages, other than the shootout. I guess this is the first reference someone gets when comparing languages, since it's the most popular out there.
But it would be great if there was a resource to compare against other languages. At least, from a marketing point of view, it would be very good for pypy.
May I know why the shootout is not a good parameter?
And, is there any other benchmarks comparing pypy against v8, tracemonkey/jägermonkey, etc..?
Hi Ivan.
Shootout is not good because it contains heavily tuned programs, some of them even massively stretching the benchmark restrictions. They're tailored towards specific implementations, contain specific per-benchmark options, etc. Nobody has looked at the Python programs in detail, especially from a PyPy perspective. This would need to be done first to compare them fairly; until it's done, it's comparing a naive version to a heavily optimized one, not comparing languages.
From what I measured roughly, PyPy comes on par with Tracemonkey and is about 2x slower than V8. But those were very unscientific experiments and I'll deny everything :)
I don't think there is any good cross-language comparison, and that's at least partly due to the fact that workloads differ in different languages. Most shootout programs, for example, are tailored towards C workloads. Optimizing precisely for them (even if you have good programs) is kind of fun, but it does not represent what we try to achieve, that is, speeding up large Python programs.
I hope this answers your question.
Cheers,
fijal
to me it seems like you have reached the goals of unladen swallow and unladen swallow was a bit of a failure?
if google wants a faster python, why don't they fund you? it would be awesome if the core team could work on it full-time. :)