PyPy 2.2 - Incrementalism
We're pleased to announce PyPy 2.2, which targets version 2.7.3 of the Python language. The main highlight of this release is the introduction of the incremental garbage collector, sponsored by the Raspberry Pi Foundation.
This release also contains several bugfixes and performance improvements.
You can download the PyPy 2.2 release here:
https://pypy.org/download.html
We would like to thank our donors for the continued support of the PyPy project. We showed quite a bit of progress on all three projects (see below) and we're slowly running out of funds. Please consider donating more so we can finish those projects! The three projects are:
- Py3k (supporting Python 3.x): the release PyPy3 2.2 is imminent.
- STM (software transactional memory): a preview will be released very soon, as soon as we fix a few bugs.
- NumPy: the work done is included in the PyPy 2.2 release. More details below.
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 2.2 and cpython 2.7.2 performance comparison) due to its integrated tracing JIT compiler. This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows 32, or ARM (ARMv6 or ARMv7, with VFPv3).
Work on native Windows 64-bit support is still stalled; we would welcome a volunteer to handle that.
Highlights
- Our Garbage Collector is now "incremental". It should avoid almost all pauses due to a major collection taking place. Previously, it would pause the program (rarely) to walk all live objects, which could take arbitrarily long if your process is using a whole lot of RAM. Now the same work is done in steps. This should make PyPy more responsive, e.g. in games. There are still other pauses, from the GC and the JIT, but they should be on the order of 5 milliseconds each.
- The JIT counters for hot code were never reset, which meant that a process running for long enough would eventually JIT-compile more and more rarely executed code. Not only is it useless to compile such code, but as more compiled code means more memory used, this gives the impression of a memory leak. This has been tentatively fixed by decreasing the counters from time to time.
- NumPy has been split: now PyPy only contains the core module, called _numpypy. The numpy module itself has been moved to https://bitbucket.org/pypy/numpy and numpypy disappeared. You need to install NumPy separately with a virtualenv: pip install git+https://bitbucket.org/pypy/numpy.git; or directly: git clone https://bitbucket.org/pypy/numpy.git; cd numpy; pypy setup.py install.
- Non-inlined calls have less overhead.
- Things that use sys.settrace are now JITted (like coverage).
- JSON decoding is now very fast (JSON encoding was already very fast).
- Various buffer copying methods experience speedups (like list-of-ints to int[] buffers from cffi).
- We finally wrote (hopefully) all the missing os.xxx() functions, including os.startfile() on Windows and a handful of rare ones on Posix.
- NumPy has a rudimentary C API that cooperates with cpyext.
Armin Rigo and Maciej Fijalkowski
Py3k status update #12
This is the 12th status update about our work on the py3k branch, which we
can work on thanks to all of the people who donated to the py3k proposal.
Here's an update on the recent progress:
- Thank you to everyone who has provided initial feedback on the PyPy3 2.1 beta 1 release. We've gotten a number of bug reports, most of which have been fixed.
- As usual, we're continually keeping up with changes from the default branch. Oftentimes these merges come at a cost (conflicts and/or reintegration of py3k changes) but occasionally we get goodies for free, such as the recent JIT optimizations and incremental garbage collection.
- We've been focusing on re-optimizing Python 2 int sized (machine sized) integers:
We have a couple of known, notable speed regressions in the PyPy3 beta release vs regular PyPy. The major one is with Python 2.x int sized (or machine sized) integers.
Python 3 drops the distinction between int and long types. CPython 3.x
accomplishes this by removing the old int type entirely and renaming the long
type to int. Initially, we've done the same for PyPy3 for the sake of
simplicity and getting everything working.
However, PyPy's JIT is capable of heavily optimizing these machine sized integer operations, so this change came with a performance regression in this area.
We're now in the process of solving this. Part of this work also involves some
house cleaning on these numeric types which also benefits the default branch.
cheers,
Phil
We should note that the re-optimization is different from CPython's. In the latter they use a "long" implementation which they heavily optimized for the common case of small integers. In PyPy instead we use two really different implementations (like "int" and "long" on Python 2); they just happen to be exposed at the user level as the same Python type in Python 3.
Making coverage.py faster under PyPy
If you've ever tried to run your programs with coverage.py under PyPy,
you've probably experienced some incredible slowness. Take this simple
program:
def f():
    return 1

def main():
    i = 10000000
    while i:
        i -= f()

main()
Running time coverage.py run test.py five times, and looking at the best
run, here's how PyPy 2.1 stacks up against CPython 2.7.5:
Python | Time | Normalized to CPython |
---|---|---|
CPython 2.7.5 | 3.879s | 1.0x |
PyPy 2.1 | 53.330s | 13.7x slower |
Totally ridiculous. I got turned onto this problem because on one of my
projects CPython takes about 1.5 minutes to run our test suite on the build
bot, but PyPy takes 8-10 minutes.
So I sat down to address it. And the results:
Python | Time | Normalized to CPython |
---|---|---|
CPython 2.7.5 | 3.879s | 1.0x |
PyPy 2.1 | 53.330s | 13.7x slower |
PyPy head | 1.433s | 2.7x faster |
Not bad.
Technical details
So how'd we do it? Previously, using sys.settrace() (which coverage.py
uses under the hood) disabled the JIT. Except it didn't just disable the JIT,
it did it in a particularly insidious way — the JIT had no idea it was being
disabled!
Instead, every time PyPy discovered that one of your functions was a hotspot,
it would start tracing to observe what the program was doing, and right when it
was about to finish, coverage would run and cause the JIT to abort. Tracing
is a slow process; it makes up for it by generating fast machine code at the
end, but tracing is still incredibly slow. But we never actually got to the
"generate fast machine code" stage. Instead we'd pay all the cost of tracing,
but then we'd abort, and reap none of the benefits.
To fix this, we adjusted some of the heuristics in the JIT, to better show it
how sys.settrace(<tracefunc>) works. Previously the JIT saw it as an opaque
function which gets the frame object, and couldn't tell whether or not it
messed with the frame object. Now we let the JIT look inside the
<tracefunc> function, so it's able to see that coverage.py isn't
messing with the frame in any weird ways, it's just reading the line number and
file path out of it.
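To make this concrete, here is a minimal sketch (not coverage.py's actual code) of the kind of trace function that gets installed via sys.settrace(); the JIT can now look inside such a function and see that it only reads the line number and file name from the frame:

import sys

executed = set()

def trace(frame, event, arg):
    # Roughly what coverage.py does: record (filename, lineno) pairs.
    if event == "line":
        executed.add((frame.f_code.co_filename, frame.f_lineno))
    return trace

sys.settrace(trace)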
I asked several friends in the VM implementation and research field if they
were aware of any other research into making VMs stay fast when debugging tools
like coverage.py are running. No one I spoke to was aware of any (but I
didn't do a particularly exhaustive review of the literature, I just tweeted at
a few people), so I'm pleased to say that PyPy is quite possibly the first VM
to work on optimizing code in debugging mode! This is possible because of our
years spent investing in meta-tracing research.
Happy testing,
Alex
No, you're not the first to make this pretentious mistake.
What's the report for code that was actually eliminated by optimizations? Was it covered? Was it not?
You misunderstand, John Doe. The coverage report is for the user's Python code, which isn't optimized, eliminated, or otherwise modified. The PyPy speedups come from a clever reimplementation of the interpreter that runs the user's Python code, and this article was explaining how they found and fixed a big slowdown that happens to be triggered by a common test-related library.
@John Doe: sadly, we fail to understand exactly what part of the blog post you're answering to in your sentence "No, you're not the first to make this pretentious mistake". Can you please give more context and elaborate a bit?
@Armin: I believe John Doe is talking about the last paragraph as I believe the JVM also does not disable optimizations when using debug tools.
If this is the case then his comment is silly, as Alex clearly stated he didn't do an exhaustive search.
Update on STM
Hi all,
The sprint in London was a lot of fun and very fruitful. In the last update on STM, Armin was working on improving and specializing the automatic barrier placement. There is still a lot to do in that area, but that work is merged now. Specializing and improving barrier placement is still to be done for the JIT.
But that is not all. Right after the sprint, we were able to squeeze out the last obvious bugs in the STM-JIT combination. However, the performance was nowhere near what we want. So far, we have fixed some of the most obvious issues. Many come from RPython erring on the side of caution and e.g. making a transaction inevitable even if that is not strictly necessary, thereby limiting parallelism. Another problem came from incrementing counters every time a guard fails, which caused transactions to conflict on these counter updates. Since these counters do not have to be completely accurate, we now update them non-transactionally, with a chance of small errors.
There are still many such performance issues of various complexity left to tackle: we are nowhere near done. So stay tuned or contribute :)
Performance
Now, since the JIT is all about performance, we want to at least show you some numbers that are indicative of things to come. Our set of STM benchmarks is very small unfortunately (something you can help us out with), so this is not representative of real-world performance. We tried to minimize the effect of JIT warm-up in the benchmark results.
The machine these benchmarks were executed on has 4 physical cores with Hyper-Threading (8 hardware threads).
Raytracer from stm-benchmarks: Render times in seconds for a 1024x1024 image:
Interpreter | Base time: 1 thread | 8 threads (speedup) |
---|---|---|
PyPy-2.1 | 2.47 | 2.56 (0.96x) |
CPython | 81.1 | 73.4 (1.1x) |
PyPy-STM | 50.2 | 10.8 (4.6x) |
For comparison, disabling the JIT gives 148s on PyPy-2.1 and 87s on PyPy-STM (with 8 threads).
Richards from PyPy repository on the stmgc-c4 branch: Average time per iteration in milliseconds:
Interpreter | Base time: 1 thread | 8 threads (speedup) |
---|---|---|
PyPy-2.1 | 15.6 | 15.4 (1.01x) |
CPython | 239 | 237 (1.01x) |
PyPy-STM | 371 | 116 (3.2x) |
For comparison, disabling the JIT gives 492ms on PyPy-2.1 and 538ms on PyPy-STM.
Try it!
All this can be found in the PyPy repository on the stmgc-c4 branch. Try it for yourself, but keep in mind that this is still experimental with a lot of things yet to come. Only Linux x64 is supported right now, but contributions are welcome.
You can download a prebuilt binary from here: https://bitbucket.org/pypy/pypy/downloads/pypy-oct13-stm.tar.bz2 (Linux x64 Ubuntu >= 12.04). This was made at revision bafcb0cdff48.
Summary
What the numbers tell us is that PyPy-STM is, as expected, the only one of the three interpreters where multithreading gives a large improvement in speed. What they also tell us is that, obviously, the result is not good enough yet: it still takes longer on an 8-threaded PyPy-STM than on a regular single-threaded PyPy-2.1. However, as you should know by now, we are good at promising speed and delivering it... years later :-)
But it has been two years already since PyPy-STM started, and this is our first preview of the JIT integration. Expect major improvements soon: with STM, the JIT generates code that is completely suboptimal in many cases (barriers, allocation, and more). Once we improve this, the performance of the STM-JITted code should come much closer to PyPy 2.1.
Cheers
Remi & Armin
To see a multithreading speed up in a python interpreter is awesome!
For the next update, I would suggest doing the benchmarking with hyperthreading turned off and measuring 1, 2 and 4 threads. That would give a better picture of how the STM implementation scales with threads/cores.
STM stands for Software Transactional Memory and is a way to run multiple non-conflicting tasks at the same time and make it appear as if they had run in sequence.
A bit off-topic, but just came across this paper:
"Speculative Staging for Interpreter Optimization
(...)
-- we report that our optimization makes the CPython interpreter up to more than four times faster, where our interpreter closes the gap between and sometimes even outperforms PyPy's just-in-time compiler."
https://arxiv.org/abs/1310.2300
Incremental Garbage Collector in PyPy
Hello everyone.
We're pleased to announce that as of today, the default PyPy comes with a GC that has much smaller pauses than yesterday.
Let's start with explaining roughly what GC pauses are. In CPython each object has a reference count, which is incremented each time we create references and decremented each time we forget them. This means that objects are freed each time they become unreachable. That is only half of the story though. First note that when the last reference to a large tree of objects goes away, you have a pause: all the objects are freed. Your program is not progressing at all during this pause, and this pause's duration can be arbitrarily large. This occurs at deterministic times, though. But consider code like this:
class A(object):
    pass

a = A()
b = A()
a.item = b
b.item = a
del a
del b
This creates a reference cycle. It means that while we deleted references to a and b from the current scope, they still have a reference count of 1, because they point to each other, even though the whole group has no references from the outside. CPython employs a cyclic garbage collector which is used to find such cycles. It walks over all objects in memory, starting from some known roots, such as type objects, variables on the stack, etc. This solves the problem, but can create noticeable, nondeterministic GC pauses as the heap becomes large and convoluted.
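As a small illustration (not part of the original post), CPython's gc module lets you run the cyclic collector by hand and see that it finds the cycle:

import gc

class A(object):
    pass

a = A()
b = A()
a.item = b
b.item = a
del a
del b

# The pair is unreachable but still alive because of the cycle; an explicit
# gc.collect() runs the cyclic collector and returns how many unreachable
# objects it found.
print(gc.collect())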
PyPy essentially has only the cycle finder - it does not bother with reference counting, instead it walks alive objects every now and then (this is a big simplification, PyPy's GC is much more complex than this). Although this might sound like a missing feature, it is really one of the reasons why PyPy is so fast, because at the end of the day the total time spent in managing the memory is lower in PyPy than CPython. However, as a result, PyPy also has the problem of GC pauses.
To alleviate this problem, which matters especially for applications like games, we started to work on an incremental GC, which spreads the walking of objects and cleaning them across the execution time in smaller intervals. The work was sponsored by the Raspberry Pi Foundation, started by Andrew Chambers and finished by Armin Rigo and Maciej Fijałkowski.
Benchmarks
Everyone loves benchmarks. We did not measure any significant speed difference on our quite extensive benchmark suite on speed.pypy.org. The main benchmark that we used for other comparisons was translating the topaz ruby interpreter using various versions of PyPy and CPython. The exact command was python <pypy-checkout>/bin/rpython -O2 --rtype targettopaz.py. Versions:
- topaz - dce3eef7b1910fc5600a4cd0afd6220543104823
- pypy source - defb5119e3c6
- pypy compiled with minimark (non-incremental GC) - d1a0c07b6586
- pypy compiled with incminimark (new, incremental GC) - 417a7117f8d7
- CPython - 2.7.3
The memory usage of CPython, PyPy with minimark and PyPy with incminimark is shown here. Note that this benchmark is quite bad for PyPy in general: the memory usage is higher and the amount of time taken is longer. This is due to the JIT warmup being both memory hungry and inefficient (see below). But first, note that the new GC is not worse than the old one.
EDIT: Red line is CPython, blue is incminimark (new), green is minimark (old).
The image was obtained by graphing the output of memusage.py.
However, the GC pauses are significantly smaller. For PyPy the way to get GC pauses is to measure the time between start and stop while running stuff with PYPYLOG=gc-collect:log pypy program.py; for CPython, the magic incantation is gc.set_debug(gc.DEBUG_STATS) and parsing the output. For what it's worth, the average and total for CPython, as well as the total number of events, are not directly comparable, since it only shows the cyclic collector, not the reference counting. The only comparable thing is the number of long pauses and their duration. In the table below, pause durations are sorted into 8 buckets, each meaning "below or equal to that threshold". The output is generated using the gcanalyze tool.
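Concretely, the measurements described above can be gathered roughly like this (a sketch; the log file name is arbitrary):

# On CPython: make the cyclic collector print statistics for every
# collection to stderr, then run the workload and parse that output.
import gc
gc.set_debug(gc.DEBUG_STATS)

# On PyPy (shell command, quoted here as a comment):
#   PYPYLOG=gc-collect:log pypy program.py
# then compute each pause as the time between the matching start/stop
# entries of the gc-collect sections in the file "log".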
CPython:
150.1ms | 300.2ms | 450.3ms | 600.5ms | 750.6ms | 900.7ms | 1050.8ms | 1200.9ms |
---|---|---|---|---|---|---|---|
5417 | 5 | 3 | 2 | 1 | 1 | 0 | 1 |
PyPy minimark (non-incremental GC):
216.4ms | 432.8ms | 649.2ms | 865.6ms | 1082.0ms | 1298.4ms | 1514.8ms | 1731.2ms |
---|---|---|---|---|---|---|---|
27 | 14 | 6 | 4 | 6 | 5 | 3 | 3 |
PyPy incminimark (new incremental GC):
15.7ms | 31.4ms | 47.1ms | 62.8ms | 78.6ms | 94.3ms | 110.0ms | 125.7ms |
---|---|---|---|---|---|---|---|
25512 | 122 | 4 | 1 | 0 | 0 | 0 | 2 |
As we can see, while there is still work to be done (the 100ms ones could be split among several steps), we did improve the situation quite drastically without any actual performance difference.
Note about the benchmark - we know it's a pretty extreme case of JIT warmup, we know we suck on it, we're working on it and we're not afraid of showing PyPy is not always the best ;-)
Nitty gritty details
Here are some nitty gritty details for people really interested in Garbage Collection. This was done as a patch to "minimark", our current GC, and called "incminimark" for now. The former is a generational stop-the-world GC. New objects are allocated "young", which means that they initially live in the "nursery", a special zone of a few MB of memory. When the nursery is full, a "minor collection" step moves the surviving objects out of the nursery. This can be done quickly (a few milliseconds) because we only need to walk through the young objects that survive --- usually a small fraction of all young objects; and also by far not all objects that are alive at this point, but only the young ones. However, from time to time this minor collection is followed by a "major collection": in that step, we really need to walk all objects to classify which ones are still alive and which ones are now dead ("marking") and free the memory occupied by the dead ones ("sweeping"). You can read more details here.
This "major collection" is what gives the long GC pauses. To fix this problem we made the GC incremental: instead of running one complete major collection, we split its work into a variable number of pieces and run each piece after every minor collection for a while, until there are no more pieces. The pieces are each doing a fraction of marking, or a fraction of sweeping. It adds some few milliseconds after each of these minor collections, rather than requiring hundreds of milliseconds in one go.
The main issue is that splitting the major collections means that the main program is actually running between the pieces, and so it can change the pointers in the objects to point to other objects. This is not a problem for sweeping: dead objects will remain dead whatever the main program does. However, it is a problem for marking. Let us see why.
In terms of the incremental GC literature, objects are either "white", "gray" or "black". This is called tri-color marking. See for example this blog post about Rubinius, or this page about LuaJIT or the wikipedia description. The objects start as "white" at the beginning of marking; become "gray" when they are found to be alive; and become "black" when they have been fully traversed. Marking proceeds by scanning gray objects for pointers to white objects. The white objects found are turned gray, and the gray objects scanned are turned black. When there are no more gray objects, the marking phase is complete: all remaining white objects are truly unreachable and can be freed (by the following sweeping phase).
In this model, the important part is that a black object can never point to a white object: if the latter remains white until the end, it will be freed, which is incorrect because the black object itself can still be reached. How do we ensure that the main program, running in the middle of marking, will not try to write a pointer to a white object into a black object? This requires a "write barrier", i.e. a piece of code that runs every time we set a pointer into an object or array. This piece of code checks if some (hopefully rare) condition is met, and calls a function if that is the case.
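Schematically, a classical "forward" write barrier for tri-color marking could look like this (plain-Python sketch, not PyPy's RPython code):

gray_objects = []

def write_field(obj, field, value):
    # Preserve the invariant "a black object never points to a white one":
    # if it is about to be violated, turn the white target gray so that
    # marking will still visit it.
    if obj.color == "black" and value.color == "white":
        value.color = "gray"
        gray_objects.append(value)
    setattr(obj, field, value)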
The trick we used in PyPy is to consider minor collections as part of the whole, rather than focus only on major collections. The existing minimark GC had always used a write barrier of its own to do its job, like any generational GC. This existing write barrier is used to detect when an old object (outside the nursery) is modified to point to a young object (inside the nursery), which is essential information for minor collections. Actually, although this was the goal, the actual write barrier code is simpler: it just records all old objects into which we write any pointer --- to a young or old object. As we found out over time, doing so is not actually slower, and might actually be a performance improvement: for example, if the main program does a lot of writes into the same old object, we don't need to check over and over again if the written pointer points to a young object or not. We just record the old object in some list the first time, and that's it.
The trick is that this unmodified write barrier works for incminimark too. Imagine that we are in the middle of the marking phase, running the main program. The write barrier will record all old objects that are being modified. Then at the next minor collection, all surviving young objects will be moved out of the nursery. At this point, as we're about to continue running the major collection's marking phase, we simply add to the list of pending gray objects all the objects that we just considered --- both the objects listed as "old objects that are being modified", and the objects that we just moved out of the nursery. A fraction of the objects in the former list were black; so this means that they are turned back from black to gray. This technique implements nicely, if indirectly, what is called a "backward write barrier" in the literature. The "backwardness" refers to the color being changed against the usual direction "white -> gray -> black", thus making more work for the GC. (This is as opposed to a "forward write barrier", where we would also detect "black -> white" writes but turn the white object gray instead.)
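For contrast, here is a sketch (again with made-up names) of the generational barrier that incminimark reuses unchanged, together with the re-graying step performed at the next minor collection:

modified_old_objects = []

def write_field(obj, field, value):
    # Generational write barrier: the first time any pointer is written
    # into an old object, remember that object; no color check at all.
    if obj.is_old and not obj.in_modified_list:
        obj.in_modified_list = True
        modified_old_objects.append(obj)
    setattr(obj, field, value)

def at_minor_collection(gc):
    # Re-gray everything that might now point to a white object: the
    # recorded old objects (some of which were black) and the young
    # objects just moved out of the nursery.
    for obj in modified_old_objects + gc.objects_just_moved_out_of_nursery:
        obj.color = "gray"
        gc.gray_objects.append(obj)
    del modified_old_objects[:]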
In summary, I realize that this description is less about how we turned minimark into incminimark, and more about how we differ from the standard way of making a GC incremental. What we really had to do to make incminimark was to write logic that says "if the major collection is in the middle of the marking phase, then add this object to the list of gray objects", and put it at a few places throughout minor collection. Then we simply split a major collection into increments, doing marking or sweeping of some (relatively arbitrary) number of objects before returning. That's why, after we found that the existing write barrier would do, it was not much actual work, and could be done without major changes. For example, not a single line from the JIT needed adaptation. All in all it was relatively painless work. ;-)
Cheers,
armin and fijal
Thank you for this nice explanation.
Which mechanism do you use to avoid adding an old object twice to the list of modified old objects?
Very clever! But eh, your graphs show that your program is using 2-3x the memory of CPython. How much faster is your program overall in exchange for this hugely larger memory usage?
@François: a flag on the object. All old objects have this flag initially, and we use it to detect if the write barrier must trigger. We remove it when the write barrier has triggered once. We re-add it during the following minor collection.
@Anonymous: this program is slower on PyPy too. The point of the benchmark is to show that incminimark gives the same results as minimark, and to show that the JIT has bad cases. Running the same program for a much longer time (5-10x) lets PyPy slowly catch up and eventually beat CPython by a factor of 2. The memory usage evens out at around 4 or 4.5GB (and I'd expect even larger examples to show lower consumption on PyPy, but that's mostly a guess).
Thanks for moving Python forward!
How does incminimark compare to Azul's C4 JVM GC and HotSpot's G1 GC?
In other words, are there strong guarantees that for big heap sizes, e.g. 12 GB, the GC pauses will not exceed some value, e.g. 100ms?
Sounds like great progress, but I hope you understand that even 15-30ms is way too much for games. That's 1-2 frames. It needs to be an order of magnitude less to ensure smooth FPS.
Do you have plans to give the program any say in whether the GC should strive for low latency vs. high throughput?
Great writeup, explaining that kind of concept in a clear way is not easy. And well done on the unequivocal improvements :)
@Anonymous: Yes you'll still miss some frames, but compared to a 1 second pause, pypy suddenly became usable for games. 55fps (over what duration did those 25K collections happen?) is not perfect, but most users won't notice. That said, it *would* be nice to be able to tune latency vs throughput.
@Anonymous: our incminimark comes with no serious strong guarantee. I still think it's enough for most games, say, if "almost all" the pauses are around 10ms. It's also tweakable (see the PYPY_GC_* environment variables documented in rpython/memory/gc/incminimark.py, and try to call something like gc.collect(1) at the end of each frame).
Anyway, at around the same time scale is the time spent JITting, which also causes apparent pauses in the program. I think that fixing it all with really strong guarantees is a much, much harder problem. CPython doesn't give any guarantee either, as explained at the start of the blog post.
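As an illustration of the gc.collect(1) suggestion above (a sketch only; run_frame() is a placeholder for the game's real per-frame work):

import gc

def run_frame():
    pass  # placeholder: input handling, world update, rendering, ...

for frame in range(10000):
    run_frame()
    # On PyPy, ask the GC to do a small step of its pending work now,
    # at the point in the frame where a short pause is least noticeable.
    gc.collect(1)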
@Armin Rigo: Yeah, it's no use pushing GC pauses much lower than other pauses, but that just means other things need improving as well. ;) If I had to draw an arbitrary line, I'd say half a frame (i.e. 8ms for 60fps, 4ms for 120fps 3D) is probably a good target for the maximum.
The thing with CPython is that you can turn the GC off and still have everything non-cyclic collected. So with enough attention to detail you can avoid GC pauses completely.
BTW, is any work ongoing with regard to fully concurrent GC?
You *cannot* avoid GC pauses in CPython: see the first paragraph of the blog post. You can only make the GC pauses deterministic, by disabling the cyclic collector. Then you can hack the program as needed to reduce GC pauses if there are some.
Thanks for this really educational and accessible explanation - it's rare to find such a concise and clear piece of writing that a non expert can understand on the subject of GC.
This needs visualization of processes to win Wikipedia article of the month.
You can also get arbitrarily long "gc" pauses in CPython by removing the last reference to some deeply nested data structure...
Wow PyPy keeps paying off!
I am so glad you guys have time (and hopefully funding) push dynamic language world forward!
If you fork() the whole process and do the marking on the frozen forked copy (copy-on-write), then you can be fully incremental without pauses, as long as you've got enough spare system memory compared to the process size (the main process keeps growing while you're marking, and the pathological case of copy-on-write is 2x, however unlikely).
Numpy Status Update
Hi everyone,
Thanks to the people who donated money to the numpy proposal, here is what I've been working on recently:
- Fixed conversion from a numpy complex number to a python complex number
- Implemented the rint ufunc
- Made numpy.character usable as a dtype
- Fixed ndarray(dtype=str).fill()
- Various fixes on boolean and fancy indexing
Cheers
Romain
PyCon South Africa & sprint
Hi all,
For those of you that happen to be from South Africa: don't miss PyCon ZA 2013, next October 3rd and 4th! Like last year, a few of us will be there. There will be the first talk about STM getting ready (a blog post about that should follow soon).
Moreover, general sprints will continue on the weekend (5th and 6th). Afterwards, Fijal will host a longer PyPy sprint (marathon?) with me until around the 21st. You are welcome to it as well! Write to the mailing list or to fijal directly (fijall at gmail.com), or simply in the comments of this post.
--- Armin
Hey lads, any chance of a 64-bit ARM pypy build?
now that hardware is finally generally available...
I'm sure someone at the conference has the hw, perhaps already rooted?
We don't have access to 64-bit ARM hardware. Feel free to help us; also, it's quite a bit of work.
Hey Armin
Thanks for the awesome presentations :-)
I'm very excited to try it out soon. I was wondering, would it not be useful to try and get the "with atomic" statement at the very least working on regular CPython? (just operating on the GIL, or simulated with a lock). This could smooth over migration somewhat?
Also, thanks for your live demo of cffi, it is so much simpler than ctypes :-)
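To sketch the commenter's idea above (a hypothetical helper, not an existing PyPy or CPython API): on regular CPython, a "with atomic" block could be approximated by one global lock, which simply serializes the threads that use it.

import threading
from contextlib import contextmanager

_atomic_lock = threading.RLock()

@contextmanager
def atomic():
    # On CPython this merely serializes cooperating threads; on PyPy-STM
    # the real "with atomic" would run the block as a single transaction.
    with _atomic_lock:
        yield

# usage:
# with atomic():
#     balance += amount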
Slides of the PyPy London Demo Evening
The slides of the London demo evening are now online:
Is there a better link to the slides? Viewing them on the blog is difficult.
Clicking the full screen button makes them easy to read for me. Maybe try that?
Could there perhaps be videos from the presentation?
big up for good work!
NumPy road forward
Hello everyone.
This is the roadmap for numpy effort in PyPy as discussed on the London sprint. First, the highest on our priority list is to finish the low-level part of the numpy module. What we'll do is to finish the RPython part of numpy and provide a pip installable numpypy repository that includes the pure python part of Numpy. This would contain the original Numpy with a few minor changes.
Second, we need to work on the JIT support that will make NumPy on PyPy faster. In detail:
- reenable the lazy loop evaluation
- optimize bridges, which depends on optimizer refactorings
- SSE support
On the compatibility front, there were some independent attempts at making the following work:
- f2py
- C API (in fact, PyArray_* API is partly present in the nightly builds of PyPy)
- matplotlib (both using PyArray_* API and embedding CPython runtime in PyPy)
- scipy
In order to make all of the above happen faster, it would be helpful to raise more funds. You can donate to PyPy's NumPy project on our website. Note that PyPy is a member of the SFC, which is a 501(c)(3) US non-profit, so donations from US companies can be tax-deductible.
Cheers,
fijal, arigo, ronan, rguillebert, anto and others
Thanks for the update. I'm hoping the other presentations can also be summarized here for those who couldn't attend this (very interesting) mini-conference.
Thanks for the info! I can't wait to play with it.
I only have a very rudimentary understanding of numpypy and pypy, so please forgive if this is a stupid question:
Will there be a way to do additional high level optimization steps before the JIT level?
I.e. elimination of temporaries for matrices, expression optimization and so on?
Basically check if the expression should be handled by the pypy JIT, or if it should be passed on to something like numexpr
https://code.google.com/p/numexpr/
that will itself hand over the code to optimized vendor libraries?
I am a bit concerned that while the pypy JIT optimizations are without question very impressive and probably close to optimal to what can be done for generic code, the performance issues with numerical code are very different.
Any JIT will (please correct me if I am wrong, this would be a significant breakthrough) never be able to even come close to what a vendor library like the MKL can do.
The comparison will be even more to the disadvantage of the JIT if one uses a library like Theano that runs the code on the GPU.
For my work, beating c for speed is not enough anymore, the challenges are how to run the computation in parallel, how to call optimized libraries without pain and how to use a GPU without re-writing the entire program and learning about a completely new system.
Will libraries like numexpr, numba and theano be able to run under pypy, and will it eventually be possible to automatically hand over numerical expressions automatically to these libraries?
Hi Dan.
Yes, pypy will do the removal of temporary matrices; this is a very basic optimization that we had, but disabled for a while to simplify development.
I don't think numba, numexpr or theano would ever work on PyPy (I would ask their authors though), but I personally think we can match their performance or even exceed it, time will tell though.
Cheers,
fijal
Hi Maciej,
Thanks for the answer.
A pypy that matches what numba or theano can do, all without doing any extra annotation, would not only be a huge breakthrough for pypy, it would be a gigantic step forward for the entire numerics community.
Thank you and keep up the good work,
Dan
@Anonymous: I'd like to point out again that all this NumPy work would get more traction and faster development within PyPy if we could manage to interest (and get contributions from) anyone that comes from the scientific community. Ourselves, we are looking at this topic as a smallish part of the whole Python world, so we disagree (to a point) with your comment "a huge breakthrough for pypy". :-)
Preliminary London Demo Evening Agenda
We now have a preliminary agenda for the demo evening in London next week. It takes place on Tuesday, August 27 2013, 18:30-19:30 (BST) at King's College London, Strand. The preliminary agenda is as follows:
- Laurence Tratt: Welcome from the Software Development Team
- Carl Friedrich Bolz: A Short Introduction to PyPy
- Maciej Fijałkowski: Numpy on PyPy, Present State and Outlook
- Lukas Diekmann: Collection Strategies for Fast Containers in PyPy
- Armin Rigo: Software Transactional Memory for PyPy
- Edd Barrett: Unipycation: Combining Prolog and Python
All the talks are lightning talks. Afterwards there will be plenty of time for discussion.
There are still free spots; if you want to come, please register on the Eventbrite page. Hope to see you there!
The Win32 build is here, thanks Matti! https://bitbucket.org/pypy/pypy/downloads/pypy-2.2-win32.zip
Congrats! adb push pypypypy /sdcard/!
@foobie42 that's what I've done just a second ago! Gotta unpack raspbian chroot zip now...
Is speed.pypy.org still updated? The second graph on https://speed.pypy.org/ only shows 2.0 beta and trunk, and https://speed.pypy.org/comparison/ doesn't offer 2.1 or 2.2 either.
I managed to update it, check it out now
Do you have plans to support python 3.3 features?