Update on STM
Hi all,
A quick update on Software Transactional Memory. We are working on two fronts.
On the one hand, the integration of the "c4" C library with PyPy is done and works well, but is still subject to improvements. The "PyPy-STM" executable (without the JIT) seems to be stable, as far as it has been tested. It runs a simple benchmark like Richards with a 3.2x slow-down over a regular JIT-less PyPy.
The main factor in this slow-down is the numerous "barriers" in the code: checks needed almost everywhere to verify that a pointer to an object points to a recent enough version, and if not, to fetch the most recent version. These barriers are inserted automatically during translation; there is no need for us to manually put 42 million barriers in the source code of PyPy. But the automatic insertion currently uses a primitive algorithm, which usually ends up inserting more barriers than the theoretical optimum. I (Armin) am trying to improve that, and progressing: last week the slow-down was around 4.5x. This work is done in the branch stmgc-static-barrier.
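To make the idea concrete, here is a minimal Python-level sketch of what a read barrier conceptually does. This is an illustration only, with invented names; the real barriers are C-level checks inserted by the translation toolchain:

    # Conceptual sketch only; not the actual stmgc API.
    class GCObject:
        def __init__(self):
            self.newer_version = None   # set when a newer copy is committed

    def read_barrier(obj):
        # Another transaction may have committed a newer version of
        # `obj`; follow the forwarding pointers to the latest one.
        while obj.newer_version is not None:
            obj = obj.newer_version
        return obj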
On the other hand, Remi is progressing on the JIT integration in the branch stmgc-c4. This has been working in simple cases for a couple of weeks now, but the resulting "PyPy-JIT-STM" often crashes. This is because, while the basics are not really hard, we keep hitting new issues that must be resolved.
The basics are that whenever the JIT is about to generate assembler corresponding to a load or a store in a GC object, it must first generate a bit of extra assembler corresponding to the barrier that we need. This works fine by now (but could benefit from the same kind of optimizations described above, to reduce the number of barriers). The additional issues are all more subtle; I will describe the current one as an example: how to write constant pointers into the assembler.
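Schematically, the rule is simple; the helper names below are made up for illustration and do not correspond to PyPy's actual JIT backend API:

    # Illustrative pseudocode: every load or store of a GC object is
    # preceded by the matching barrier code (all names hypothetical).
    def emit_getfield(asm, obj_reg, field_offset):
        emit_read_barrier(asm, obj_reg)    # inline fast path + slow-path call
        asm.emit_load(obj_reg, field_offset)

    def emit_setfield(asm, obj_reg, field_offset, value_reg):
        emit_write_barrier(asm, obj_reg)   # may redirect to a private copy
        asm.emit_store(obj_reg, field_offset, value_reg)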
Remember that the STM library classifies objects as either "public" or "protected/private". A "protected/private" object is one which has not been seen by another thread so far. This is essential as an optimization, because we know that no other thread will access our protected or private objects in parallel, and thus we are free to modify their content in place. By contrast, public objects are frozen, and to do any change, we first need to build a different (protected) copy of the object. See this blog post for more details.
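In pseudocode, the decision described above might look like this (again a conceptual sketch with invented names, not the real stmgc interface):

    import copy

    def write_barrier(obj, thread):
        if not obj.is_public:
            # Protected/private: no other thread has seen this object,
            # so it is safe to modify in place.
            return obj
        # Public objects are frozen: build a protected copy for this
        # thread and perform all further writes on that copy.
        new = copy.copy(obj)
        new.is_public = False
        thread.protected_copies[obj] = new
        return new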
So far so good, but the JIT will sometimes (actually often) hard-code constant pointers into the assembler it produces. For example, this is the case when the Python code being JITted creates an instance of a known class; the corresponding assembler produced by the JIT will reserve the memory for the instance and then write the constant type pointer in it. This type pointer is a GC object (in the simple model, it's the Python class object; in PyPy it's actually the "map" object, which is a different story).
The problem right now is that this constant pointer may point to a protected object. This is a problem because the same piece of assembler can later be executed by a different thread. If it does, then this different thread will create instances whose type pointer is bogus: looking like a protected object, but actually protected by a different thread. Any attempt to use this type pointer to change anything on the class itself will likely crash: the threads will all think they can safely change it in-place. To fix this, we need to make sure we only write pointers to public objects in the assembler. This is a bit involved because we need to ensure that there is a public version of the object to start with.
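Conceptually, the fix looks like this (hypothetical names; the real logic lives in the JIT backend):

    # Sketch of "only embed pointers to public objects";
    # find_or_make_public_version is an invented helper.
    def constant_pointer_for_assembler(obj):
        if not obj.is_public:
            # Ensure a public version exists, and embed a pointer to it
            # rather than to the thread-local protected version.
            obj = find_or_make_public_version(obj)
        return obj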
When this is done, we will likely hit the next problem, and the next one; but at some point it should converge (hopefully!) and we'll give you our first PyPy-JIT-STM ready to try. Stay tuned :-)
A bientôt,
Armin.
NumPyPy Status Update
Hello everyone,
As expected, nditer is a lot of work. I'm going to pause my work on it for now and focus on simpler and more important things. Here is a list of what I implemented, with a short nditer example after the list:
- Fixed a bug on 32 bit that made int32(123).dtype == dtype("int32") fail
- Fixed a bug on the pickling of array slices
- The external loop flag is implemented on the nditer class
- The c_index, f_index and multi_index flags are also implemented
- Added dtype("double") and dtype("str")
- C-style iteration is available for nditer
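For reference, here is how these flags look with the regular numpy nditer API, which numpypy aims to match (this is standard numpy usage, not numpypy-specific code):

    import numpy as np

    a = np.arange(6).reshape(2, 3)

    # C-style (row-major) iteration:
    for x in np.nditer(a, order='C'):
        print(x)              # 0 1 2 3 4 5, one scalar at a time

    # external_loop yields whole contiguous chunks instead of scalars:
    for chunk in np.nditer(a, flags=['external_loop']):
        print(chunk)          # [0 1 2 3 4 5]

    # multi_index tracks the position in each dimension:
    it = np.nditer(a, flags=['multi_index'])
    while not it.finished:
        print(it.multi_index, it[0])
        it.iternext()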
Romain Guillebert
PyPy 2.1 - Considered ARMful
We're pleased to announce PyPy 2.1, which targets version 2.7.3 of the Python
language. This is the first release with official support for ARM processors in the JIT.
This release also contains several bugfixes and performance improvements.
You can download the PyPy 2.1 release here:
https://pypy.org/download.html
We would like to thank the Raspberry Pi Foundation for supporting the work
to finish PyPy's ARM support.
The first beta of PyPy3 2.1, targeting version 3 of the Python language, was
just released; more details can be found here.
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It's fast (pypy 2.1 and cpython 2.7.2 performance comparison) due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64 or Windows 32. This release also supports ARM machines running Linux 32bit - anything with ARMv6 (like the Raspberry Pi) or ARMv7 (like the Beagleboard, Chromebook, Cubieboard, etc.) that supports VFPv3 should work. Both hard-float armhf/gnueabihf and soft-float armel/gnueabi builds are provided. The armhf builds for Raspbian are created using the Raspberry Pi custom cross-compilation toolchain based on gcc-arm-linux-gnueabihf and should work on ARMv6 and ARMv7 devices running Debian or Raspbian. The armel builds are built using the gcc-arm-linux-gnueabi toolchain provided by Ubuntu and currently target ARMv7.
Windows 64 work is still stalling; we would welcome a volunteer to handle that.
Highlights
- JIT support for ARM, architecture versions 6 and 7, hard- and soft-float ABI
- Stacklet support for ARM
- Support for os.statvfs and os.fstatvfs on unix systems (a short example follows this list)
- Improved logging performance
- Faster sets for objects
- Interpreter improvements
- During packaging, compile the CFFI based TK extension
- Pickling of numpy arrays and dtypes
- Subarrays for numpy
- Bugfixes to numpy
- Bugfixes to cffi and ctypes
- Bugfixes to the x86 stacklet support
- Fixed issue 1533: fix an RPython-level OverflowError for space.float_w(w_big_long_number).
- Fixed issue 1552: GreenletExit should inherit from BaseException.
- Fixed issue 1537: numpypy __array_interface__
- Fixed issue 1238: Writing to an SSL socket in PyPy sometimes failed with a "bad write retry" message.
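As a quick illustration of the newly supported os.statvfs (this is the standard library API, shown here querying the root filesystem):

    import os

    st = os.statvfs('/')
    # f_frsize is the fragment size; f_bavail counts the blocks
    # available to unprivileged users.
    free_mib = st.f_bavail * st.f_frsize // (1024 * 1024)
    print("free space:", free_mib, "MiB")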
Cheers,
David Schneider for the PyPy team.
What about gevent support in this release? I am waiting for full support to switch to PyPy in production.
Some issues with gevent were fixed. You need to try it out and report any remaining issues.
If I read correctly, you did not use any ThumbEE instructions for your ARM support? So there is room for improvement?
Has cdecimal been backported into either version of PyPy yet? If not, any near-term plan to do so?
cdecimal is purely a speed gain. On PyPy the pure Python decimal.py is accelerated by the JIT, though it is probably possible to gain some small extra factor by rewriting it directly in RPython.
If your problem is merely that project X has listed cdecimal in its dependencies, then we could add a "cdecimal.egg-info" file that says "yup, it's installed" and be done (assuming that the API is really the same one as decimal.py).
cdecimal is actually based on a C library (libmpdec). Maybe a ffi-based binding could give interesting results.
Importing sqlite3 incurs a huge delay in the latest armhf jit nightly (15 August).
PyPy Demo Evening in London, August 27, 2013
As promised in the London sprint announcement we are organising a PyPy demo evening during the London sprint on Tuesday, August 27 2013, 18:30-19:30 (BST). The description of the event is below. If you want to come, please register on the Eventbrite page.
PyPy is a fast Python VM. Maybe you've never used PyPy and want to find out what use it might be for you? Or you and your organisation have been using it and you want to find out more about how it works under the hood? If so, this demo session is for you!
Members of the PyPy team will give a series of lightning talks on PyPy: its benefits; how it works; research currently being undertaken to make it faster; and unusual uses it can be put to. Speakers will be available afterwards for informal discussions. This is the first time an event like this has been held in the UK, and is a unique opportunity to speak to core people. Speakers confirmed thus far include: Armin Rigo, Maciej Fijałkowski, Carl Friedrich Bolz, Lukas Diekmann, Laurence Tratt, Edd Barrett.
The venue for this talk is the Software Development Team, King's College London. The main entrance is on the Strand, from where the room for the event will be clearly signposted. Travel directions can be found at https://www.kcl.ac.uk/campuslife/campuses/directions/strand.aspx
If you have any questions about the event, please contact Laurence Tratt.
PyPy3 2.1 beta 1
We're pleased to announce the first beta of the upcoming 2.1 release of
PyPy3. This is the first release of PyPy which targets Python 3 (3.2.3)
compatibility.
We would like to thank all of the people who donated to the py3k proposal
for supporting the work that went into this and future releases.
You can download the PyPy3 2.1 beta 1 release here:
https://pypy.org/download.html#pypy3-2-1-beta-1
Highlights
- The first release of PyPy3: support for Python 3, targeting CPython 3.2.3!
- There are some known issues including performance regressions (issues
#1540 & #1541) slated to be resolved before the final release.
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for
CPython 2.7.3 or 3.2.3. It's fast due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64 or Windows
32. Also this release supports ARM machines running Linux 32bit - anything with
ARMv6 (like the Raspberry Pi) or ARMv7 (like Beagleboard,
Chromebook, Cubieboard, etc.) that supports VFPv3 should work.
Windows 64 work is still stalling and we would welcome a volunteer to handle
that.
How to use PyPy?
We suggest using PyPy from a virtualenv. Once you have a virtualenv
installed, you can follow instructions from pypy documentation on how
to proceed. This document also covers other installation schemes.
Cheers,
the PyPy team
This is *really* cool!
Thank you for bringing PyPy to Python 3! This should make it much easier to continue work on one of my projects (it was on hold: PyPy made it much faster, but I had to convert it from Python 3 to Python 2 to run it, and that became a maintenance nightmare).
PyPy 2.1 beta 2
We're pleased to announce the second beta of the upcoming 2.1 release of PyPy.
This beta adds one new feature to the 2.1 release and contains several bugfixes listed below.
You can download the PyPy 2.1 beta 2 release here:
https://pypy.org/download.html
Highlights
- Support for os.statvfs and os.fstatvfs on unix systems.
- Fixed issue 1533: fix an RPython-level OverflowError for space.float_w(w_big_long_number).
- Fixed issue 1552: GreenletExit should inherit from BaseException.
- Fixed issue 1537: numpypy __array_interface__
- Fixed issue 1238: Writing to an SSL socket in pypy sometimes failed with a "bad write retry" message.
- distutils: copy CPython's implementation of customize_compiler; don't call split on environment variables; honour CFLAGS, CPPFLAGS, LDSHARED and LDFLAGS.
- During packaging, compile the CFFI tk extension.
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for
CPython 2.7.3. It's fast due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64 or Windows
32. Also this release supports ARM machines running Linux 32bit - anything with
ARMv6 (like the Raspberry Pi) or ARMv7 (like Beagleboard,
Chromebook, Cubieboard, etc.) that supports VFPv3 should work.
Windows 64 work is still stalling, we would welcome a volunteer
to handle that.
How to use PyPy?
We suggest using PyPy from a virtualenv. Once you have a virtualenv
installed, you can follow instructions from pypy documentation on how
to proceed. This document also covers other installation schemes.
Cheers,
The PyPy Team.
PyPy San Francisco Sprint July 27th 2013
The next PyPy sprint will be in San Francisco, California. It is a public
sprint, suitable for newcomers. It will run on Saturday July 27th.
Some possible things people will be hacking on at the sprint:
- running your software on PyPy
- making your software fast on PyPy
- improving PyPy's JIT
- improving Twisted on PyPy
- any exciting stuff you can think of
If there are newcomers, we'll run an introduction to hacking on PyPy.
Location
The sprint will be held at the Rackspace Office:
620 Folsom St, Ste 100
The doors will open at 10 AM and the sprint will run until 6 PM.
PyPy London Sprint (August 26 - September 1 2013)
The next PyPy sprint will be in London, United Kingdom for the first time. This is a fully public sprint. PyPy sprints are a very good way to get into PyPy development and no prior PyPy knowledge is necessary.
Goals and topics of the sprint
For newcomers:
- bring your application/library and we'll help you port it to PyPy, benchmark and profile
- come and write your favorite missing numpy function
- help us work on developer tools like jitviewer
We'll also work on:
- refactoring the JIT optimizations
- STM and STM-related topics
- anything else attendees are interested in
Exact times
The work days should be August 26 - September 1 2013 (Monday-Sunday). The official plans are for people to arrive on the 26th, and to leave on the 2nd. There will be a break day in the middle. We'll typically start at 10:00 in the morning.
Location
The sprint will happen within a room of King's College's Strand Campus in Central London, UK. There are some travel instructions how to get there. We are being hosted by Laurence Tratt and the Software Development Team.
Demo Session
If you don't want to come to the full sprint, but still want to chat a bit, we are planning to have a demo session on Tuesday August 27. We will announce this separately on the blog. If you are interested, please leave a comment.
Registration
If you want to attend, please register by adding yourself to the "people.txt" file in Mercurial:
https://foss.heptapod.net/pypy/extradoc/-/blob/branch/default/extradoc/sprintinfo/london-2013 (repository formerly at https://bitbucket.org/pypy/extradoc/)
or on the pypy-dev mailing list if you do not yet have check-in rights:
https://mail.python.org/mailman/listinfo/pypy-dev
Remember that you may need a (insert country here)-to-UK power adapter. Please note that the UK is not within the Schengen zone, so non-EU and non-Swiss citizens may require a visa. Please check travel regulations. Also, the UK uses the pound sterling (GBP).
Cannot quite get a week off for this, but would be very interested in the demo session on the Tuesday.
Software Transactional Memory lisp experiments
As covered in the previous blog post, the STM subproject of PyPy has been back on the drawing board. The result of this experiment is an STM-aware garbage collector written in C. This is now finished: thanks to Armin's and Remi's work, we have a fully functional garbage collector and an STM system that can be used from any C program with enough effort. Using it is more than a little mundane, since you have to insert write and read barriers by hand everywhere in your code that reads or writes to garbage-collector-controlled memory. In the PyPy integration, this manual work is done automatically by the STM transformation in the interpreter.
However, to experiment some more, we created a minimal lisp-like/scheme-like interpreter (called Duhton) that closely follows CPython's implementation strategy. For anyone familiar with CPython's source code, it should be pretty readable. This interpreter works like a normal and very basic lisp variant; however, it comes with a transaction builtin that lets you spawn transactions using the STM system (a conceptual sketch follows the demo list below). We implemented a few demos that let you play with the transaction system. All the demos run without conflicts, which means there are no conflicting writes to global memory, and hence the demos are very amenable to parallelization. They exercise:
- arithmetic: demo/many_square_roots.duh
- read-only access to globals: demo/trees.duh
- read-write access to local objects: demo/trees2.duh
The latter two are very similar to the classic gcbench. The STM-aware Duhton can be found in the stmgc repo, while the STM-less Duhton, which uses reference counting, can be found in the duhton repo under the base branch.
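To give a flavour of the programming model, here is a conceptual sketch, in Python pseudocode, of what the transaction builtin provides; Duhton itself is lisp-like, and the names below are illustrative only:

    # Conceptual model of Duhton's `transaction` builtin.
    pending = []

    def transaction(func, *args):
        # Does not run func immediately: it schedules func(*args) to
        # run later, in its own transaction.
        pending.append((func, args))

    # When the main program finishes, the runtime runs the pending
    # transactions in parallel threads.  Conflicting writes cause one
    # transaction to abort and retry, keeping behaviour serializable.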
Below are some benchmarks. Note that this is a bit of an apples-to-oranges comparison, since the single-threaded Duhton uses a refcounting GC while the STM version uses a generational GC. Future PyPy benchmarks will compare more apples to apples. Moreover, none of the benchmarks has any conflicts. Time is the total wall-clock time that the benchmark took (not the CPU time), and there was very little variation between consecutive runs (definitely below 5%).
benchmark | 1 thread (refcount) | 1 thread (stm) | 2 threads | 4 threads
square    | 1.9s                | 3.5s           | 1.8s      | 0.9s
trees     | 0.6s                | 1.0s           | 0.54s     | 0.28s
trees2    | 1.4s                | 2.2s           | 1.1s      | 0.57s
As you can see, the slow-down of STM versus a single thread is significant (1.8x, 1.7x and 1.6x respectively; e.g. 3.5s / 1.9s ≈ 1.8 for square), but still below 2x. On the other hand, running on multiple threads parallelizes the problem almost perfectly: the 4-thread runs are close to 4x faster than the 1-thread STM runs.
This is a significant milestone; we hope the next blog post will cover an STM-enabled PyPy that is fully working, with JIT work ongoing.
Cheers,
fijal on behalf of Remi Meier and Armin Rigo
I hacked on it a bit: I inserted a likely hint on the early exit of the spinlock acquisition, added Haswell xacquire/xrelease hints on spinlock acquisition and release, and compiled with Haswell-optimized flags.
The resulting scaling factors from 1 to 4 threads for the tests were 1.92, 1.87 and 1.88. I think that's already quite close to 2.
I think this is OK, but not extraordinary.
Just to clarify my comment above: those were the average scaling factors per doubling of threads. So the 4-thread version actually ran 3.67, 3.50 and 3.54 times faster than the single-threaded version (e.g. 1.92 × 1.92 ≈ 3.67).
Cool that you hacked on it! Note however that spinlock acquisition is not a blocker in these examples --- we implement STM mostly without locks, and locks are acquired rarely. Running independent code without getting STM conflicts means that each thread will in practice only acquire its own lock. And a single global lock is used for major GC --- but there, the large amount of work done means that using the Haswell xacquire/xrelease hints is just counterproductive.
"Resulting scaling from 1 to 4 threads" doesn't mean anything, as in some examples it scales perfectly, and in other examples it doesn't scale at all (as expected).
All your arguments are valid, and I didn't really expect much from hinting, just decided to try. It would seem that Haswell is still inching towards higher multicore scalability - probably thanks to improved atomic and fence ops in general. It's a benefit for those workloads that should conceptually scale well...
You really need to go above 4 threads: 8, 16, 32, and 64 at least. Then plot the overhead of the STM relative to this level of threading. If your benchmark is too small, alter it so that it makes sense to try to solve it with 64 threads.
@glen: we're focusing right now on the machines we have, which are standard Intels with 4, 8, or at most 12 cores. I believe it is interesting too, and it's what people have right now in their own desktop or laptop computers. Obviously the scalability to larger numbers of cores is important as well, but we can't simply disregard any result involving fewer than 64 cores.
PyPy 2.1 beta
We're pleased to announce the first beta of the upcoming 2.1 release of PyPy. This beta contains many bugfixes and improvements, including numerous improvements to the numpy-in-PyPy effort. The main feature is that the ARM processor support is no longer considered alpha level.
We would like to thank the Raspberry Pi Foundation for supporting the work to finish PyPy's ARM support.
You can download the PyPy 2.1 beta release here:
https://pypy.org/download.html
Highlights
- Bugfixes to the ARM JIT backend, so that ARM is now an officially supported processor architecture
- Stacklet support on ARM
- Interpreter improvements
- Various numpy improvements
- Bugfixes to cffi and ctypes
- Bugfixes to the stacklet support
- Improved logging performance
- Faster sets for objects
What is PyPy?
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7.3. It's fast due to its integrated tracing JIT compiler. This release supports x86 machines running Linux 32/64, Mac OS X 64 or Windows 32. Also this release supports ARM machines running Linux 32bit - anything with ARMv6 (like the Raspberry Pi) or ARMv7 (like Beagleboard, Chromebook, Cubieboard, etc.) that supports VFPv3 should work. Both hard-float armhf/gnueabihf and soft-float armel/gnueabi builds are provided. The armhf builds for Raspbian are created using the Raspberry Pi custom cross-compilation toolchain based on gcc-arm-linux-gnueabihf and should work on ARMv6 and ARMv7 devices running Debian or Raspbian. The armel builds are built using the gcc-arm-linux-gnueabi toolchain provided by Ubuntu and currently target ARMv7.
Windows 64 work is still stalling; we would welcome a volunteer to handle that.
How to use PyPy?
We suggest using PyPy from a virtualenv. Once you have a virtualenv installed, you can follow instructions from the PyPy documentation on how to proceed. This document also covers other installation schemes.
Cheers,
the PyPy team.
*assembly
Thanks for the update; glad it's coming together! I'm really looking forward to seeing how it stacks up once the JIT work is complete.
Do you think that it'll be possible to ever get better than a 2x slowdown for serial operations? Or is that the minimal possible? Naively, it makes sense that it'll never be as fast, but if 1.5x or lower were possible, that would be very exciting.
Also, is the end goal that you would have a module you import to "turn on" STM? Or would it always be a separate build of pypy, just like JIT/JIT-less?
@Christopher: the slow-down we'll get is still unknown, but I fear it won't really go well under 2x.
I see it mainly as a separate build: either you want to run all these barrier instructions everywhere (which gives the slow-down) or not. It could be possible in theory to have a version that has the barriers everywhere, but creates JIT-generated assembler that doesn't, and thus runs almost as fast as a regular PyPy as long as you don't "turn on" STM. We will see if that makes sense.
@Anonymous: ah, thanks :-) I think I have now learned the difference between "assembler" and "assembly" in English, which was never quite clear to me. Note that in French the same word ("assembleur") is used for both.
@Armin: Ah, I see. Well, from a user's perspective, what I mostly write in Python these days is either GUI applications (for which I've never been able to use PyPy due to lack of bindings, but that's another issue entirely) or small services, for which PyPy has provided a rather nice speed improvement.
In a perfect world, I'd be able to use pypy for both of these tasks, not using STM for my GUI applications, but turning it on for the services I write (well, once they reach a certain point where I'd gain something from concurrency).
I suspect having a separate build would make such a use-case awkward.
Also, my interest is a bit self-motivated; at work we currently use node.js for a lot of our services. PyPy compares decently for a lot of our tasks, but it's not 'clearly better'. Once STM is stable, however, several of our services that we've struggled to scale to multiple cores on node.js could be rewritten in PyPy STM, and should scale much more easily. (Manual process management is painful!)
Again, if PyPy STM were a separate build, we'd have to manage having both installed in the case where we have some servers running services that need concurrency and others that work well enough with a very fast async implementation. Not impossible, just a bit awkward. :)
Either way, I'm pretty excited!
Are there any plans or experiments going on related to Hardware Transactional Memory?
@Ignacio Hernandez: for HTM, our position is still as described last year in: https://morepypy.blogspot.com/2012/08/multicore-programming-in-pypy-and.html