Faster, more memory efficient and more ordered dictionaries on PyPy
Hello everyone!
As of today, we have merged the latest branch that brings better dictionaries to PyPy by default. The work is based on an idea by Raymond Hettinger on python-dev, with prior work done notably in Java. It was done by Maciej Fijałkowski and Armin Rigo, with Laurence Tratt recently prodding us to finish it. (Earlier work going in a similar direction includes Alex Gaynor's work on ordered dicts in Topaz, which was also used in the Hippy VM. Each of these pieces of work is itself based on the original dict implementation in RPython, whose origins fade in the Subversion prehistory of PyPy.) Coincidentally, a very similar idea has been implemented very recently in Zend PHP (see the Zend implementation description).
This post covers the basics of design and implementation as well as some basic benchmarks.
Dictionaries are now ordered!
One surprising part is that the new design, besides being more memory efficient, is ordered by design: it preserves the insertion order. This is not forbidden by the Python language, which allows any order. It makes the collections.OrderedDict subclass much faster than before: it is now a thin subclass of dict. Obviously, we recommend that any portable Python program continues to use OrderedDict when ordering is important. Note that a non-portable program might rely on more: for example, a **keywords argument now receives the keywords in the same order as they were given in the call. (Whether such a thing might be called a language design change or not is a bit borderline.) The point is that Python programs that work on CPython or previous versions of PyPy should continue to work on PyPy.
There is one exception, though. The iterators of the OrderedDict subclass are now working just like the ones of the dict builtin: they will raise RuntimeError when iterating if the dictionary was modified. In the CPython design, the class OrderedDict explicitly doesn't worry about that, and instead you get some result that might range from correct to incorrect to crashes (i.e. random Python exceptions).
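Both points can be seen in a small session. This is an illustrative sketch: the ordering behaviour holds on PyPy with the new dicts (and, for what it's worth, on CPython 3.7+ as well), and the RuntimeError is the standard "dictionary changed size during iteration" error:

```python
# Plain dicts preserve insertion order with the new design...
d = {}
for key in "abc":
    d[key] = key.upper()
assert list(d) == ["a", "b", "c"]   # keys come back in insertion order

# ...and mutating a dict while iterating over it raises RuntimeError,
# which now also applies to the OrderedDict subclass.
raised = False
try:
    for key in d:
        d["new"] = 1                # changes the size during iteration
except RuntimeError:
    raised = True
assert raised
```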
Original PyPy dictionary design
Originally, PyPy dictionaries, as well as CPython dictionaries, were implemented as follows (simplified view):
```
struct dict {
    long num_items;
    dict_entry *items;   /* pointer to array */
};

struct dict_entry {
    long hash;
    PyObject *key;
    PyObject *value;
};
```
Here, items is a sparse array, with 1/3 to 1/2 of the items being NULL. The average space occupied per item is 3 * WORD * 12/7 plus some small constant (the smallest dict has 8 entries, taking 8 * 3 * WORD + 2 * WORD = 26 WORDs).
New PyPy dictionary design
The new PyPy dictionary is split in two arrays:
```
struct dict {
    long num_items;
    variable_int *sparse_array;
    dict_entry *compact_array;
};

struct dict_entry {
    long hash;
    PyObject *key;
    PyObject *value;
};
```
Here, compact_array stores all the items in order of insertion, while sparse_array is a 1/2 to 2/3 full array of integers. The integers themselves are of the smallest size necessary for indexing the compact_array. So if compact_array has fewer than 256 items, then sparse_array will be made of bytes; if fewer than 2^16, it'll be two-byte integers; and so on.
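As a sketch, the rule for picking the index width could look like this (a hypothetical helper, not PyPy's actual code):

```python
# Hypothetical sketch of how the element size of sparse_array could be
# chosen from the number of entries in compact_array.
def index_width_in_bytes(num_entries):
    if num_entries < 2**8:
        return 1        # byte-sized indexes
    elif num_entries < 2**16:
        return 2        # two-byte integers
    elif num_entries < 2**32:
        return 4
    else:
        return 8
```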
This design saves quite a bit of memory. For example, on 64-bit systems we could need indexes of more than 4 billion elements, but almost never do; and for small dicts, the extra sparse_array takes very little space. For example, a 100-element dict would take on average, with the original design on 64-bit, 100 * 12/7 * 3 * WORD =~ 4100 bytes, while with the new design it takes 100 * 12/7 + 3 * WORD * 100 =~ 2600 bytes, quite a significant saving.
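To make the two-array layout concrete, here is a simplified, illustrative Python sketch. This is hypothetical code for explanation only, not PyPy's actual RPython implementation: the real sparse array stores variable-width integers, deletion is handled differently, and collision handling is more sophisticated.

```python
# Simplified sketch of the two-array dict design: a sparse hash table of
# small integer indexes pointing into a compact, insertion-ordered array.
FREE = -1

class CompactDict(object):
    def __init__(self):
        self.sparse = [FREE] * 8   # hash table containing indexes
        self.compact = []          # (hash, key, value) in insertion order

    def _probe(self, key):
        # Find the sparse-array slot for `key` (simple linear probing).
        mask = len(self.sparse) - 1
        i = hash(key) & mask
        while True:
            idx = self.sparse[i]
            if idx == FREE or self.compact[idx][1] == key:
                return i
            i = (i + 1) & mask

    def __setitem__(self, key, value):
        i = self._probe(key)
        if self.sparse[i] == FREE:
            self.sparse[i] = len(self.compact)
            self.compact.append((hash(key), key, value))
            # Keep the sparse array between 1/2 and 2/3 full.
            if 3 * len(self.compact) >= 2 * len(self.sparse):
                self._grow()
        else:
            self.compact[self.sparse[i]] = (hash(key), key, value)

    def __getitem__(self, key):
        idx = self.sparse[self._probe(key)]
        if idx == FREE:
            raise KeyError(key)
        return self.compact[idx][2]

    def _grow(self):
        # Double the sparse array and reindex; compact stays untouched.
        self.sparse = [FREE] * (2 * len(self.sparse))
        mask = len(self.sparse) - 1
        for n, (h, _key, _value) in enumerate(self.compact):
            i = h & mask
            while self.sparse[i] != FREE:
                i = (i + 1) & mask
            self.sparse[i] = n

    def keys(self):
        # Iteration order is simply insertion order.
        return [key for (_h, key, _value) in self.compact]
```

Note how growing the table only rebuilds the small integer array; the entries themselves never move, which is what preserves insertion order for free.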
GC friendliness
The obvious benefit of having more compact dictionaries is an increased cache friendliness. In modern CPUs cache misses are much more costly than doing additional simple work, like having an additional level of (in-cache) indirection. Additionally, there is a GC benefit coming from it. When doing a minor collection, the GC has to visit all the GC fields in old objects that can point to young objects. In the case of large arrays, this can prove problematic since the array grows and with each minor collection we need to visit more and more GC pointers. In order to avoid it, large arrays in PyPy employ a technique called "card marking" where the GC only visits "cards" or subsets of arrays that were modified between collections. The problem with dictionaries was that by design modifications in a dictionary occur randomly, hence a lot of cards used to get invalidated. In the new design, however, new items are typically appended to the compact_array, hence invalidate much fewer cards --- which improves GC performance. (The new sparse_array is an array of integers, so it does not suffer from the same problems.)
Deletion
Deleting entries from dictionaries is not very common, but important in a few use cases. To preserve order, when we delete an entry, we mark the entry as removed but don't otherwise shuffle the remaining entries. If we repeat this operation often enough, there will be a lot of removed entries in the (originally compact) array. At this point, we need to do a "packing" operation, which moves all live entries to the start of the array (and then reindexes the sparse array, as the positions changed). This works well, but there are use cases where previously no reindexing was ever needed, so it makes these cases a bit slower (for example when repeatedly adding and removing keys in equal number).
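The mark-then-pack scheme described above can be sketched as follows. This is hypothetical helper code for illustration, not PyPy's implementation; the packing threshold here (half the array dead) is made up.

```python
# Entries are only *marked* as removed; once enough of the array is
# dead, a "packing" pass compacts the surviving entries to the front.
DEAD = object()   # tombstone marker for removed entries

def delete_entry(compact, live_count, position):
    """Mark the entry at `position` as removed; pack when needed."""
    compact[position] = DEAD
    live_count -= 1
    if 2 * live_count <= len(compact):      # at least half are tombstones
        compact[:] = [e for e in compact if e is not DEAD]
        # At this point the sparse index array would have to be rebuilt,
        # since the surviving entries have moved to new positions.
    return live_count

compact = ["a", "b", "c", "d"]
n = delete_entry(compact, 4, 1)     # mark "b" dead: no packing yet
n = delete_entry(compact, n, 0)     # mark "a" dead: triggers packing
```

Packing preserves the relative order of the surviving entries, which is why deletion does not break the ordering guarantee.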
Benchmarks
The PyPy speed benchmarks show mostly small effects; see the changes. The microbenchmarks that we did show large improvements on large and very large dictionaries (in particular, building dictionaries of at least a couple of hundred items is now twice as fast) and break even on small ones (between 20% slower and 20% faster, depending very much on the usage patterns and sizes of dictionaries). The new dictionaries enable various optimization possibilities which we're going to explore in the near future.
Cheers,
fijal, arigo and the PyPy team
Leysin Winter Sprint (20-28th February 2015)
The next PyPy sprint will be in Leysin, Switzerland, for the tenth time. This is a fully public sprint: newcomers and topics other than those proposed below are welcome.
Goals and topics of the sprint
The details depend on who is here and ready to work. We might touch topics such as:
- cleaning up the optimization step in the JIT, changing the register allocation done by the JIT's backend, or improving the warm-up time
- STM (Software Transactional Memory), notably: try to come up with benchmarks, and measure them carefully in order to test and improve the conflict reporting tools, and more generally to figure out how practical it is in large projects to avoid conflicts
- work on vmprof, a statistical profiler for CPython and PyPy, including making it more user-friendly
- Py3k (Python 3.x support), NumPyPy (the numpy module)
- added: cffi 1.0, trying out pygame+cffi on Raspberry Pi devices
- And as usual, the main side goal is to have fun in winter sports :-) We can take a day off for skiing.
Exact times
For a change, and as an attempt to simplify things, I specified the dates as 20-28 February 2015, where 20 and 28 are travel days. We will work full days between the 21 and the 27. You are of course allowed to show up for a part of that time only, too.
Location and Accommodation
Leysin, Switzerland, "same place as before". Let me refresh your memory: both the sprint venue and the lodging will be in a very spacious pair of chalets built specifically for bed & breakfast: Ermina. The place has a good ADSL Internet connection with wireless installed. You can of course arrange your own lodging anywhere (as long as you are in Leysin, you cannot be more than a 15 minutes walk away from the sprint venue), but I definitely recommend lodging there too -- you won't find a better view anywhere else (though you probably won't easily find a much worse one, either :-)
Please confirm that you are coming so that we can adjust the reservations as appropriate. In the past, the rates were around 60 CHF a night all included in 2-person rooms, with breakfast. Now, the rooms available are either single-person (or couple), or rooms for 3 persons. The latter choice is recommended and should be under 60 CHF per person.
Please register by Mercurial, or on the pypy-dev mailing list if you do not yet have check-in rights.
You need a Swiss-to-(insert country here) power adapter. There will be some Swiss-to-EU adapters around, and at least one EU-format power strip.
Hi,
During this sprint, is it planned to work on the yield from syntax, or, more generally, Python 3.3 support?
I'm very interested in testing PyPy with AsyncIO.
Regards
@Ludovic, we don't have precise plans. If there is someone also interested in Python 3, then yes, this kind of work would be nice. (Note that I see some tests about "yield from" in the py3.3 branch, which may mean that it was implemented already.)
September donations and thank you to the Python Software Foundation!
Hello everyone!
We would like to show you a short update on the PyPy funding. We gathered a total of $15,986 in the month of September and as per earlier agreement, the Python Software Foundation donated $10,000 to PyPy. We would like to thank everyone participating and the PSF in particular for supporting the PyPy project and making our work possible!
We've been working hard on the goals outlined in the funding proposals.
- PyPy Python 3 support has been in beta for a while and it's already being used by many people, as seen from the number of reported bugs. We currently support 3.2 and plan on moving towards 3.4 in the future.
- Software Transactional Memory has been a successful research project, with first real world results shown during the Warsaw sprint.
- More detailed update on numpy will be published soon. A little spoiler is that we're planning on addressing matplotlib, scipy and the larger ecosystem to some extent. Stay tuned!
Again, thanks to everyone who donated and happy Thanksgiving to everyone on that side of the world!
Cheers,
fijal and the entire PyPy team
Fantastic work!
I'm a Python 3 user, so PyPy3 is great for me!
And good news about Numpypy: it would indeed be awesome to have the numeric ecosystem supported.
This is great news! I've been waiting for scipy and matplotlib for a while, now it's finally on the roadmap.
Tornado without a GIL on PyPy STM
This post is by Konstantin Lopuhin, who tried PyPy STM during the Warsaw sprint.
Python has a GIL, right? Not quite: PyPy STM is a Python implementation without a GIL, so it can scale CPU-bound work to several cores. PyPy STM is developed by Armin Rigo and Remi Meier, and supported by community donations. You can read more about it in the docs.
Although PyPy STM is still a work in progress, in many cases it can already run CPU-bound code faster than regular PyPy when using multiple cores. Here we will see how to slightly modify the Tornado IO loop to use the transaction module. This module is described in the docs and is really simple to use; please see the example there. The event loop of Tornado, or any other asynchronous web server, looks like this (with some simplifications):
```
while True:
    for callback in list(self._callbacks):
        self._run_callback(callback)
    event_pairs = self._impl.poll()
    self._events.update(event_pairs)
    while self._events:
        fd, events = self._events.popitem()
        handler = self._handlers[fd]
        self._handle_event(fd, handler, events)
```
We get IO events, and run handlers for all of them; these handlers can also register new callbacks, which we run too. When using such a framework, it is very nice to have a guarantee that all handlers are run serially, so you do not have to use any locks. This is an ideal case for the transaction module: it guarantees that things appear to run serially, so in user code we do not need any locks. We just need to change the code above to something like:
```
while True:
    for callback in list(self._callbacks):
        transaction.add(                      # added
            self._run_callback, callback)
    transaction.run()                         # added
    event_pairs = self._impl.poll()
    self._events.update(event_pairs)
    while self._events:
        fd, events = self._events.popitem()
        handler = self._handlers[fd]
        transaction.add(                      # added
            self._handle_event, fd, handler, events)
    transaction.run()                         # added
```
The actual commit is here; we had to extract a little function to run the callback.
Part 1: a simple benchmark: primes
Now we need a simple benchmark. Let's start with this: just calculate a list of primes up to a given number, and return it as JSON:
```
def is_prime(n):
    for i in xrange(2, n):
        if n % i == 0:
            return False
    return True

class MainHandler(tornado.web.RequestHandler):
    def get(self, num):
        num = int(num)
        primes = [n for n in xrange(2, num + 1) if is_prime(n)]
        self.write({'primes': primes})
```
We can benchmark it with siege:
```
siege -c 50 -t 20s http://localhost:8888/10000
```
But this does not scale. The CPU load is at 101-104%, and we handle 30% fewer requests per second. The reason for the slowdown is STM overhead, which needs to keep track of all writes and reads in order to detect conflicts. And the reason for using only one core is, obviously, conflicts! Fortunately, we can see what these conflicts are, if we run code like this (here 4 is the number of cores to use):
```
PYPYSTM=stm.log ./primes.py 4
```
Then we can use print_stm_log.py to analyse this log. It lists the most expensive conflicts:
```
14.793s lost in aborts, 0.000s paused (1258x STM_CONTENTION_INEVITABLE)
File "/home/ubuntu/tornado-stm/tornado/tornado/httpserver.py", line 455, in __init__
    self._start_time = time.time()
File "/home/ubuntu/tornado-stm/tornado/tornado/httpserver.py", line 455, in __init__
    self._start_time = time.time()
...
```
There are only three kinds of conflicts; they are described in the stm source. Here we see that two threads call into an external function to get the current time, and we cannot roll back either of them, so one of them must wait until the other transaction finishes. For now we can hack around this by disabling this timing; it is only needed for internal profiling in tornado.
If we do it, we get the following results (but see caveats below):
As we can see, in this benchmark PyPy STM using just two cores can beat regular PyPy! This is not linear scaling, there are still conflicts left, and this is a very simple example but still, it works!
But it's not that simple yet :)
First, these are best-case numbers after long (much longer than for regular PyPy) warmup. Second, it can sometimes crash (although removing old pyc files fixes it). Third, benchmark meta-parameters are also tuned.
Here we get relatively good results only when there are a lot of concurrent clients: as a result, a lot of requests pile up, the server cannot keep up with the load, and the transaction module is kept busy running these piled-up requests. If we decrease the number of concurrent clients, the results get slightly worse. Another thing we can tune is how heavy each request is: again, if we ask for primes up to a lower number, then less time is spent doing calculations and more time is spent in tornado, and the results get much worse.
Besides the time.time() conflict described above, there are a lot of others. The bulk of time is lost in these two conflicts:
```
14.153s lost in aborts, 0.000s paused (270x STM_CONTENTION_INEVITABLE)
File "/home/ubuntu/tornado-stm/tornado/tornado/web.py", line 1082, in compute_etag
    hasher = hashlib.sha1()
File "/home/ubuntu/tornado-stm/tornado/tornado/web.py", line 1082, in compute_etag
    hasher = hashlib.sha1()

13.484s lost in aborts, 0.000s paused (130x STM_CONTENTION_WRITE_READ)
File "/home/ubuntu/pypy/lib_pypy/transaction.py", line 164, in _run_thread
    got_exception)
```
The first one presumably comes from calling into some C function from the stdlib, so we get the same kind of conflict as for time.time() above; but it can be fixed on the PyPy side, as we can be sure that computing a sha1 is pure.
It is easy to hack around this one too, by just removing etag support; but if we do so, performance is much worse, only slightly faster than regular PyPy, with the top conflict being:
```
83.066s lost in aborts, 0.000s paused (459x STM_CONTENTION_WRITE_WRITE)
File "/home/arigo/hg/pypy/stmgc-c7/lib-python/2.7/_weakrefset.py", line 70, in __contains__
File "/home/arigo/hg/pypy/stmgc-c7/lib-python/2.7/_weakrefset.py", line 70, in __contains__
```
Comment by Armin: So far it is unclear why this happens. We'll investigate...
The second conflict (without etag tweaks) originates in the transaction module, from this piece of code:
```
while True:
    self._do_it(self._grab_next_thing_to_do(tloc_pending),
                got_exception)
    counter[0] += 1
```
Comment by Armin: This is a conflict in the transaction module itself; ideally, it shouldn't have any, but in order to do that we might need a little bit of support from RPython or C code. So this is pending improvement.
The Tornado modification used in this blog post is based on 3.2.dev2. As of now, the latest version is 4.0.2, and if we apply the same changes to that version, then we no longer get any scaling on this benchmark, and there are no conflicts that take any substantial time.
Comment by Armin: There are two possible reactions to a conflict. We can either abort one of the two threads, or (depending on the circumstances) just pause the current thread until the other one commits, after which the thread will likely be able to continue. The tool ``print_stm_log.py`` did not report conflicts that cause pauses. It has been fixed very recently. Chances are that on this test it would report long pauses and point to locations that cause them.
Part 2: a more interesting benchmark: A-star
Although we have seen that PyPy STM is not all moonlight and roses, it is interesting to see how it works on a more realistic application.
astar.py is a simple game where several players move on a map (represented as a list of lists of integers), build and destroy walls, and ask the server to give them the shortest paths between two points using A-star search, adapted from an ActiveState recipe.
The benchmark bench_astar.py simulates players and tries to put the main load on the A-star search, but it also does some wall building and destruction. There are no locks around map modifications, as normal tornado executes all callbacks serially, and we can keep this guarantee with the atomic blocks of PyPy STM. This is also an example of a program that is not trivial to scale to multiple cores with separate processes (assuming more interesting shared state and logic).
This benchmark is very noisy due to the randomness of client interactions (it may also be nonlinear), so only lower and upper bounds for the number of requests are reported:
| Impl. | req/s |
|---|---|
| PyPy 2.4 | 5 .. 7 |
| CPython 2.7 | 0.5 .. 0.9 |
| PyPy-STM 1 | 2 .. 4 |
| PyPy-STM 4 | 2 .. 6 |
Clearly this is a very bad benchmark, but we can still see that scaling is worse and the STM overhead is sometimes higher. The bulk of the conflicts come from the transaction module (we have seen it above):
```
91.655s lost in aborts, 0.000s paused (249x STM_CONTENTION_WRITE_READ)
File "/home/ubuntu/pypy/lib_pypy/transaction.py", line 164, in _run_thread
    got_exception)
```
Although it is definitely not ready for production use, you can already try to run things, report bugs, and see what is missing in user-facing tools and libraries.
Benchmarks setup:
- Amazon c3.xlarge (4 cores) running Ubuntu 14.04
- pypy-c-r74011-stm-jit for the primes benchmark (but it has more bugs than more recent versions), and pypy-c-r74378-74379-stm-jit for astar benchmark (put it inside pypy source checkout at 38c9afbd253c)
- https://bitbucket.org/kostialopuhin/tornado-stm-bench at 65144cda7a1f
"Clearly this is a very benchmark" - looks like you've missed a word here ;)
in bench_astar.py, you are doing the following queries:
- try to move: 85%
- build a wall: 10.5% [(1-.85)*.7]
- erase something: 0.45% [(1-.85)*(1-.7)*.1]
- show map: 4.05% [(1-.85)*(1-.7)*(1-.1)]
I doubt that's intentional.... :P
Correct me if I misunderstood the theory of PyPy-STM, but in the A* test is there nothing that prevents a get() from reading the game map while MapChangeHandler.put() is running (that is, while the system is in an incoherent state)?
Shouldn't MapChangeHandler.put() be wrapped in an exclusive write lock, and all the get() handlers be wrapped in a shared read lock?
> Clearly this is a very benchmark" - looks like you've missed a word here ;)
Oh, yes, that word is "bad" :)
> Shouldn't MapChangeHandler.put() be wrapped in a exclusive write lock, and all the get() handlers be wrapped with a shared read lock?
Here all request handlers are already wrapped inside atomic blocks, but this is hidden from us in the (modified) tornado. So we do not need any locks (as in normal tornado too, because normal tornado is single-threaded). If request handlers conflict, then we just lose performance, not correctness. This is one of the main points of PyPy STM: it can support multithreaded code without needing locks.
Regarding the probabilities: yes, that's not quite intentional :)
PyPy IO improvements
Hello everyone!
We've wrapped up the Warsaw sprint, so I would like to describe some branches which have been recently merged and which improved the I/O and the GC: gc_no_cleanup_nursery and gc-incminimark-pinning.
The first branch was started by Wenzhu Man for her Google Summer of Code and finished by Maciej Fijałkowski and Armin Rigo. The PyPy GC works by allocating new objects in the young object area (the nursery), simply by incrementing a pointer. After each minor collection, the nursery has to be cleaned up. For simplicity, the GC used to do it by zeroing the whole nursery.
This approach has bad effects on the cache, since you zero a large piece of memory at once and do unnecessary work for things that don't require zeroing, like large strings. We mitigated the first problem somewhat with incremental nursery zeroing, but this branch removes the zeroing completely, thus improving string handling and recursive code (since jitframes don't require zeroed memory either). I measured the effect on two examples: a recursive implementation of fibonacci, and gcbench to measure GC performance.
The results for fibonacci and gcbench are below (normalized to cpython 2.7). Benchmarks were run 50 times each (note that the big standard deviation comes mostly from the warmup at the beginning, true figures are smaller):
| benchmark | CPython | PyPy 2.4 | PyPy non-zero |
|---|---|---|---|
| fibonacci | 4.8+-0.15 (1.0x) | 0.59+-0.07 (8.1x) | 0.45+-0.07 (10.6x) |
| gcbench | 22+-0.36 (1.0x) | 1.34+-0.28 (16.4x) | 1.02+-0.15 (21.6x) |
The second branch was done by Gregor Wegberg for his master thesis and finished by Maciej Fijałkowski and Armin Rigo. Because of the way it works, the PyPy GC from time to time moves the objects in memory, meaning that their address can change. Therefore, if you want to pass pointers to some external C function (for example, write(2) or read(2)), you need to ensure that the objects they are pointing to will not be moved by the GC (e.g. when running a different thread). PyPy up to 2.4 solves the problem by copying the data into or from a non-movable buffer, which is obviously inefficient. The branch introduces the concept of "pinning", which allows us to inform the GC that it is not allowed to move a certain object for a short period of time. This introduces a bit of extra complexity in the garbage collector, but improves the I/O performance quite drastically, because we no longer need the extra copy to and from the non-movable buffers.
In this benchmark, which does I/O in a loop, we either write a number of bytes from a freshly allocated string into /dev/null or read a number of bytes from /dev/full. I'm showing the results for PyPy 2.4, PyPy with non-zero-nursery and PyPy with non-zero-nursery and object pinning. Those are wall times for cases using os.read/os.write and file.read/file.write, normalized against CPython 2.7.
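In outline, the os.write variant of such a loop might look like the following sketch. This is not the actual benchmark script; the parameter values and function name are made up for illustration. The key point is that the payload is a fresh allocation on every iteration, which is exactly the case where pinning (rather than copying to a non-movable buffer) pays off.

```python
import os
import time

def bench_os_write(num_writes, payload_size):
    # Each iteration writes a freshly allocated string to /dev/null, so
    # the GC would normally have to copy it to a non-movable buffer (or,
    # with the new branch, just pin it) before the write(2) call.
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        start = time.time()
        for _ in range(num_writes):
            os.write(fd, b" " * payload_size)  # fresh buffer every time
        return time.time() - start
    finally:
        os.close(fd)

elapsed = bench_os_write(1000, 10000)
```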
Benchmarks were done using PyPy 2.4 and revisions 85646d1d07fb for non-zero-nursery and 3d8fe96dc4d9 for non-zero-nursery and pinning. The benchmarks were run once, since the standard deviation was small.
The Y axis is speed, normalized to CPython, the more the better
What we can see is that os.read and os.write both improved greatly and now outperform CPython in each combination. File operations are a little more tricky: while those branches improved the situation a bit, the improvement is not as drastic as for the os versions. It really should not be the case, and it showcases how our file buffering is inferior to CPython's. We plan on removing our own buffering and using FILE* in C in the near future, so we should outperform CPython on those too (since our allocations are cheaper). If you look carefully at the benchmark, the write function is copied three times. This hack is intended to avoid the JIT overspecializing the assembler code, which happens because the buffering code was written way before the JIT was done. In fact, our buffering is hilariously bad, but if the stars align correctly it can be JIT-compiled to something that's not half bad. Try removing the hack and seeing how the performance of the last benchmark drops :-) Again, this hack should be absolutely unnecessary once we remove our own buffering, so stay tuned for more.
Cheers,
fijal
Sounds great!!!
Just wondering, will pinned memory also improve the situation when passing strings/other buffers to C functions (e.g. via cffi)?
Hey,
In your benchmark, the following loop:

```
for i in range(num):
    os.write(fd, " " * num2)
```

is not hoisted out by CPython (whereas I guess PyPy does hoist it). This means the buffer written is basically allocated/freed on each iteration. If you want to measure pure I/O performance (say, a zero-copy setting), the allocation should be hoisted manually out of the loop for CPython, like this:

```
payload = b" " * num2
for i in range(num):
    os.write(fd, payload)
```
Then, the results go from:

```
fwrite 100 bytes, 1.93us per write
fwrite 1000 bytes, 2.57us per write
fwrite 10000 bytes, 6.73us per write
file_write 100 bytes, 0.99us per write
file_write 1000 bytes, 1.68us per write
file_write 10000 bytes, 4.71us per write
```

to:

```
fwrite 100 bytes, 1.38us per write
fwrite 1000 bytes, 1.48us per write
fwrite 10000 bytes, 1.38us per write
file_write 100 bytes, 0.65us per write
file_write 1000 bytes, 0.96us per write
file_write 10000 bytes, 2.32us per write
```
Also, it might be worth trying with binary mode.
Anyway, keep up the great work!
PyPy does not hoist the buffer allocation here. The benchmark specifically allocates/frees the buffer on every iteration, since we want the object written to be fresh (otherwise pinning is not needed); we also think that writing a new object (as opposed to a constant buffer) is really the more common case. Yes, the allocation overhead is measured too, but the point is that we wanted to measure the I/O of fresh objects, not old ones.
PyPy3 2.4.0 released
This release contains several bugfixes and enhancements. Among the user-facing improvements specific to PyPy3:
- Better Windows compatibility, e.g. the nt module functions _getfinalpathname & _getfileinformation are now supported (the former is required for the popular pathlib library for example)
- Various fsencode PEP 383 related fixes to the posix module (readlink, uname, ttyname and ctermid) and improved locale handling
- Switched the default binary name on POSIX distributions from 'pypy' to 'pypy3' (which symlinks to 'pypy3.2')
- Fixed a couple different crashes related to parsing Python 3 source code
And improvements shared with the recent PyPy 2.4.0 release:
- internal refactoring in string and GIL handling which led to significant speedups
- improved handling of multiple objects (like sockets) in long-running programs. They are collected and released more efficiently, reducing memory use. In simpler terms - we closed what looked like a memory leak
- Windows builds now link statically to zlib, expat, bzip, and openssl-1.0.1i
- Many issues were resolved since the 2.3.1 release in June
You can download PyPy3 2.4.0 here https://pypy.org/download.html.
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 and 3.2.5. It's fast (pypy 2.4 and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows, and OpenBSD, as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
We would like to thank our donors for the continued support of the PyPy project.
The complete release notice is here.
Please try it out and let us know what you think. We especially welcome success stories, please tell us about how it has helped you!
Cheers, The PyPy Team
That’s great - thanks!
And the portable release directly works for my keyboard evolution! (it’s roughly 2.5x faster than cPython).
Couchbase contribution to PyPy
Hello everyone!
We always offer to put on the blog info about our sponsors who donate substantial amounts of money. So far most people decided to stay anonymous, so this is the first blog post describing our sponsor and his relationship to PyPy, hopefully not the last. We'll also publish a full blog post about the PSF-matched fundraiser soon. This is a guest post by Brent Woodruff from Couchbase.
Couchbase is a leading NoSQL document database that provides a flexible data model, high performance, scalability, and high availability. Couchbase is a commercially supported open source project. Visit us at https://www.couchbase.com and https://github.com/couchbase.
Couchbase Inc. donated $2000.00, and employees of Couchbase personally contributed a disclosed additional $230.00, towards PyPy progress during the September funding drive. These funds will see a match from the Python Software Foundation.
PyPy is primarily used by Couchbase employees to perform product analysis and troubleshooting using internally developed tools. Every customer of Couchbase benefits from the use of PyPy, both due to the rapid development provided by Python and the speed of the resulting tools provided by the PyPy JIT.
“PyPy is great - it gave us a 4x speedup in our CPU-intensive internal application over CPython” -Dave Rigby and Daniel Owen, Couchbase Engineers
Additionally, Couchbase has a preliminary CFFI-based Couchbase client available for PyPy users.
Definitely wouldn't have thought to put PyPy and Couchbase in the same sentence, but this is very good of them! Glad to see the support.
Thanks for the donation. Could you give a bit more detail of how hard it was to make your code compatible with PyPy?
Hello from Couchbase. With regards to making our code compatible with PyPy, I can only comment on our internal tooling. Those are currently all pure Python, so it was trivial. We used modules that work with PyPy already: namely pyparsing, LEPL, and tornado. The tools all run under both CPython and PyPy unmodified.
PyPy 2.4.0 released, 9 days left in funding drive
This release contains several bugfixes and enhancements. Among the user-facing improvements:
- internal refactoring in string and GIL handling which led to significant speedups
- improved handling of multiple objects (like sockets) in long-running programs. They are collected and released more efficiently, reducing memory use. In simpler terms - we closed what looked like a memory leak
- Windows builds now link statically to zlib, expat, bzip, and openssl-1.0.1i
- Many issues were resolved since the 2.3.1 release in June
You can download PyPy 2.4.0 here https://pypy.org/download.html.
We would like to also point out that in September, the Python Software Foundation will match funds for any donations up to $10k, so head over to our website and help this mostly-volunteer effort out.
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 and 3.2.5. It's fast (pypy 2.4 and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows, and OpenBSD, as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
We would like to thank our donors for the continued support of the PyPy project.
The complete release notice is here.
Please try it out and let us know what you think. We especially welcome success stories, please tell us about how it has helped you!
Cheers, The PyPy Team
PyPy 2.4-beta just in time for PSF's funding drive
This release contains several bugfixes and enhancements. Among the user-facing improvements:
- internal refactoring in string and GIL handling which led to significant speedups
- improved handling of multiple objects (like sockets) in long-running programs. They are collected and released more efficiently, reducing memory use. In simpler terms - we closed what looked like a memory leak
- Windows builds now link statically to zlib, expat, bzip, and openssl-1.0.1i
- Many issues were resolved since the 2.3.1 release in June
You can download the PyPy 2.4-beta1 release here https://pypy.org/download.html.
We would also like to point out that in September, the Python Software Foundation will match funds for any donations up to $10k, so head over to our website and help this mostly-volunteer effort out.
PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 and 3.2.5. It's fast (pypy 2.4 and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
This release supports x86 machines running Linux 32/64, Mac OS X 64, Windows, and OpenBSD, as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
We would like to thank our donors for the continued support of the PyPy project.
The complete release notice is here.
Please try it out and let us know what you think. We especially welcome success stories, please tell us about how it has helped you!
Cheers, The PyPy Team
News Flash from the beta release cycle:
- Note that the beta release mistakenly identifies itself in sys.pypy_version_info as releaselevel=='final'; please do not mistake it for a final version
- The beta can hit an "Illegal instruction" exception in jitted code on ARMv6 processors like the Raspberry Pi. This will be fixed for the release.
Short testing note:
./pypy: error while loading shared libraries: libtinfo.so.5: cannot open shared object file: No such file or directory
64-bit Linux version tested on Gentoo GNU/Linux.
Is there a chance to get pylab/matplotlib running with pypy as you showed in 2011?
I just had a very, very good experience with PyPy:
https://github.com/ArneBab/freenet-fundraising/blob/master/slides.org#routing-simulation
With CPython it needs a day for a simulation with 100k nodes; with PyPy it needs a few minutes!
Python Software Foundation Matching Donations this Month
We're extremely excited to announce that for the month of September, any amount
you donate to PyPy will be matched (up to $10,000) by the Python Software
Foundation.
This includes any of our ongoing fundraisers: NumPyPy, STM, Python3, or our
general fundraising.
Here are some of the things your previous donations have helped accomplish:
- Getting PyPy3 completed (currently 3.2, with 3.3 work underway)
- New research and production engineering on STM for PyPy
- Lots of progress on NumPy for PyPy
- Significant performance improvements
You can see a preview of what's coming in our next 2.4 release in the draft
release notes.
Thank you to all the individuals and companies which have donated so far.
So please, donate today: https://pypy.org/
(Please be aware that the donation progress bars are not live updating, so
don't be afraid if your donation doesn't show up immediately).
I think you should be careful about your claims for numpy. It's a great idea and I am sure lots of people would be very interested in anything you do, but I for one see very little progress on it.
Consider me another request for a Bitcoin address. I'm in for a few millibits if you provide one.
This is outstanding work, PyPy team. Keep on keeping on!
Fantastic!
https://pypy.org/performance.html states that large dicts are a weakness of PyPy -- is that still the case overall, or is this work sufficient to favour PyPy over CPython for large dict work in general?
Wilfred - With the ordered dict changes that bullet item is no longer true.
Awesome work and thanks. PyPy would be ahead of the game if PEP 468 were accepted.
How is deleting an element implemented? It sounds like it would take O(n) work to remove an element from the middle of the compact array.
JSZ: the array gets holes. If a lot of items are deleted it can no longer be called "compact", but if it becomes too sparse it is recompacted and rehashed.
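The hole-punching deletion and recompaction described above can be sketched in plain Python. This is an illustrative toy, not PyPy's actual RPython code: collision handling is simplified to linear probing over the index table, and the thresholds chosen here are assumptions for the sketch.

```python
FREE = -1    # index slot never used
DUMMY = -2   # index slot whose entry has been deleted (tombstone)

class CompactDict:
    """Toy split-table dict: a sparse index table plus a dense,
    insertion-ordered entries array that may contain holes."""

    def __init__(self):
        self.indices = [FREE] * 8   # sparse table of small integers
        self.entries = []           # dense [hash, key, value] triples
        self.used = 0               # number of live (non-hole) entries

    def __len__(self):
        return self.used

    def _lookup(self, key):
        """Return (slot, entry); entry is None if the key is absent,
        in which case slot is a free index slot for insertion."""
        mask = len(self.indices) - 1
        i = hash(key) & mask
        while True:
            idx = self.indices[i]
            if idx == FREE:
                return i, None
            if idx >= 0 and self.entries[idx][1] == key:
                return i, self.entries[idx]
            i = (i + 1) & mask      # DUMMY or collision: keep probing

    def __getitem__(self, key):
        _, entry = self._lookup(key)
        if entry is None:
            raise KeyError(key)
        return entry[2]

    def __setitem__(self, key, value):
        slot, entry = self._lookup(key)
        if entry is not None:
            entry[2] = value
            return
        self.indices[slot] = len(self.entries)
        self.entries.append([hash(key), key, value])
        self.used += 1
        if len(self.entries) * 3 >= len(self.indices) * 2:
            self._resize()          # keep the index table < 2/3 full

    def __delitem__(self, key):
        slot, entry = self._lookup(key)
        if entry is None:
            raise KeyError(key)
        # O(1) deletion: punch a hole instead of shifting items down
        self.entries[self.indices[slot]] = None
        self.indices[slot] = DUMMY
        self.used -= 1
        if self.used * 2 < len(self.entries):
            self._resize()          # too sparse: recompact and rehash

    def _resize(self):
        live = [e for e in self.entries if e is not None]
        size = 8
        while size * 2 <= len(live) * 3:
            size *= 2
        self.indices = [FREE] * size
        self.entries = []
        for entry in live:          # reinsert in insertion order
            i = entry[0] & (size - 1)
            while self.indices[i] != FREE:
                i = (i + 1) & (size - 1)
            self.indices[i] = len(self.entries)
            self.entries.append(entry)

    def keys(self):
        return [e[1] for e in self.entries if e is not None]
```

Note that deleting a key costs O(1) (one hole), iteration simply skips holes while preserving insertion order, and a reinserted key lands at the end of the dense array.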
There are lots of things to like about this approach!
Did you find any problems with cache misses? With linear probing, the keys are accessed sequentially (cache friendly), but with this method the keys are accessed in random order.
@Anonymous: The old approach didn't use linear probing either, so in that regard nothing changed.
@carl - ah I see, that's interesting.
Well then, what about storing the hashes with the indices?
* Another chunk of memory saved. Only the lowest N bits need be stored that way instead of the full 64 bits. (Big assumption that rehashing on bit size change is ok)
* The nice thing is that the dense part (cache miss!) need only be accessed if the hash matches.
I think if I was doing this, I'd skip 8 bit indices and have 16 bit minimum so rehashing would be very rare.
Two problems with that:
- Since the hash functions can be written in Python, recomputing a hash from a key is potentially expensive.
- Why would you want to throw away bits from the hash? Comparing the full hashes as a first check to see whether equality has a chance to succeed is very useful. The equality function can again be written in Python, so it is potentially very slow.
@Anonymous: about starting at 16-bit instead of 8-bit: it doesn't give any benefit, because rehashing is needed anyway to grow the sparse table. As long as its size is at most 256, then there is no point in storing 16-bit numbers instead of 8-bit numbers. In theory we could store N-bit numbers for the optimal value of N (= 4, 6, 8, 10...) and pay only the cost of additional complexity for individual reads and writes, not for rehashing.
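The point about slot widths can be made concrete: the sparse table only needs to address the dense entries array, so its element width can grow with the table size. A sketch using the stdlib array module (the typecodes and the grow points are an illustration, not the RPython implementation):

```python
from array import array

def make_index_table(size):
    """Allocate a zeroed sparse index table of `size` slots, using the
    narrowest element width that can address `size` dense entries."""
    if size <= 2 ** 8:
        typecode = 'B'   # 1 byte per slot
    elif size <= 2 ** 16:
        typecode = 'H'   # 2 bytes per slot
    else:
        typecode = 'Q'   # 8 bytes per slot (skipping 32-bit for brevity)
    return array(typecode, [0]) * size
```

A 256-slot table thus costs 256 bytes rather than 2 KiB with word-sized slots; and since growing past 256 slots already forces a full rehash, widening the elements at that moment adds no extra rehashing, matching the reply above.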
Ah indeed. I am thinking of implementing this in C++ which has coloured my thoughts somewhat. In my case, key equality checks are for the most part cheap. Thus the size/compute tradeoffs may be a bit different.
Thanks for your thoughts.
Just curious, was there no slowdown from adding this extra level of indirection? For the case of accessing a random key from a cold dictionary, won't the lookup incur 2 cache misses now (one on each array), compared to just 1 for the original design?
@Durtin: there are certainly slow-downs in some cases. If the dictionary is cold, then indeed there is one extra cache miss. It seems to be quickly compensated, though, by the fact that if then you do a few more accesses to the same dict, you are likely to get less cache misses, simply because of the more compact layout. Also, the index array is often single bytes, so it can be fully in the cache very quickly.
Thank you for improving PyPy's performance and features. Your project and method are promising for addressing the weak aspects of dynamic languages. At the same time, PyPy should provide the simplicity of Python rather than diversity, where diversity is the reality but simplicity is the goal.
Making dictionaries ordered by default is part of that simplicity; in the same effort, I wish the features of "defaultdict" were integrated as methods and properties of the default basic dictionary.
Similarly, "deque" features (again, as methods and properties) could be integrated into the PyPy list datatype.
I often wonder why the Python team didn't integrate the features of these "collections" (as they say, "high-performance container datatypes") into the original basic datatypes; as we all know, everything in Python is an object, and I don't think doing things in diversity is the Pythonic way.
Anyhow, keep up your development and team spirit.
@Alhabshi3k: indeed, you're right in that "defaultdict" could be replaced with an alternate constructor of the regular dicts. I'm not sure why it is not so. For deques, it is maybe a question of performance, but particularly of underlying C-level memory layout: CPython can't easily add appendleft() and popleft() to regular lists while still keeping the same C API, notably PyList_GET_ITEM() and PySequence_Fast_ITEMS() --- though that is debatable.
We could support that in PyPy, but that is arguably more of a language change than just making dicts ordered with no new user-visible API.
You say for 100 elements, the new design's compact array uses 3 * WORD * 100 memory, right? So no extra capacity whatsoever? Then what do you do when I insert another element? Allocate a new array with 3 * WORD * 101 memory and copy all data there (and write the new element at the end)? That would be highly inefficient. So I don't believe you're honest about the memory usage.
The actual items are stored in a list which, like ordinary list objects, is slightly overallocated. Maybe the text in the blog post missed that; it should include a factor "k": the average is "100 * 12/7 + 3 * WORD * 100 * k" for an average value of k around 17/16. That's around 2700 instead of 2600.
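As a sanity check, the estimate works out as follows under the stated assumptions (WORD = 8 bytes on a 64-bit build, 100 entries, overallocation factor k of about 17/16):

```python
WORD = 8                          # bytes per machine word, 64-bit build
n = 100                           # number of entries
sparse = n * 12.0 / 7             # average index-table bytes per entry
k = 17.0 / 16                     # average list overallocation factor
dense = 3 * WORD * n * k          # (hash, key, value) triples, overallocated
print(round(sparse + dense))      # 2721, i.e. "around 2700"
```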