Balm in GILead

Fast string construction for CPython extensions

Speed is found in the minds of people

— Andrei Alexandrescu

Python isn’t slow. The core eval loop, while slower than a JIT, is no slouch when it comes to dispatching bytecode. There is no reason that business logic written in Python which orchestrates the operation of highly optimized extension libraries should be a bottle neck.

Well, there are two reasons. The first is the infamous Global Interpreter Lock, the GIL, which forces serialization onto operations that might otherwise have been able to be executed in parallel by extensions. This is widely recognized and improving, the future appears GIL-less, or at least GIL-optional.

Yet the second I rarely see discussed except among hardcore extension devs: building PyObjects for blessed types like strings and dictionaries is damn expensive. Every call to PyUnicode_New() makes me quiver with fear, no matter how optimized CPython’s small object allocator is. Worse yet, as long as the GIL still walks among us, one must hold the GIL to create objects of the blessed types.

The common solution is to abandon the world of blessed types. No PyLists, no PyUnicodes, not even a measely PyLong. Building your own types makes you strong, like ox, or Mark Shannon. Once a type exists outside the blessed Python type system you are free to create your own allocators and deallocators for it, and you may optimize its operations as you see fit.

Of course, you now must do a great deal of work to ensure that TimmysDict interops with all the other Python types, and god forbid it ever encounters a PetersDict. You’re also banned by international treaty (and several local state laws) from ever interacting with Python standards like WSGI, which dictate only pure-bred Python types may be used.

GIL Balm’ing

The uncommon solution to this problem is something my dumb challenged creative friends and I call GIL Balm’ing (the name, of course, is derived from a sacred text of the Italian American native faith: raised Catholic). In a sentence, GIL Balm’ing is the act of intrusively adapting a native CPython type to be used outside the context of the GIL.

In the simplest case, this means replacing tp_new, tp_free, and tp_dealloc of the type you are Balm’ing with something more suitable (often NULL). Every Python type is a little different and no two applications the same, so the exact steps to get this working depends on the use case.

The Python str type serves as a useful example because we have two major use cases that lots of applications benefit from:

Python str is also useful because its structure is amenable to various Balm’ing strategies.

The Measure of a String

In order to Balm the str, we must first understand the str. The CPython source code is extensively commented and from it we learn there are three string structures:

This brings us to the second point about GIL Balm’ing, this is a general technique, not a specific library or code recommendation. Python strings are flexible structures that can be adapted to many applications, which you choose to do is up to you. Will you support Unicode or only ASCII? Do you want to take advantage of the compact string representations or just plug pointers into legacy strings?

Now that we know the string, we can adapt it. For this demonstration we’ll be targeting the string view and pooling uses described above (with a fallback for strings that don’t fit inside our compact representation), but only for ASCII strings. Figure 1 demonstrates a structure we’ll be exploring.

The first thing to note is creating a new C type is not a necessary part of GIL Balm’ing, often we can re-use the structures from CPython directly. Here we do so in order to have one unified string type for both of our use cases, string views and compact strings.

The second is that we’re replicating the internals of the PyASCIIObject inside an anonymous structure. Sometimes this sort of replication is necessary, either because CPython doesn’t expose the necessary structures (looking at you, dict), or because we want to do a little smuggling.

For BalmString, we want to store some state data inside the structure and PyASCIIObject very conveniently has a bunch of unused padding bits. It’s a simple thing to use some of those for our state data.

The Author Hasn’t Read Knuth

There is probably a very efficient way to collect and dispatch a group of pre-allocated objects. If you know of one, please implement it and use it and stuff, that sounds great.

We’re going to use this list-stack thing in Figure 2.

I don’t think this code needs too much commentary for those who can read C. For those who can’t, thanks for tagging along, we’re happy to have you.

We’re going to be creating a list or a queue or something, whatever, a pool of pre-allocated BalmStrings. We can push to and pop from the pool instead of having to allocate and deallocate objects from the memory manager. If the pool is empty, we can allocate more BalmStrings using whatever allocation strategy the user wants.

The only thing of note is the layout of the BalmStringNode, the next pointer comes before the BalmString itself. This makes things a little more complicated than they need to be, because we have to do some pointer arithmetic to recover a BalmStringNode from a given BalmString instead of pointer casting between them directly.

However, recall the layout of a compact Python string, the data immediately follows the object in memory. If we placed the next pointer after the str, it would be considered a part of the string. That’s nonsense, so we make do with some macros to recover the BalmStringNode.

More interesting are the block allocators in Figure 3.

Both balmstr_block_alloc and compactbalmstr_block_alloc allocate len number of their respective object types and then iterate over the allocation linking up the objects.

What’s interesting is the initialization we can do here. For the regular balmstr_block_alloc we can only set up the parts of the PyASCIIStringObject state that we know will always be true; its ob_type will be BalmString_Type, it will only contain ASCII strings, and the underlying data will occupy one byte per code unit.

For the compactbalmstr_block_alloc we have more information. We can initialize the balm state data that we smuggled into the padding bits. This tell us, given a random BalmString, the nature of the underlying data.

Your Type is My Type

The entire purpose GIL Balm’ing is to avoid the kosher practice of hand-crafting new Python types. That means we need to steal a pre-existing type and replace its allocators with our own. Figure 4 demonstrates exactly that.

In balm_init (which will be called from the module initialization method of whatever this gets embedded in) we copy the PyUnicode_Type and replace the tp_new with NULL. It won’t be valid to create objects of this type within Python.

Similarly, we replace tp_dealloc with the appropriate deallocation function. We also replace tp_free and I’m not entirely certain if this is necessary, but we’re already playing with fire here so best not tempt the gods.

To complete the ruse, we need to ensure an in-use BalmString carries the correct type and metadata with it. For that we have the functions in Figure 5.

Again, nothing revolutionary here for people who can read C. We’ve created two functions, New_BalmString and New_BalmStringView, that pop a BalmString object from the appropriate pool depending on what type of string the user wants and how big it is.

For compact strings, the only work we need to do is setup the length field, everything else was taken care of in the block allocator. For string view and big strings, we now need to fill in the balm state data and populate the data, length, and utf8 fields appropriately.

Additionally, for this particular Balm’ing, we’re using a reference count to keep track of when to free the underlying string data for the non-compact string types. This might not be appropriate for your application, especially for the string views. That’s another reason why Balm’ing is a technique, not a library.

Call Me Ishmael

What is an optimization? A miserable little pile of code. Unless it is proven, undeniably; trial by benchmark. Let the gods, Turing, Moore, Hennessy, and Patterson, decide which is the best approach.

For this trial I have divided Herman Melville’s Moby Dick somewhat arbitrarily into 7515 lines. I have attached a binary header to the file which describes the offset and length of each line. The goal is to construct a string for each line and deliver it as a 7515-length tuple to the Python interpreter.

Then repeat that exact process another 9999 times and we’re done. In other words, construct roughly 7.5 million Python strings as fast as you can.

And we did it, case closed, 2x speed up with this one weird trick. Except, BalmStrings don’t use a great layout for constructing millions of strings. If we allocated them in blocks we could go even faster…

And we did it, case closed, 5x speed up with this one weird trick. This benchmark is memory bound, it won’t benefit from threading. We would just be measuring the overhead of locking and unlocking the GIL.

But hey, it might be fun, let’s split the 10k runs across 16 threads, 650 runs each.

Disaster for our BalmString implementation. The “naïve” UnicodeString baseline is fully serial, it never releases the GIL to begin with, thus what we’re measuring is pure locking and contention overhead.

BalmString becomes effectively serialized by the act of popping string views from the pool, and the overhead of the highly granular locking strategy is a crushing blow.

BalmBlock, which allocates all its strings as a block up front and thus needs almost no locks at all, barely notices the GIL overhead. It is entirely memory bound, with ~100ms of variance from the OS scheduler.

For highly parallel code, you could argue this is a 20x speed up.

Final Thoughts and Applications

GIL Balm’ing isn’t for everyone. It maybe shouldn’t exist at all, and perhaps I am the fool for even suggesting it. Certainly, it is a fragile, precarious strategy for squeezing speed out of CPython, but benefits from maintaining a very high level of “natural” compatibility with native Python code.

The codebases where I personally apply strategies like this are low-latency libraries and services. Application servers like gunicorn can add a full millisecond of latency to requests, when in principle they should be completely transparent. Some C extensions, notably FastWSGI, demonstrate that the act of packaging and forwarding HTTP requests to CPython should take only microseconds.

FastWSGI never releases the GIL, because it needs to construct Python objects. It cannot run the application in parallel with processing HTTP headers for the next request due to this restriction. GIL Balm’ing then becomes an act of desperation, of last resort when one is unwilling to sacrifice speed or convenience.