Python isn’t slow. The core eval loop, while slower than a JIT, is no slouch when it comes to dispatching bytecode. There is no reason that business logic written in Python which orchestrates the operation of highly optimized extension libraries should be a bottle neck.
Well, there are two reasons. The first is the infamous Global Interpreter Lock, the GIL, which forces serialization onto operations that might otherwise have been able to be executed in parallel by extensions. This is widely recognized and improving, the future appears GIL-less, or at least GIL-optional.
Yet the second I rarely see discussed except among hardcore extension devs:
PyObjects for blessed types like strings and dictionaries is damn
expensive. Every call to
PyUnicode_New() makes me quiver with fear, no matter
how optimized CPython’s small object allocator is. Worse yet, as long as the GIL
still walks among us, one must hold the GIL to create objects of the
The common solution is to abandon the world of blessed types. No
PyUnicodes, not even a measely
PyLong. Building your own types makes you
strong, like ox, or Mark Shannon. Once a type exists outside the blessed Python
type system you are free to create your own allocators and deallocators for it,
and you may optimize its operations as you see fit.
Of course, you now must do a great deal of work to ensure that
interops with all the other Python types, and god forbid it ever encounters a
PetersDict. You’re also banned by international treaty (and several local
state laws) from ever interacting with Python standards like WSGI, which dictate
only pure-bred Python types may be used.
The uncommon solution to this problem is something my
creative friends and I call GIL Balm’ing (the name, of course,
is derived from a sacred text
of the Italian American native faith: raised Catholic). In a sentence, GIL
Balm’ing is the act of intrusively adapting a native CPython type to be used
outside the context of the GIL.
In the simplest case, this means replacing
of the type you are Balm’ing with something more suitable (often
Python type is a little different and no two applications the same, so the exact
steps to get this working depends on the use case.
str type serves as a useful example because we have two major use
cases that lots of applications benefit from:
Converting string views into
strs without performing a copy
strobjects for use without grabbing the GIL or allocating
str is also useful because its structure is amenable to various
The Measure of a String
In order to Balm the
str, we must first understand the
str. The CPython
source code is extensively commented
and from it we learn there are three string structures:
- Compact ASCII (
PyASCIIObject): Promises to store only ASCII bytes, and stores the data immediately following the object in memory
- Compact (
PyCompactUnicodeObject): Similar to the above, but for Unicode encodings
- Legacy String (
PyUnicodeObject): Stores data in a separate buffer pointed to by the object, may or may not be ASCII
This brings us to the second point about GIL Balm’ing, this is a general technique, not a specific library or code recommendation. Python strings are flexible structures that can be adapted to many applications, which you choose to do is up to you. Will you support Unicode or only ASCII? Do you want to take advantage of the compact string representations or just plug pointers into legacy strings?
Now that we know the string, we can adapt it. For this demonstration we’ll be targeting the string view and pooling uses described above (with a fallback for strings that don’t fit inside our compact representation), but only for ASCII strings. Figure 1 demonstrates a structure we’ll be exploring.
The first thing to note is creating a new C type is not a necessary part of GIL Balm’ing, often we can re-use the structures from CPython directly. Here we do so in order to have one unified string type for both of our use cases, string views and compact strings.
The second is that we’re replicating the internals of the
an anonymous structure. Sometimes this sort of replication is necessary, either
because CPython doesn’t expose the necessary structures (looking at you,
dict), or because we want to do a little smuggling.
BalmString, we want to store some state data inside the structure and
PyASCIIObject very conveniently has a bunch of unused padding bits. It’s a
simple thing to use some of those for our state data.
The Author Hasn’t Read Knuth
There is probably a very efficient way to collect and dispatch a group of pre-allocated objects. If you know of one, please implement it and use it and stuff, that sounds great.
We’re going to use this list-stack thing in Figure 2.
I don’t think this code needs too much commentary for those who can read C. For those who can’t, thanks for tagging along, we’re happy to have you.
We’re going to be creating a list or a queue or something, whatever, a pool
BalmStrings. We can
push to and
pop from the pool instead
of having to allocate and deallocate objects from the memory manager. If the
pool is empty, we can allocate more
BalmStrings using whatever allocation
strategy the user wants.
The only thing of note is the layout of the
pointer comes before the
BalmString itself. This makes things a little
more complicated than they need to be, because we have to do some pointer
arithmetic to recover a
BalmStringNode from a given
BalmString instead of
pointer casting between them directly.
However, recall the layout of a compact Python string, the data immediately
follows the object in memory. If we placed the
next pointer after the
str, it would be considered a part of the string. That’s nonsense, so we
make do with some macros to recover the
More interesting are the block allocators in Figure 3.
number of their respective object types and then iterate over the allocation
linking up the objects.
What’s interesting is the initialization we can do here. For the regular
balmstr_block_alloc we can only set up the parts of the
state that we know will always be true; its
ob_type will be
BalmString_Type, it will only contain ASCII strings, and the underlying data
will occupy one byte per code unit.
compactbalmstr_block_alloc we have more information. We can
balm state data that we smuggled into the padding bits.
This tell us, given a random
BalmString, the nature of the underlying data.
Your Type is My Type
The entire purpose GIL Balm’ing is to avoid the kosher practice of hand-crafting new Python types. That means we need to steal a pre-existing type and replace its allocators with our own. Figure 4 demonstrates exactly that.
balm_init (which will be called from the module initialization method
of whatever this gets embedded in) we copy the
PyUnicode_Type and replace
NULL. It won’t be valid to create objects of this type
Similarly, we replace
tp_dealloc with the appropriate deallocation function.
We also replace
tp_free and I’m not entirely certain if this is necessary,
but we’re already playing with fire here so best not tempt the gods.
To complete the ruse, we need to ensure an in-use
BalmString carries the
correct type and metadata with it. For that we have the functions in
Again, nothing revolutionary here for people who can read C. We’ve created two
New_BalmStringView, that pop a
object from the appropriate pool depending on what type of string the user
wants and how big it is.
For compact strings, the only work we need to do is setup the length field,
everything else was taken care of in the block allocator. For string view and
big strings, we now need to fill in the
balm state data and populate the
utf8 fields appropriately.
Additionally, for this particular Balm’ing, we’re using a reference count to
keep track of when to
free the underlying string data for the non-compact
string types. This might not be appropriate for your application, especially
for the string views. That’s another reason why Balm’ing is a technique, not a
Call Me Ishmael
What is an optimization? A miserable little pile of code. Unless it is proven, undeniably; trial by benchmark. Let the gods, Turing, Moore, Hennessy, and Patterson, decide which is the best approach.
For this trial I have divided Herman Melville’s Moby Dick somewhat arbitrarily into 7515 lines. I have attached a binary header to the file which describes the offset and length of each line. The goal is to construct a string for each line and deliver it as a 7515-length tuple to the Python interpreter.
Then repeat that exact process another 9999 times and we’re done. In other words, construct roughly 7.5 million Python strings as fast as you can.
And we did it, case closed, 2x speed up with this one weird trick. Except,
BalmStrings don’t use a great layout for constructing millions of strings.
If we allocated them in blocks we could go even faster…
And we did it, case closed, 5x speed up with this one weird trick. This benchmark is memory bound, it won’t benefit from threading. We would just be measuring the overhead of locking and unlocking the GIL.
But hey, it might be fun, let’s split the 10k runs across 16 threads, 650 runs each.
Disaster for our
BalmString implementation. The “naïve”
baseline is fully serial, it never releases the GIL to begin with, thus what
we’re measuring is pure locking and contention overhead.
BalmString becomes effectively serialized by the act of popping string views
from the pool, and the overhead of the highly granular locking strategy is
a crushing blow.
BalmBlock, which allocates all its strings as a block up front and thus needs
almost no locks at all, barely notices the GIL overhead. It is entirely memory
bound, with ~100ms of variance from the OS scheduler.
For highly parallel code, you could argue this is a 20x speed up.
Final Thoughts and Applications
GIL Balm’ing isn’t for everyone. It maybe shouldn’t exist at all, and perhaps I am the fool for even suggesting it. Certainly, it is a fragile, precarious strategy for squeezing speed out of CPython, but benefits from maintaining a very high level of “natural” compatibility with native Python code.
The codebases where I personally apply strategies like this are low-latency libraries and services. Application servers like gunicorn can add a full millisecond of latency to requests, when in principle they should be completely transparent. Some C extensions, notably FastWSGI, demonstrate that the act of packaging and forwarding HTTP requests to CPython should take only microseconds.
FastWSGI never releases the GIL, because it needs to construct Python objects. It cannot run the application in parallel with processing HTTP headers for the next request due to this restriction. GIL Balm’ing then becomes an act of desperation, of last resort when one is unwilling to sacrifice speed or convenience.