Make it Async

or, the Things that sehe Told Me About
Building Shared Async Resources with ASIO

The expert invariably means the person with the most patience.

— sehe

I think it’s clear that ASIO is both one of the most important libraries in C++ yet to achieve standardization, and one of the worst documented C++ libraries to ever hold such a prestigious position. Trying to learn the ins-and-outs of ASIO’s model without spending arduous amounts of time reading the source code is borderline impossible. Learning the best practices for use cases beyond the trivial examples is Sisyphean.

The state of things is such that the premier advice for learning ASIO in the C++ community is, “just ask sehe.” Relying on a single StackOverflow account as a tutorial mechanism would be catastrophic for most projects, but sehe is so active, insightful, and patient that ASIO almost gets away with it.

Anyway, this is some stuff I learned about using shared resources asynchronously by listening to sehe. I’ll be demonstrating by bolting functionality onto a little echo server and discussing a few other ASIO-isms along the way.

A server! A server! My kingdom for a server!

Brevity is the heart of wit, and all the world has written an echo server. I will not patronize the reader with another such implementation here. That said, complete, compilable, runnable implementations of each figure are available in this repository.

However, in the following examples we will be integrating an additional asynchronous resource into such a server, and to do that we will need the bones of an ASIO client handler. I present my preferred arrangement of bones in Figure 1. This uses the C++20 coroutine-style of asynchronicity, which has rapidly become my preferred mechanism.

We’re using one notable ASIO “trick” here, asio::deferred. In the spirit of ASIO, nowhere is the interaction between the deferred completion token1 and the C++20 co_await operator documented (this gave me a brief existential crisis a few months ago). Normally you would have to ask sehe what’s going on here, but I’ll give you this one for free.

Async functions invoked with deferred create functors which represent deferred async operations, operations which are only started when said functors are invoked with another completion token. Invoking a deferred operation with a deferred completion token is a no-op, it remains a deferred operation.

So what the hell will co_await do with that functor?

It will start the async operation with the coroutine itself as the completion token! Or, you know, something conceptually akin to that. C++20 coroutines are very hard. The point is the awaiting coroutine will resume following completion of the async operation, with any results of said operation returned via the co_await operator.

The same effect can be achieved with the asio::use_awaitable completion token, but the advantage of deferred is no coroutine frame is allocated. The async operation itself is not a coroutine, so there’s no need to shoulder the burden of the coroutine frame allocation.

Is that intuitive? Is that obvious from the general semantics of deferred? If you say yes, you’re a more powerful wizard than I and likely don’t need the rest of this post.

To block or not to block, that is the question

That was a fun diversion, but back to the task at hand. What if, instead of building the behavior of the server directly into the C++, we want to delegate that task to a higher level language like Python? We’re serious C++ programmers, our job is to move bytes around as fast as possible, lesser mortals can concern themselves with the contents of those bytes.

First let’s consider the direct approach, we’ll call into CPython. CPython requires us to hold the Global Interpreter Lock (GIL) before we muck with anything.2 We do this with the PyGILState_Ensure and PyGILState_Release functions.

Figure 2 illustrates an outline of how we could go about this, minus error checking and some other Python minutiae.

Clearly our client_handler is getting a bit long, the Python code begs to be abstracted into its own function, but the real issue is that locking call. When we grab the GIL our executor, the execution resource which is running our coroutine, becomes completely unavailable. We’re no longer stringing together co_awaits, we’ve done a synchronous blocking operation that prevents this executor from being used elsewhere.

What we need is the ability to suspend our handler until the job is complete, as is done with async_read and async_write.

That which we call blocking,
by any other name still blocks

The fundamental unit of asynchronicity in ASIO is the async_result trait,3 which is covered in the ASIO documentation reasonably well. As mentioned in that documentation, we will never touch this trait, always using its helper function async_initiate to handle the twin C++ footguns of type decay and forwarding.4

I won’t repeat the ASIO documentation here, but suffice to say async_initiate is the “make it async” secret sauce of ASIO. You have some function you want to use with ASIO’s model, all you need do is pass it through async_initiate and you’re there.

Well ok, there’s a little bit of work to do. async_initiate wants you to invoke a completion handler at the conclusion of whatever operation you’re performing. This is a proxy for the callback, the coroutine, the intermediate completion object, whatever, that is waiting on the completion of your async operation.

Figure 3A presents an implementation that uses this machinery to run our Python operations.

Not so bad, however there’s a minor ASIO-ism in the invoking of the completion handler. Note that it is std::move’d before we invoke it with the result of our Python operation. Completion handlers are single-use disposable functors, and to prove to ASIO we understand this we must move them prior to invocation.
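The language mechanism behind that rule can be sketched without ASIO at all. In the snippet below, OnceHandler, complete, and deliver are hypothetical names made up for illustration. The point is only that a C++ type is free to make its call operator rvalue-only, and code that always moves the handler before invoking it works with such types as well as with ordinary lambdas. Whether any particular ASIO handler type is written this way is not something I am asserting here.

```cpp
#include <string>
#include <utility>

// Hypothetical handler whose call operator is rvalue-qualified: it can only
// be invoked on an expiring object, which documents "call me once".
struct OnceHandler {
    std::string* out;
    void operator()(std::string result) && { *out = std::move(result); }
};

// Hypothetical tail end of an async operation: deliver the result.
template <class Handler>
void complete(Handler handler, std::string result) {
    // handler(std::move(result));          // does not compile for OnceHandler
    std::move(handler)(std::move(result));  // compiles for any handler
}

// Convenience wrapper for trying it out.
std::string deliver(std::string value) {
    std::string out;
    complete(OnceHandler{&out}, std::move(value));
    return out;
}
```

Writing the invocation as std::move(handler)(...) costs nothing for a plain lambda and keeps the operation compatible with handlers that insist on being consumed.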

Now that we’ve got our own async operation, we can refactor our client handler into Figure 3B. Not only is all the nasty Python code factored out, but we can suspend via co_await until the operation has completed.

But what have we really done here? co_await will suspend our coroutine and initiate the blocking Python operation. That operation will run, by default and because we didn’t intervene, on the current executor… which will block trying to acquire the GIL.

We didn’t change anything.

Well, we made the code a little cleaner and more consistent, but functionally nothing has changed here, our executor still ends up blocked waiting for the GIL. If we had 12 threads serving as executors, they would all end up sitting around waiting for the GIL.

This is somewhat natural. The GIL is a bottleneck for this program. However, imagine a case where the Python interpreter was fast enough to serve two or three IO threads handling the accept, read, and write calls, with a dedicated Python thread handling the data processing. Instead of servicing those IO operations, the executors spend all their time waiting on the GIL, servicing IO calls only in the brief moments when they aren’t blocked on it.

We can imagine many resources that might fall into a similar category: memory pools, hardware devices, stateful thread-unsafe libraries, which all serve as concurrency bottlenecks but can achieve higher throughput if properly managed.

I am an executor,
More dispatch’d to than dispatching

What we want is a queue of operations, a strand of execution, dedicated to just Python. We want executors to submit work to this strand, but not block on it, remaining available for any other work that comes along. When the Python operation is completed, we want the results handed back to the original executor which submitted the work.

The good news is ASIO provides direct support for such strands of execution, with the conveniently named asio::strand. A strand is an executor which guarantees non-concurrent execution of work submitted to it. This removes the need to manage any locks, GIL or otherwise, the strand will guarantee only a single operation is inside the Python interpreter at a time.5
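To make the "no locks around the resource" claim concrete, here is a toy model of the guarantee using only the standard library. ToyStrand and count_on_strand are hypothetical names, and this is not how asio::strand is implemented: a real strand does not own a thread, it serializes work on top of another executor. The sketch only shows the property that matters, namely that jobs submitted from many threads never run concurrently, so the shared counter needs no mutex or atomic of its own.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Toy serialized queue: one worker thread runs submitted jobs one at a time.
class ToyStrand {
public:
    ToyStrand() : worker_([this] { run(); }) {}

    // Drains whatever is still queued, then stops the worker.
    ~ToyStrand() {
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }

    // Submitting never waits for the job to run; it only enqueues.
    void post(std::function<void()> job) {
        { std::lock_guard<std::mutex> lock(m_); jobs_.push_back(std::move(job)); }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // shutting down and fully drained
                job = std::move(jobs_.front());
                jobs_.pop_front();
            }
            job();  // run outside the lock so producers are never held up
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> jobs_;
    bool done_ = false;
    std::thread worker_;  // declared last so it starts after the other members
};

// Several producer threads submit increments of a plain, unprotected int.
int count_on_strand(int threads, int per_thread) {
    int counter = 0;  // deliberately neither atomic nor mutex-guarded
    {
        ToyStrand strand;
        std::vector<std::thread> producers;
        for (int t = 0; t < threads; ++t) {
            producers.emplace_back([&strand, &counter, per_thread] {
                for (int i = 0; i < per_thread; ++i) {
                    strand.post([&counter] { ++counter; });
                }
            });
        }
        for (auto& p : producers) p.join();
    }  // the strand's destructor finishes the remaining jobs
    return counter;
}
```

The mutex in this sketch protects the queue, not the counter. Swap the counter for a Python interpreter and the same reasoning applies: only the serialized jobs ever touch it.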

To build an asynchronous resource which submits all of its work to a provided strand, we’ll use what I call the double-dispatch trick.6 The principles of this trick are demonstrated in Appendix A, performed there outside the context of an echo server to focus only on the properties of asio::dispatch.

dispatch is an asynchronous operation in the same way that async_read and async_write are, which means we can hand it a completion token, or make it a deferred operation and co_await it as with the others.7

dispatch doesn’t read a socket or file, it doesn’t wait on a timer or signal, it doesn’t perform any work at all really. It immediately invokes the completion handler if it can, or schedules the completion handler to be run at the next opportunity otherwise.

So what’s so special about it? It can invoke the completion handler on a different executor. The priority order for which executor dispatch uses (as I understand it) is as follows:

  1. The associated executor for the completion token.8

  2. An executor optionally passed as the first argument to dispatch.

  3. A default system executor, which will run the completion handler somewhere.

In the above example we co_await deferred dispatches which dutifully invoke their completion handlers, resuming the coroutine on the associated executor. Since we bound the strand to the completion token for the first dispatch, the coroutine is resumed on the strand. The second dispatch returns execution to the coroutine’s home executor.
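The three-step priority order listed earlier is small enough to write down as a function. This is a toy encoding of that list as stated, with the same "as I understand it" caveat attached, and not a description of ASIO's internals. Executors are modeled as plain names, an empty string means "none", and pick_executor is a hypothetical helper.

```cpp
#include <string>

// Toy model: executors are just names, and an empty name means "none".
std::string pick_executor(const std::string& associated,
                          const std::string& explicit_arg) {
    if (!associated.empty()) return associated;      // 1. bound to the token
    if (!explicit_arg.empty()) return explicit_arg;  // 2. first argument to dispatch
    return "system";                                 // 3. fallback
}
```

Under this model, binding the strand to the completion token is a case of rule 1, while a bare callable with no associated executor falls through to whatever was passed as the first argument.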

Figure 4 modifies the PythonApp to use the double-dispatch trick. No significant modification of the client_handler is necessary.

The first dispatch delegates to the designated PythonApp executor via the first-argument mechanism. This works because the lambda we’re passing to the first dispatch as a completion token has no associated executor.

Inside that lambda is where the run_impl call is made, running the Python code now that execution has been shifted onto ex_ (presumably a strand). The results are then “appended” to the completion handler.

asio::append is somewhat analogous to std::bind_back, but for completion tokens. It produces a completion token which passes additional arguments to an underlying completion token. In this instance, the underlying completion token is the completion handler for the entire asynchronous operation, and the argument we want it to be invoked with is the result of the Python operation.
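The bind_back analogy can be sketched for plain callables. toy_append and describe below are hypothetical names for this illustration. The real asio::append operates on completion tokens and accepts any number of extra values, while this sketch wraps an ordinary callable and appends exactly one.

```cpp
#include <string>
#include <utility>

// Wraps `handler` so that `extra` is supplied as its final argument.
// Like a completion handler, the wrapper is meant to be called once:
// both the handler and the stored value are moved out on invocation.
template <class Handler, class Extra>
auto toy_append(Handler handler, Extra extra) {
    return [handler = std::move(handler),
            extra = std::move(extra)](auto&&... args) mutable {
        return std::move(handler)(std::forward<decltype(args)>(args)...,
                                  std::move(extra));
    };
}

// Hypothetical handler taking a leading argument plus the appended one.
std::string describe(int code, std::string body) {
    return std::to_string(code) + ":" + body;
}
```

With that, toy_append(describe, std::string("py")) yields a callable that only needs the leading int. A handler that takes nothing but the appended value, as in the Python case described above, is called with no arguments at all.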

The completion handler has an associated executor, the executor it originated from. The second dispatch schedules the completion handler to be run on that executor, and execution inside the original coroutine is resumed by that completion handler.
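Stripped of ASIO, the round trip looks something like the following single-threaded toy. ToyExecutor, async_run_on, and trace_round_trip are hypothetical names, the queues are drained by hand, and the "home" executor is passed explicitly rather than discovered through an associated-executor lookup. It is a sketch of the control flow only: hop to the worker, do the work, hop back with the result attached.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct ToyExecutor;
static ToyExecutor* current = nullptr;  // which executor is running right now

// Toy executor: a named queue of jobs, drained manually.
struct ToyExecutor {
    std::string name;
    std::deque<std::function<void()>> jobs;

    void post(std::function<void()> job) { jobs.push_back(std::move(job)); }

    // Runs everything queued (including jobs added meanwhile); returns the count.
    std::size_t run() {
        std::size_t count = 0;
        while (!jobs.empty()) {
            auto job = std::move(jobs.front());
            jobs.pop_front();
            current = this;
            job();
            current = nullptr;
            ++count;
        }
        return count;
    }
};

// Do `work` on `worker`, then deliver its result to `handler` on `home`.
void async_run_on(ToyExecutor& worker, ToyExecutor& home,
                  std::function<std::string()> work,
                  std::function<void(std::string)> handler) {
    // First hop: move onto the worker's queue.
    worker.post([&home, work = std::move(work),
                 handler = std::move(handler)]() mutable {
        std::string result = work();
        // Second hop: send the handler home with the result attached.
        home.post([handler = std::move(handler),
                   result = std::move(result)]() mutable {
            handler(std::move(result));
        });
    });
}

// Records where each step ran.
std::vector<std::string> trace_round_trip() {
    ToyExecutor io{"io", {}};
    ToyExecutor python{"python", {}};
    std::vector<std::string> trace;

    async_run_on(python, io,
        [&trace] {
            trace.push_back("work on " + current->name);
            return std::string("ECHO");
        },
        [&trace](std::string r) {
            trace.push_back("handler on " + current->name + " got " + r);
        });

    io.run();      // nothing queued here yet: io stays free for other work
    python.run();  // first hop lands, the work runs, the second hop is queued
    io.run();      // the handler runs back home with the result
    return trace;
}
```

The first io.run() is the point of the whole exercise: while the work waits in the python queue, the io executor has nothing of ours blocking it.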

Damn that was a lot. Everybody take five.

I am but mad North-North-West.
When the executors are blocked,
I know a hawk from a handsaw

Anyway that’s what I learned, I hope it’s useful. Written out, it’s not that much information, but it took weeks of nights and weekends experimenting and running into dead ends to work out this little demo.

When I was a kid I learned networking from reading Beej’s Guide to Network Programming, which got me started down the path of asynchronous programming generally. I struggled to build complete programs though, succeeding merely in little scripts which blasted hardcoded byte sequences into the aether.

Only after I saw a complete, working, practical implementation of the fundamentals Beej had taught me did it click. My first ever asynchronous network program was a Minecraft bot called SpockBot, which started as a wholesale copy of Barney Gale’s barneymc framework.

I think ASIO desperately needs the same. Complete, practical demonstrations that can be directly copied and expanded upon by application developers. It needs its own version of the Sascha Willems Vulkan-Samples, or the LLVM Kaleidoscope tutorial. Not toys, real bedrock developers can build on, supported by complete documentation beyond the spartan API references.

Also the double-dispatch trick is stupid. ASIO should absolutely have a utility function for “run this thing on a different executor and dispatch the completion handler to its associated executor”. Just saying.

  1. Unfortunately I must assume some familiarity on the part of the reader with basic ASIO concepts like completion tokens. The ASIO documentation covers these decently, but if you’re unfamiliar with them it might be worthwhile to work through the ASIO tutorial code before tackling this post.

    I don’t think the constructs presented here make much sense without an intuitive understanding of the problem they’re trying to solve. ↩︎

  2. This is not strictly true under all conditions, and the GIL is undergoing a lot of changes these days. Someday in the future it may disappear entirely. But for right now with the version of Python 3.12 running on most machines, we need to grab the GIL. ↩︎

  3. I have no idea if this is true. It’s the way I see the iceberg from the tip I’m standing on. Sometimes I tie my shoes wrong. Caveat emptor. ↩︎

  4. There’s also async_compose for building operations that chain multiple other asynchronous operations. It does not spark joy. I use coroutines for this. ↩︎

  5. There exist complications with this approach, and strategies for managing those complications, which I’m eliding. They’re Python-specific, so don’t belong in a general discussion of designing asynchronous shared resources.

    The short version is we’ll be preventing Python-native threads from ever running. We will never release the GIL, and manage access to the Python interpreter with our strand. This is a reasonably common restriction for application server implementations. ↩︎

  6. Journeymen ASIO programmers will point out this is not a trick at all, but a natural exercise of ASIO’s intended capabilities. Unix programmers say much the same about the double-fork trick, and yet here we are. ↩︎

  7. dispatch has a very similar cousin, asio::post, and executors have a related underlying method .execute, which won’t be discussed. Reader exploration is encouraged, but it’s more than I have space for here. ↩︎

  8. In an asio::awaitable / co_await context this always defaults to the coroutine’s current executor if none other exists, thus never advancing past this step. ↩︎