Wednesday, May 11, 2022
The 2022 Python Language Summit: Upstreaming optimisations from Cinder
In May 2021, the team at Instagram made waves in the world of Python by open-sourcing Cinder, a performance-oriented fork of CPython.
Cinder is a version of CPython 3.8 with a ton of optimisations added to improve speed across a wide range of metrics, including “eager evaluation of coroutines”, a just-in-time compiler, and an “experimental bytecode compiler” that makes use of PEP 484 type annotations.
Now, the engineers behind Cinder are looking to upstream many of these changes so that CPython itself can benefit from these optimisations. At the 2022 Python Language Summit, Itamar Ostricher, an engineer at Instagram, presented on Cinder’s optimisations relating to async tasks and coroutines.
Asyncio refresher
Consider the following (contrived) example. Here, we have a function, IO_bound_function, which depends on some kind of external input to finish what it’s doing (for example, a web request or an attempt to read from a file). We also have another function, important_other_task, which we want to run in the same event loop as IO_bound_function.
import asyncio

async def IO_bound_function():
    """This function could finish immediately... or not!"""
    # Body of this function here

async def important_other_task():
    await asyncio.sleep(5)
    print('Task done!')

async def main():
    # Run both coroutines concurrently in the same event loop
    await asyncio.gather(
        IO_bound_function(),
        important_other_task()
    )
    print("All done!")

if __name__ == "__main__":
    # asyncio.run() expects a coroutine object, so main must be called
    asyncio.run(main())
IO_bound_function could take a long time to complete – but it could also complete immediately. In an asynchronous programming paradigm, we want to ensure that if it takes a long time to complete, the function doesn’t hold up the rest of the program. Instead, IO_bound_function will yield execution to the other thing scheduled in the event loop, important_other_task, letting this coroutine take control of execution for a period.
So far so good – but what if IO_bound_function finishes what it’s doing immediately? In that eventuality, we’re creating a coroutine object for no reason at all, since the coroutine will never have to suspend execution and will never have to reclaim control of the event loop at any future point in time.
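The cost described above can be seen in a minimal sketch that drives a coroutine by hand. The function name finishes_immediately is a hypothetical stand-in for an IO-bound function whose result happens to be available straight away. This is only an illustration of standard coroutine behaviour, not of anything in Cinder.

```python
async def finishes_immediately():
    # Hypothetical stand-in: the "IO" result is already available,
    # so the body never reaches an await that suspends.
    return 42

# Calling the coroutine function runs none of its body;
# it only builds a coroutine object.
coro = finishes_immediately()
kind = type(coro).__name__
print(kind)

# Drive the coroutine one step by hand. Because it never suspends,
# the very first step reaches the return statement, which surfaces
# as StopIteration carrying the return value.
result = None
try:
    coro.send(None)
except StopIteration as stop:
    result = stop.value
print(result)
```

The coroutine object existed only to be stepped once and thrown away, which is the overhead an eager evaluation scheme tries to avoid.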
Call me maybe?
The team at Instagram saw this as an optimisation opportunity. At the “heart” of many of their async-specific improvements, Itamar explained, is an extension to Python’s vectorcall protocol: a new _Py_AWAITED_CALL_MARKER flag, which enables a callee to know that a call is being awaited by a caller.

The addition of this flag means that awaitables can sometimes be eagerly evaluated, and coroutine objects often do not need to be constructed at all.
Ostricher reported that Instagram had seen performance gains of around 5% in their async-heavy workloads as a result of this optimisation.
Pending questions
Significant questions remain about whether these optimisations can be merged into the main branch of CPython, however. Firstly, exact performance numbers are hard to come by: the benchmark Ostricher presented does not isolate Cinder’s async-specific optimisations.
More important might be the issue of fairness. If some awaitables in an event loop are eagerly evaluated, this might change the effective priorities in an event loop, potentially creating backwards-incompatible changes with CPython’s current behaviour.
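To make the ordering concern concrete, here is a small sketch of the kind of scheduling behaviour that existing code may rely on today. The names child, parent and events are hypothetical, and the example uses asyncio.create_task with asyncio’s default settings; it does not model Cinder’s implementation.

```python
import asyncio

events = []

async def child():
    events.append("child ran")

async def parent():
    # With asyncio's default behaviour, create_task() only schedules
    # the coroutine; its body does not start running yet.
    task = asyncio.create_task(child())
    events.append("parent continued")
    # The child gets its turn once the parent yields to the event loop.
    await task

asyncio.run(parent())
print(events)  # ['parent continued', 'child ran']
```

If awaitables were instead evaluated eagerly at the point of the call, the body of child could run before the parent continues, and any code that depends on the current order would observe a different sequence of events.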
Lastly, there are open questions about whether this conflicts with a big change to asyncio that has just been made in Python 3.11: the introduction of task groups. Task groups – a concept similar to “nurseries” in Trio, a popular third-party async framework – are a major evolution in asyncio’s API. But “it’s not completely clear how the Cinder optimisations might apply to Task Groups,” Ostricher noted.
Ostricher’s talk was well received by the audience, but it was agreed that discussion with the maintainers of other async frameworks such as Trio was essential in order to move forward. Guido van Rossum, creator of Python, opined that he could “get over the fairness issue”. The issue of compatibility with task groups, however, may prove more complicated.
Given the newness of task groups in asyncio, there remains a high degree of uncertainty as to how this feature will be used by end users. Without knowing the potential use cases, it is hard to comment on whether and how optimisations can be made in this area.
Python 3.11, if you haven’t heard, is fast. Over the past year, Microsoft has funded a team – led by core developers Mark Shannon and Guido van Rossum – to work full-time on making CPython faster. With additional funding from Bloomberg, and help from a wide range of other contributors from the community, the work has borne fruit. On the pyperformance benchmarks at the time of the beta release, Python 3.11 was around 1.25x faster than Python 3.10, a phenomenal achievement.
But there is more still to be done. At the 2022 Python Language Summit, Mark Shannon presented on where the Faster CPython project aims to go next. The future’s fast.
The first problem Shannon raised was a problem of measurements. In order to know how to make Python faster, we need to know how slow Python is currently. But how slow at doing what, exactly?
Good benchmarks are vital for a project that aims to optimise Python for general usage. For that, the Faster CPython team needs the help of the community at large. The project “needs more benchmarks,” Shannon said – it needs to understand more precisely what the user base at large is using Python for, how they’re doing it, and what makes it slow at the moment (if it is slow!).
A benchmark, Shannon explained, is “just a program that we can time”. Anybody with a benchmark – or even just a suggestion for a benchmark! – that they believe is representative of a larger project they’re working on is invited to submit them to the issue tracker at the python/pyperformance repository on GitHub.
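As a rough illustration of “just a program that we can time”, the sketch below times a small function with the standard library’s timeit module. The workload function is a hypothetical placeholder for code that is representative of a real project, and this is not the format that pyperformance itself expects.

```python
import timeit

def workload():
    # Hypothetical stand-in for code representative of a real project.
    return sum(i * i for i in range(10_000))

# Run the workload 100 times per measurement, take 5 measurements,
# and keep the fastest one, which is usually the least noisy.
calls_per_measurement = 100
timings = timeit.repeat(workload, number=calls_per_measurement, repeat=5)
best = min(timings)
per_call_us = best / calls_per_measurement * 1e6
print(f"best of {len(timings)}: {per_call_us:.1f} microseconds per call")
```

Taking the minimum of several repeats is a common convention for micro-benchmarks, since slower runs tend to reflect interference from the rest of the system rather than the code being measured.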
Nonetheless, the Faster CPython team has plenty to be getting on with in the meantime.
Much of the optimisation work in 3.11 has been achieved through the implementation of PEP 659, a “specializing adaptive interpreter”. The adaptive interpreter that Shannon and his team have introduced tracks individual bytecodes at various points in a program’s execution. When it spots an opportunity, a bytecode may be “quickened”: a slow bytecode that can do many things is replaced by the interpreter with a more specialised bytecode that is very good at doing one specific thing. The work on PEP 659 has now largely been done, but major parts, such as dynamic specialisations of for-loops and binary operations, are still to be completed.
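The idea of quickening can be sketched with a toy model in pure Python. Everything here is an assumption made for illustration: the instruction names ADD and ADD_INT, the WARMUP threshold and the Instruction class are hypothetical, and the real interpreter does this work in C as described in PEP 659.

```python
WARMUP = 2  # hypothetical number of int-only runs before specialising

class Instruction:
    def __init__(self, name):
        self.name = name      # "ADD" (generic) or "ADD_INT" (specialised)
        self.counter = 0      # consecutive runs that saw two ints

def execute(instr, a, b):
    both_ints = type(a) is int and type(b) is int
    if instr.name == "ADD_INT":
        if both_ints:
            # Fast path: the guard passed, so go straight to int addition.
            return int.__add__(a, b)
        # Guard failed: fall back to the generic instruction.
        instr.name = "ADD"
        instr.counter = 0
        return a + b
    # Generic path: handles any operands, and watches what it sees.
    if both_ints:
        instr.counter += 1
        if instr.counter >= WARMUP:
            instr.name = "ADD_INT"  # "quicken" this instruction
    else:
        instr.counter = 0
    return a + b

add = Instruction("ADD")
for _ in range(3):
    execute(add, 1, 2)
name_after_warmup = add.name
print(name_after_warmup)  # ADD_INT

# Operands of a different type fail the guard and undo the specialisation.
joined = execute(add, "a", "b")
name_after_strings = add.name
print(name_after_strings)  # ADD
```

The key design point the toy model tries to capture is that the specialised instruction is only ever an optimisation: a guard checks its assumption on every run, and the generic instruction remains available as a fallback.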
Shannon noted that Python also has essentially the same memory consumption in 3.11 as it did in 3.10. This is something he’d like to work on: a smaller memory overhead generally means fewer reference-counting operations in the virtual machine, a lower garbage-collection overhead, and smoother performance as a result of it all.
Another big remaining avenue for optimisations is the question of C extensions. CPython’s easy interface with C is its major advantage over other Python implementations such as PyPy, where incompatibilities with C extensions are one of the biggest hurdles for adoption by users. The optimisation work that has been done in CPython 3.11 has largely ignored the question of extension modules, but Shannon now wants to open up the possibility of exposing low-level function APIs to the virtual machine, reducing the overhead time of communicating between Python code and C code.
Is that a JIT I see on the horizon?
Lastly, but certainly not least, Shannon said, “everybody wants a JIT compiler… even if it doesn’t make sense yet”.
A JIT (“just-in-time”) compiler is the name given to a compiler that dynamically detects where performance bottlenecks exist in a program as the program is running. Once these bottlenecks have been identified, the JIT compiles these parts of the program on the fly into native machine code in order to speed things up. It’s a similar idea to Shannon’s PEP 659, but goes much further, since the specialising adaptive interpreter never goes beyond the bytecode level.
The idea of using a JIT compiler for Python is hardly new. PyPy’s JIT compiler is the major source of the large performance gains the project has over CPython in some areas. Third-party projects, such as pyjion and numba, bring just-in-time compilation to CPython that’s just a pip install away. Integrating a JIT into the core of CPython, however, would be materially different.
Shannon has historically voiced scepticism about the wisdom of introducing a JIT compiler into CPython itself, and said that work on introducing one is still some way off. A JIT, according to Shannon, will probably not arrive until 3.13 at the earliest, given the amount of lower-hanging fruit that is still to be worked on. The first step towards a JIT, he explained, would be to implement a trace interpreter, which would allow for better testing of concepts and lay the groundwork for future changes.
Playing nicely with the other Python projects
The gains Shannon’s team has achieved are hugely impressive, and likely to benefit the community as a whole in a profound way. But various problems lie on the horizon. Sam Gross’s proposal for a version of CPython without the Global Interpreter Lock (the nogil fork) has potential for speeding up multithreaded Python code in very different ways to the Faster CPython team’s work – but it could also be problematic for some of the optimisations that have already been implemented, many of which assume that the GIL exists. Eric Snow’s dream of achieving multiple subinterpreters within a single process, meanwhile, will have a smaller performance impact on single-threaded code compared to nogil, but could still create some minor complications for Shannon’s team.