
Dev's rocket-jump into Python 3.12 and perf

Python 3.12 through the eyes of a backend engineer

The tastiest features

Python 3.12 was released in October 2023 and contains many interesting performance optimizations and functional improvements.

The most significant changes from the point of view of backend developers are:

  1. Full-fledged support of the Linux perf profiler

  2. A lot of performance optimizations

Let's dive into the changes and evaluate their impact on performance and our problem-solving capabilities.

The perf era has begun

Preparation

To use perf you need the following:

  1. Install the right version of linux-tools; on Ubuntu, for example, these instructions may be used

  2. Ensure that the Python you use is built with the following compiler flags: FLAGS="-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer"

  3. Configure the Linux kernel:

sudo sysctl -w kernel.kptr_restrict=0 
sudo sysctl -w kernel.perf_event_paranoid=-1

The Python interpreter shipped with Ubuntu 22.04 is not built with these options, but that shouldn't be a problem in a container environment - you just need to rebuild your base image with a custom Python build included.

Perf basic usage

To run a Python program under perf, the following command may be used:
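The exact command isn't shown on this page; a typical invocation (the script name and sampling frequency are illustrative, and -X perf enables the perf trampoline added in Python 3.12) could look like:

```shell
# Sample call stacks at ~9999 Hz with call-graph recording,
# writing the profile to perf.data
perf record -F 9999 -g -o perf.data python3.12 -X perf my_script.py
```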

Or you may attach to an existing process by PID and collect events:
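For example (the PID and duration are placeholders):

```shell
# Attach to a running process and sample it for 60 seconds
perf record -F 9999 -g -p <PID> -- sleep 60
```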

The collected performance data may be visualized using the FlameGraph scripts from Brendan Gregg:
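Assuming the FlameGraph repository is cloned next to the recorded profile, the usual pipeline is:

```shell
# Dump the recorded samples, fold the stacks, and render an interactive SVG
perf script > out.perf
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flamegraph.svg
```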

Case #1: string concatenation

Let's start with a simple piece of code that performs a lot of string concatenations. This code definitely has a performance issue because Python strings are immutable:
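The original listing isn't reproduced here, so below is a minimal sketch of such a function (the function name is mine), assuming it repeatedly appends the fixed substring "test" - the pattern mentioned later in this case:

```python
def string_concat(count):
    # Naive concatenation: because Python strings are immutable,
    # each += may reallocate and copy the whole accumulated result
    out = ""
    for _ in range(count):
        out += "test"
    return out
```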

100M repeats of the function above certainly show a lot of CPU usage, but where do we spend all this time?

Running the code under the perf tool generates the following output, which looks pretty weird to performance neophytes but is fairly interesting:

Flame graph for slow string_concat

Using the flame graph above, we can see that our code spends almost 30% of its time in the PyUnicode_Append function, and inside PyUnicode_Append we can see a lot of memory reallocations (unicode_resize) and compaction calls (resize_compact).

Let's apply the standard optimization - use a list to gather all the substrings and join them at the end:
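A sketch of the optimized variant (the function name is mine):

```python
def string_concat_join(count):
    # Appending to a list is cheap; the final string is built once
    # by join instead of via repeated reallocations
    parts = []
    for _ in range(count):
        parts.append("test")
    return "".join(parts)
```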

I observe a ~30% better time, which agrees with the internals we see in the flame graph above:

Flame graph for string_concat with join optimization

The new flame graph shows that we eliminated the string resizing, but we still do a lot of work in the loop.

Can we do better? Probably yes, but I only found an option to optimize this particular string generator, because it uses the same pattern - it always appends the same string, test, to the output.

The fully optimized version of the code looks like:
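Since the generator always appends the same substring, the whole loop collapses into a single string multiplication. A sketch of what such a version might look like:

```python
def string_concat_mul(count):
    # One allocation, no Python-level loop: the repetition happens
    # in C inside str.__mul__
    return "test" * count
```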

Beautiful flame graph for optimized version of string_concat

The final flame graph looks beautiful and optimal.

Case #2: logging and string formatting

I've seen the pylint warning about improper usage of the logging format hundreds of times:

But how expensive is it if we ignore this warning and use Python string formatting in our production code?

For verification I created a couple of scripts which print an object to stderr at debug level while the actual logging level is set to info.
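The scripts themselves aren't reproduced on this page; a minimal sketch of the f-string variant (the logger name, object shape, and iteration count are assumptions) could look like:

```python
import logging
import sys

logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logger = logging.getLogger("app")

obj = {"user_id": 42, "items": list(range(10))}

for _ in range(100_000):
    # The f-string renders obj on every iteration, even though
    # DEBUG records are filtered out at the INFO level
    logger.debug(f"processing object: {obj}")
```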

Logging with f-string formatting

The optimized version of the script above has only one changed line:
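Presumably the change is to pass the object as a lazy formatting argument; a sketch of the same loop with that single line changed:

```python
import logging
import sys

logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logger = logging.getLogger("app")

obj = {"user_id": 42, "items": list(range(10))}

for _ in range(100_000):
    # Lazy %-style formatting: obj is only interpolated if the record
    # actually passes the level check, so nothing is rendered here
    logger.debug("processing object: %s", obj)
```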

And the results are much better (-39.18%):

The flame graph below shows that the string formatting disappeared - it was the root cause of the extra CPU usage:

Logging with lazy string formatting.

In a real-world application you can have thousands of debug messages, and with a service input load of about 1K RPS the extra CPU usage may be significant.

Case #3: reading file with a buffer

The next case is about reading data from a file and counting lines and characters (like wc -lc):
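The listing isn't reproduced here; a minimal sketch of such a buffered reader (the function name is mine; buf_size is the parameter varied in the experiments below):

```python
def count_lines_and_chars(path, buf_size):
    # Read the file in buf_size-byte chunks and count newlines
    # and bytes, similar to `wc -lc`
    lines = chars = 0
    with open(path, "rb") as f:
        while chunk := f.read(buf_size):
            lines += chunk.count(b"\n")
            chars += len(chunk)
    return lines, chars
```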

For this test I created a text file with 10M lines; the overall size is 257M.

The code looks pretty nice, but which buffer size is optimal? Let's select this parameter based on experiments:

buf_size, bytes    time, s
1                  42.095
10                 4.541
100                0.751
1000               0.287
10K                0.207
100K               0.172
1M                 0.225
10M                0.274
100M               0.277
1B                 0.313
10B                0.327
100B               0.311

A graph showing file read time depending on buffer size

As you can see, the optimal buf_size is about 100 Kbytes, and this number may vary depending on OS and Python version. But why do we get performance degradation with buffer sizes larger than 100K? The answer may be found using perf and the flame graph below:

Read from file using 1B buffer

As you can see, with a big buffer the system loses some time in exc_page_fault, which is connected with page fault handling inside the Linux kernel.

Play with sys.monitoring

sys.monitoring is a module for catching a variety of events in Python code with low overhead. All available events are described in the documentation. It may be used by debuggers, profilers, code coverage utilities, or custom tools.

Unfortunately, the official Python docs don't have any meaningful examples or howtos. The simplest use case, counting exceptions in your service, may look like:
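A sketch of such a counter (the tool id value and names are mine; sys.monitoring requires Python 3.12+, so the block is guarded for older interpreters):

```python
import collections
import sys

counters = collections.Counter()

def on_raise(code, instruction_offset, exc):
    counters["raise"] += 1

def on_reraise(code, instruction_offset, exc):
    counters["reraise"] += 1

if hasattr(sys, "monitoring"):  # Python 3.12+
    mon = sys.monitoring
    TEST_TOOL_ID = 2  # any free tool id in range 0..5

    # Register the tool, subscribe to RAISE/RERAISE, attach callbacks
    mon.use_tool_id(TEST_TOOL_ID, "exception_counter")
    mon.set_events(TEST_TOOL_ID, mon.events.RAISE | mon.events.RERAISE)
    mon.register_callback(TEST_TOOL_ID, mon.events.RAISE, on_raise)
    mon.register_callback(TEST_TOOL_ID, mon.events.RERAISE, on_reraise)

    try:
        raise ValueError("boom")
    except ValueError:
        pass

    # Unsubscribe and release the tool id
    mon.set_events(TEST_TOOL_ID, mon.events.NO_EVENTS)
    mon.free_tool_id(TEST_TOOL_ID)
    print(counters)
```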

A custom identifier TEST_TOOL_ID is defined and registered using the use_tool_id call. Then set_events is called to watch only the specified event types - RAISE and RERAISE. Finally, callbacks are registered to count exceptions and exception re-raises separately. The whole code may be found here.

I received rather unexpected results for a basic script with the exception counter and had to read PEP 669 carefully. Then I asked Mark Shannon directly about this behavior, but haven't received an answer yet. More details may be found here.

I suppose that for now it would be difficult to use sys.monitoring to solve backend developers' problems in real production; rather, it's a framework for creating other useful developer tools.

Performance tests

Environment details

Hardware info:

OS/Kernel info:

Python versions:

All Python versions are built from source with the same configuration:

CFLAGS="-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer" ./configure --prefix=/home/dr/local_install/cpython --enable-optimizations

Benchmarks game

Back in 2001 (I was a school pupil at that time) Doug Bagley started the Benchmarks Game project to compare the performance of different languages. Thanks to the infinite curiosity of other great people, the project is still alive and continues its life here.

It's not very convenient for comparing the performance of different Python versions; however, it provides a good set of programs to use as benchmarks.

Here is a short description of each test:

Test              Description
binarytrees       Allocate and deallocate many, many binary trees
fannkuch_redux    The FANNKUCH Lisp benchmark, widely known in narrow circles
fasta             Generate and write random DNA sequences
mandelbrot        Generate a Mandelbrot set portable bitmap file
nbody             Double-precision N-body simulation
spectralnorm      Eigenvalue using the power method
simple            The original simple mandelbrot program from 2004 & 2005
too_simple        Ten lines to loop N million times and sum the Gregory series for Pi
regexredux        Match DNA 8-mers and substitute magic patterns
knucleotide       Hashtable update and k-nucleotide strings
revcomp           Read DNA sequences - write their reverse-complement
pidigits          Streaming arbitrary-precision arithmetic

Here are the results of measurements obtained with the hyperfine tool using the following input:
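The exact invocation isn't shown on this page; a typical hyperfine command for one of the benchmarks (the script name and arguments are illustrative) would look like:

```shell
# Compare the same benchmark across interpreter versions,
# with one warmup run and 10 measured runs each
hyperfine --warmup 1 --runs 10 \
  'python3.10 binarytrees.py 14' \
  'python3.11 binarytrees.py 14' \
  'python3.12 binarytrees.py 14'
```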

Test info            median, s    delta, %    cpu (user, system), s
python 3.10, N=14    0.258        -           1.26, 0.04
python 3.11, N=14    0.185        -28.29      0.86, 0.04
python 3.12, N=14    0.188        +1.62       0.85, 0.04
python 3.10, N=21    37.60        -           252.41, 2.54
python 3.11, N=21    27.59        -26.62      179.09, 2.66
python 3.12, N=21    27.75        +0.58       180.44, 2.62

Test info                median, s    delta, %    cpu (user, system), s
python 3.10, N=50000     3.90         -           3.89, 0.00
python 3.11, N=50000     2.42         -37.95      2.42, 0.00
python 3.12, N=50000     2.28         -5.79       2.30, 0.00
python 3.10, N=500000    39.32        -           39.30, 0.00
python 3.11, N=500000    24.17        -38.53      24.20, 0.00
python 3.12, N=500000    22.38        -7.41       22.39, 0.00

pidigits - failed to build the dependent library python-gmpy2; see the bug report here.

regexredux, knucleotide, and revcomp don't show any significant performance changes across the tested Python versions.

I observe major performance improvements in Python 3.11 compared with 3.10 in the binarytrees, fannkuch_redux, fasta, nbody, spectralnorm, simple, and too_simple benchmarks.

Looking at Python 3.12 vs Python 3.11, the situation is not so obvious: the nbody and simple benchmarks perform better, while mandelbrot, spectralnorm, and too_simple show some degradation.

Detailed results may be found in the GitHub repo with all test results.

Verify asyncio performance boost

The best way to verify performance is to run a production-like workload on a real-world service. However, that may be inconvenient, because you can't share all the results and code of your service with a wide audience. To verify asyncio performance I designed and developed a simple inventory service. It's akin to services I created during my career in several gamedev companies, and it's released under a permissive license for further tests or simply for results verification.

The inventory service provides a set of APIs for working with inventory:

Inventory service basic API

The API is designed using the OpenAPI 3.0 spec. For simplicity, I implemented and tested only the /v1/inventory/read and /v1/inventory/grant APIs.

A short description of the tech stack:

  • Python [3.10.. 3.12]

  • Asyncio + Aiohttp + aiohttp_swagger3

  • asyncpg

  • PostgreSQL [14..16]

To generate load, the awesome wrk2 tool was used. I also created a couple of simple Lua scripts to produce a real-world-like workload. However, all the tests are a little bit artificial, because I've never released this service to real production.

Test       Description
write      100% grant requests; also used for initial DB preparation
read       100% read requests
mix        50% grant, 50% read requests
metrics    getting Prometheus API metrics (database-free workload)

The target request rate was picked to keep the Python worker's CPU usage at about 80-85%. Tests were performed against a single worker deployed to a bare-metal host using virtualenv. PostgreSQL was deployed to the same host; the worker can't create more than 10 connections to PostgreSQL (max_pool_size=10).

The wrk command line looked like:
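The exact command isn't reproduced; based on the table below, a wrk2 invocation for the read test (the Lua script name and URL are assumptions) would look like:

```shell
# -R target request rate, -c connections, -d duration in seconds;
# the Lua script shapes the request payload
wrk -R900 -c 100 -d 120 -s read.lua --latency http://127.0.0.1:8080/v1/inventory/read
```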

Here is the table with results:

Test info                            actual RPS    latency 50%, ms    latency 99%, ms
python 3.10, -R800 -c 100 -d 120     798.64        10.09              63.97
python 3.10, -R900 -c 100 -d 120     892.32        80.45              785.41
python 3.11, -R900 -c 100 -d 120     898.50        6.18               62.27
python 3.11, -R1000 -c 100 -d 120    998.30        8.85               237.31
python 3.12, -R900 -c 100 -d 120     898.51        5.88               50.08
python 3.12, -R1000 -c 100 -d 120    998.30        11.82              331.34

Detailed wrk reports and some info about system load may be found here.

As you can see, Python 3.11/3.12 outperform 3.10 in the write test by about 10-15%. For the read/mix/metrics tests, latency and throughput are pretty similar across the tested Python versions.

Tech debt returning & fun

It is absolutely the right idea to plan tech debt repayment and do it on schedule. According to PEP 594, several modules were removed from the standard library:

  • asynchat and asyncore - no longer needed because asyncio is the winner

  • distutils - available as a third-party module

  • imp - importlib should be used instead

All removed and deprecated entities are listed in the changelog.

I personally love a small fun feature - the command-line interface for the uuid module!

Now I can not only make JSON pretty using Python's json.tool, but also generate UUIDs and use curl more efficiently for exploratory API testing:
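For example (the service URL is illustrative):

```shell
# Generate a single UUID from the command line (new in Python 3.12)
python3.12 -m uuid -u uuid4

# Use it inline in an exploratory API call
curl -X POST "http://localhost:8080/v1/inventory/grant?item_id=$(python3.12 -m uuid -u uuid4)"
```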

Conclusion

I hope the text and numbers above were interesting for backend engineers. As you can see, the Python core team did great work, and you may expect a significant performance boost for your code (10-15%).

However, as usual, it's better to test a new version of Python and its libraries on your own workload before pushing such a big change to production. Also, some libraries don't support Python 3.12 yet.

Obviously, the biggest performance boost comes from the Python 3.11 release. It would also be interesting to investigate why 3.12 shows degradation in some benchmarks.

It's also a good idea to start using perf to understand the behaviour of your systems under load. It's a really powerful tool which may be used in cloud and bare-metal environments. If you are inspired to try it, you can visit its home page. This use case from Liz Rice may also be interesting.
