
Dev's rocket-jump into Python 3.12 and perf

Python 3.12 through the eyes of a backend engineer

The tastiest features

Python 3.12 was released in October 2023 and contains many interesting performance optimizations and functional improvements.

The most significant changes from the point of view of backend developers are:

  1. Full-fledged support of the Linux perf profiler

  2. A lot of performance optimizations

Let's dive into the changes and evaluate their impact on performance and our problem-solving capabilities.

The perf era has begun

Preparation

To use perf you need the following:

  1. Install the right version of linux-tools; on Ubuntu, for example, these instructions may be used

  2. Ensure that the Python you use is built with the following compiler flags: FLAGS="-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer"

  3. Configure the Linux kernel:

sudo sysctl -w kernel.kptr_restrict=0 
sudo sysctl -w kernel.perf_event_paranoid=-1

The Python interpreter shipped with Ubuntu 22.04 is not built with these options, but that shouldn't be a problem in a container environment - you just need to rebuild your base image with a custom Python build included.

Perf basic usage

To run a Python program under perf, the following command may be used:
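The exact command isn't shown on this page; a typical invocation (the script name and sampling frequency are illustrative, and -X perf enables the perf trampoline added in Python 3.12) could look like:

```shell
# Sample call stacks at ~9999 Hz with call-graph recording,
# writing the profile to perf.data
perf record -F 9999 -g -o perf.data python3.12 -X perf my_script.py
```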

Or you may attach to an existing process by PID and collect events:
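For example (the PID and duration are placeholders):

```shell
# Attach to a running process and sample it for 60 seconds
perf record -F 9999 -g -p <PID> -- sleep 60
```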

The collected performance data may be visualized using the FlameGraph scripts from Brendan Gregg:
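Assuming the FlameGraph repository is cloned next to the recorded profile, the usual pipeline is:

```shell
# Dump the recorded samples, fold the stacks, and render an interactive SVG
perf script > out.perf
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flamegraph.svg
```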

Case #1: string concatenation

Let's start with a simple piece of code that performs a lot of string concatenations. This code definitely has a performance issue because Python strings are immutable:
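The original listing isn't reproduced here, so below is a minimal sketch of such a function (the function name is mine), assuming it repeatedly appends the fixed substring "test" - the pattern mentioned later in this case:

```python
def string_concat(count):
    # Naive concatenation: because Python strings are immutable,
    # each += may reallocate and copy the whole accumulated result
    out = ""
    for _ in range(count):
        out += "test"
    return out
```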

100M repeats of the function above certainly show a lot of CPU usage, but where do we spend all this time?

Running the code under the perf tool generates the following output, which looks pretty weird to performance neophytes but is fairly interesting:

Flame graph for slow string_concat

Using the flame graph above, we can see that our code spends almost 30% of its time in the PyUnicode_Append function, and inside PyUnicode_Append we can see a lot of memory reallocations (unicode_resize) and compaction calls (resize_compact).

Let's apply the standard optimization - use a list to gather all the substrings and join them at the end:
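A sketch of the optimized variant (the function name is mine):

```python
def string_concat_join(count):
    # Appending to a list is cheap; the final string is built once
    # by join instead of via repeated reallocations
    parts = []
    for _ in range(count):
        parts.append("test")
    return "".join(parts)
```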

I observe a ~30% better time, which agrees with the internals we see in the flame graph above:

Flame graph for string_concat with join optimization

The new flame graph shows that we eliminated the string resizing, but we still do a lot of work in the loop.

Can we do better? Probably yes, but I only found an option to optimize this particular string generator, because it uses the same pattern - it always appends the same string, test, to the output.

The fully optimized version of the code looks like:
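Since the generator always appends the same substring, the whole loop collapses into a single string multiplication. A sketch of what such a version might look like:

```python
def string_concat_mul(count):
    # One allocation, no Python-level loop: the repetition happens
    # in C inside str.__mul__
    return "test" * count
```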

Beautiful flame graph for optimized version of string_concat

The final flame graph looks beautiful and optimal.

Case #2: logging and string formatting

I've seen the pylint warning about improper usage of the logging format hundreds of times:

But how expensive is it if we ignore this warning and use Python string formatting in our production code?

For verification I created a couple of scripts which print an object to stderr at debug level while the actual logging level is set to info.
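The scripts themselves aren't reproduced on this page; a minimal sketch of the f-string variant (the logger name, object shape, and iteration count are assumptions) could look like:

```python
import logging
import sys

logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logger = logging.getLogger("app")

obj = {"user_id": 42, "items": list(range(10))}

for _ in range(100_000):
    # The f-string renders obj on every iteration, even though
    # DEBUG records are filtered out at the INFO level
    logger.debug(f"processing object: {obj}")
```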

Logging with f-string formatting

The optimized version of the script above has only one changed line:
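Presumably the change is to pass the object as a lazy formatting argument; a sketch of the same loop with that single line changed:

```python
import logging
import sys

logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logger = logging.getLogger("app")

obj = {"user_id": 42, "items": list(range(10))}

for _ in range(100_000):
    # Lazy %-style formatting: obj is only interpolated if the record
    # actually passes the level check, so nothing is rendered here
    logger.debug("processing object: %s", obj)
```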

And the results are much better (-39.18%):

The flame graph below shows that the string formatting disappeared - it was the root cause of the extra CPU usage:

Logging with lazy string formatting.

In a real-world application you can have thousands of debug messages, and with a service input load of about 1K RPS the extra CPU usage may be significant.

Case #3: reading file with a buffer

The next case is about reading data from a file and counting lines and characters (like wc -lc):
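The listing isn't reproduced here; a minimal sketch of such a buffered reader (the function name is mine; buf_size is the parameter varied in the experiments below):

```python
def count_lines_and_chars(path, buf_size):
    # Read the file in buf_size-byte chunks and count newlines
    # and bytes, similar to `wc -lc`
    lines = chars = 0
    with open(path, "rb") as f:
        while chunk := f.read(buf_size):
            lines += chunk.count(b"\n")
            chars += len(chunk)
    return lines, chars
```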

For this test I created a text file with 10M lines; the overall size is 257M.

The code looks pretty nice, but which buffer size is optimal? Let's select this parameter based on experiments:

buf_size, bytes    time, s
1                  42.095
10                 4.541
100                0.751
1000               0.287
10K                0.207
100K               0.172
1M                 0.225
10M                0.274
100M               0.277
1B                 0.313
10B                0.327
100B               0.311

A graph showing file read time depending on buffer size

As you can see, the optimal buf_size is about 100 Kbytes, and this number may vary depending on OS and Python version. But why do we get performance degradation with buffer sizes larger than 100K? The answer may be found using perf and the flame graph below:

Read from file using 1B buffer

As you can see, with a big buffer the system loses some time in exc_page_fault, which is connected with page fault handling inside the Linux kernel.

Play with sys.monitoring

sys.monitoring is a module for catching a variety of events in Python code with low overhead. All available events are described in the documentation. It may be used by debuggers, profilers, code coverage utilities, or custom tools.

Unfortunately, the official Python docs don't have any meaningful examples or howtos. The simplest use case, counting exceptions in your service, may look like:
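A sketch of such a counter (the tool id value and names are mine; sys.monitoring requires Python 3.12+, so the block is guarded for older interpreters):

```python
import collections
import sys

counters = collections.Counter()

def on_raise(code, instruction_offset, exc):
    counters["raise"] += 1

def on_reraise(code, instruction_offset, exc):
    counters["reraise"] += 1

if hasattr(sys, "monitoring"):  # Python 3.12+
    mon = sys.monitoring
    TEST_TOOL_ID = 2  # any free tool id in range 0..5

    # Register the tool, subscribe to RAISE/RERAISE, attach callbacks
    mon.use_tool_id(TEST_TOOL_ID, "exception_counter")
    mon.set_events(TEST_TOOL_ID, mon.events.RAISE | mon.events.RERAISE)
    mon.register_callback(TEST_TOOL_ID, mon.events.RAISE, on_raise)
    mon.register_callback(TEST_TOOL_ID, mon.events.RERAISE, on_reraise)

    try:
        raise ValueError("boom")
    except ValueError:
        pass

    # Unsubscribe and release the tool id
    mon.set_events(TEST_TOOL_ID, mon.events.NO_EVENTS)
    mon.free_tool_id(TEST_TOOL_ID)
    print(counters)
```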

A custom identifier TEST_TOOL_ID is defined and registered using the use_tool_id call. Then set_events is called to watch only the specified event types - RAISE and RERAISE. Finally, callbacks are registered to count exceptions and exception re-raises separately. The whole code may be found here.

I received rather unexpected results for a basic script with the exception counter and had to read PEP 669 carefully. Then I asked Mark Shannon directly about this behavior, but haven't received an answer yet. More details may be found here.

I suppose that for now it would be difficult to use sys.monitoring to solve backend developers' problems in real production; rather, it's a framework for creating other useful developer tools.

Performance tests

Environment details

Hardware info:

OS/Kernel info:

Python versions:

All Python versions are built from source with the same configuration:

CFLAGS="-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer" ./configure --prefix=/home/dr/local_install/cpython --enable-optimizations

Benchmarks game

Back in 2001 (I was a school pupil at that time) Doug Bagley started the Benchmarks Game project to compare the performance of different languages. Thanks to the infinite curiosity of other great people, the project is still alive and continues its life here.

It's not very convenient for comparing the performance of different Python versions; however, it provides a good set of programs to use as benchmarks.

Here is a short description of each test:

Test              Description
binarytrees       Allocate and deallocate many, many binary trees
fannkuch_redux    The FANNKUCH Lisp benchmark, widely known in narrow circles
fasta             Generate and write random DNA sequences
mandelbrot        Generate a Mandelbrot set portable bitmap file
nbody             Double-precision N-body simulation
spectralnorm      Eigenvalue using the power method
simple            The original simple mandelbrot program from 2004 & 2005
too_simple        Ten lines to loop N million times and sum the Gregory series for Pi
regexredux        Match DNA 8-mers and substitute magic patterns
knucleotide       Hashtable update and k-nucleotide strings
revcomp           Read DNA sequences - write their reverse-complement
pidigits          Streaming arbitrary-precision arithmetic

Here are the results of measurements obtained with the hyperfine tool using the following input:
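The exact invocation isn't shown on this page; a typical hyperfine command for one of the benchmarks (the script name and arguments are illustrative) would look like:

```shell
# Compare the same benchmark across interpreter versions,
# with one warmup run and 10 measured runs each
hyperfine --warmup 1 --runs 10 \
  'python3.10 binarytrees.py 14' \
  'python3.11 binarytrees.py 14' \
  'python3.12 binarytrees.py 14'
```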

Test info            median, s    delta, %    cpu (user, system), s
python 3.10, N=14    0.258        -           1.26, 0.04
python 3.11, N=14    0.185        -28.29      0.86, 0.04
python 3.12, N=14    0.188        +1.62       0.85, 0.04
python 3.10, N=21    37.60        -           252.41, 2.54
python 3.11, N=21    27.59        -26.62      179.09, 2.66
python 3.12, N=21    27.75        +0.58       180.44, 2.62

Test info                median, s    delta, %    cpu (user, system), s
python 3.10, N=50000     3.90         -           3.89, 0.00
python 3.11, N=50000     2.42         -37.95      2.42, 0.00
python 3.12, N=50000     2.28         -5.79       2.30, 0.00
python 3.10, N=500000    39.32        -           39.30, 0.00
python 3.11, N=500000    24.17        -38.53      24.20, 0.00
python 3.12, N=500000    22.38        -7.41       22.39, 0.00

pidigits - failed to build the dependent library python-gmpy2; see the bug report here.

regexredux, knucleotide, and revcomp don't show any significant performance changes across the tested Python versions.

I observe major performance improvements in Python 3.11 compared with 3.10 in the binarytrees, fannkuch_redux, fasta, nbody, spectralnorm, simple, and too_simple benchmarks.

Looking at Python 3.12 vs Python 3.11, the situation is not so obvious: the nbody and simple benchmarks perform better, while mandelbrot, spectralnorm, and too_simple show some degradation.

Detailed results may be found in the GitHub repo with all test results.

Verify asyncio performance boost

The best way to verify performance is to run a production-like workload on a real-world service. However, that may be inconvenient, because you can't share all the results and code of your service with a wide audience. To verify asyncio performance I designed and developed a simple inventory service. It's akin to services I created during my career in several gamedev companies, and it's released under a permissive license for further tests or simply for results verification.

The inventory service provides a set of APIs for working with inventory:

Inventory service basic API

The API is designed using the OpenAPI 3.0 spec. For simplicity, I implemented and tested only the /v1/inventory/read and /v1/inventory/grant APIs.

A short description of the tech stack:

  • Python [3.10.. 3.12]

  • Asyncio + Aiohttp + aiohttp_swagger3

  • asyncpg

  • PostgreSQL [14..16]

To generate load, the awesome wrk2 tool was used. I also created a couple of simple Lua scripts to produce a real-world-like workload. However, all the tests are a little bit artificial, because I've never released this service to real production.

Test       Description
write      100% grant requests; also used for initial DB preparation
read       100% read requests
mix        50% grant, 50% read requests
metrics    getting Prometheus API metrics (database-free workload)

The target request rate was picked to keep the Python worker's CPU usage at about 80-85%. Tests were performed against a single worker deployed to a bare-metal host using virtualenv. PostgreSQL was deployed to the same host; the worker can't create more than 10 connections to PostgreSQL (max_pool_size=10).

The wrk command line looked like:
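The exact command isn't reproduced; based on the table below, a wrk2 invocation for the read test (the Lua script name and URL are assumptions) would look like:

```shell
# -R target request rate, -c connections, -d duration in seconds;
# the Lua script shapes the request payload
wrk -R900 -c 100 -d 120 -s read.lua --latency http://127.0.0.1:8080/v1/inventory/read
```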

Here is the table with results:

Test info                            actual RPS    latency 50%, ms    latency 99%, ms
python 3.10, -R800 -c 100 -d 120     798.64        10.09              63.97
python 3.10, -R900 -c 100 -d 120     892.32        80.45              785.41
python 3.11, -R900 -c 100 -d 120     898.50        6.18               62.27
python 3.11, -R1000 -c 100 -d 120    998.30        8.85               237.31
python 3.12, -R900 -c 100 -d 120     898.51        5.88               50.08
python 3.12, -R1000 -c 100 -d 120    998.30        11.82              331.34

Detailed wrk reports and some info about system load may be found here.

As you can see, Python 3.11/3.12 outperform 3.10 in the write test by about 10-15%. For the read/mix/metrics tests, latency and throughput are pretty similar across the tested Python versions.

Tech debt returning & fun

It is absolutely the right idea to plan tech debt repayment and do it on schedule. According to PEP 594, several modules were removed from the standard library:

  • asynchat and asyncore - no longer needed because asyncio is the winner

  • distutils - available as a third-party module

  • imp - importlib should be used instead

All removed and deprecated entities are listed in the changelog.

I personally love a small fun feature - the command-line interface for the uuid module!

Now I can not only make JSON pretty using Python's json.tool, but also generate UUIDs and use curl more efficiently for exploratory API testing:
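For example (the service URL is illustrative):

```shell
# Generate a single UUID from the command line (new in Python 3.12)
python3.12 -m uuid -u uuid4

# Use it inline in an exploratory API call
curl -X POST "http://localhost:8080/v1/inventory/grant?item_id=$(python3.12 -m uuid -u uuid4)"
```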

Conclusion

I hope the text and numbers above were interesting for backend engineers. As you can see, the Python core team did great work, and you may expect a significant performance boost for your code (10-15%).

However, as usual, it's better to test a new version of Python and its libraries on your own workload before pushing such a big change to production. Also, some libraries don't support Python 3.12 yet.

Obviously, the biggest performance boost comes from the Python 3.11 release. It would also be interesting to investigate why 3.12 shows degradation in some benchmarks.

It's also a good idea to start using perf to understand the behaviour of your systems under load. It's a really powerful tool which may be used in cloud and bare-metal environments. If you are inspired to try it, you can visit its home page. This use case from Liz Rice may also be interesting.
