Performance of Built Binaries

Binaries built with PyOxidizer tend to run faster than those executing via a normal python interpreter. There are a few reasons for this.

Resources Data Compiled Into Binary

Traditionally, when Python needs to import a module, it traverses the entries on sys.path and queries the filesystem to see whether a .pyc file, .py file, etc are available until it finds a suitable file to provide the Python module data. If you trace the system calls of a Python process (e.g. strace -f python3 ...), you will see tons of lstat(), open(), and read() calls performing filesystem I/O.

While filesystems cache the data behind these I/O calls, every time Python looks up data in a file the process needs to context switch into the kernel and then pass data back to Python. Repeated thousands of times - or even millions of times across hundreds or thousands of process invocations - the few microseconds of overhead plus the I/O overhead for a cache miss can add up to significant overhead!

When binaries are built with PyOxidizer, all available Python resources are discovered at build time. An index of these resources along with the raw resource data is packed - often into the executable itself - and made available to PyOxidizer’s custom importer. When PyOxidizer services an import statement, looking up a module is effectively looking up a key in a dictionary: there is no explicit filesystem I/O to discover the location of a resource.

PyOxidizer’s packed resources data supports storing raw resource data inline or as a reference via a filesystem path.

If inline storage is used, resources are effectively loaded from memory, often using 0-copy. There is no explicit filesystem I/O. The only filesystem I/O that can occur is indirect, as the operating system pages a memory page on first access. But this all happens in the kernel memory subsystem and is typically faster than going through a functionally equivalent system call to access the filesystem.

If filesystem paths are stored, the only filesystem I/O we require is to open() the file and read() its file descriptor: all filesystem I/O to locate the backing file is skipped, along with the overhead of any Python code performing this discovery.

We can attempt to isolate the effect of in-memory module imports by running a Python script that attempts to import the entirety of the Python standard library. This test is a bit contrived. But it is effective at demonstrating the performance difference.

Using a stock python3.7 executable and 2 PyOxidizer executables - one configured to load the standard library from the filesystem using Python’s default importer and another from memory:

$ hyperfine -m 50 -- '/usr/local/bin/python3.7 -S' import-stdlib-filesystem import-stdlib-memory
Benchmark #1: /usr/local/bin/python3.7 -S
  Time (mean ± σ):     258.8 ms ±   8.9 ms    [User: 220.2 ms, System: 34.4 ms]
  Range (min … max):   247.7 ms … 310.5 ms    50 runs

Benchmark #2: import-stdlib-filesystem
  Time (mean ± σ):     249.4 ms ±   3.7 ms    [User: 216.3 ms, System: 29.8 ms]
  Range (min … max):   243.5 ms … 258.5 ms    50 runs

Benchmark #3: import-stdlib-memory
  Time (mean ± σ):     217.6 ms ±   6.4 ms    [User: 200.4 ms, System: 13.7 ms]
  Range (min … max):   207.9 ms … 243.1 ms    50 runs

  'import-stdlib-memory' ran
    1.15 ± 0.04 times faster than 'import-stdlib-filesystem'
    1.19 ± 0.05 times faster than '/usr/local/bin/python3.7 -S'

We see that the PyOxidizer executable using the standard Python importer has very similar performance to python3.7. But the PyOxidizer executable importing from memory is clearly faster. These measurements were obtained on macOS and the script imports 506 modules.

A less contrived example is running the test harness for the Mercurial version control tool. Mercurial’s test harness creates tens of thousands of new processes that start Python interpreters. So a few milliseconds of overhead starting interpreters or loading modules can translate to several seconds.

We run the full Mercurial test harness on Linux on a Ryzen 3950X CPU using the following variants:

  • hg script with a #!/path/to/python3.7 line (traditional)
  • hg PyOxidizer executable using Python’s standard filesystem import (oxidized)
  • hg PyOxidizer executable using filesystem-relative resource loading (filesystem)
  • hg PyOxidizer executable using in-memory resource loading (in-memory)

The results are quite clear:

Variant CPU Time (s) Delta (s) % Orig
traditional 11,287 0 100
oxidized 10,735 -552 95.1
filesystem 10,186 -1,101 90.2
in-memory 9,883 -1,404 87.6

These results help us isolate specific areas of speedups:

  • oxidized over traditional is a rough proxy for the benefits of python -S over python. Although there are other factors at play that may be influencing the numbers.
  • filesystem over oxidized isolates the benefits of using PyOxidizer’s importer instead of Python’s default importer. The performance wins here are due to a) avoiding excessive I/O system calls to locate the paths to resources and b) functionality being implemented in Rust instead of Python.
  • in-memory over filesystem isolates the benefits of avoiding explicit filesystem I/O to load Python resources. The Rust code backing these 2 variants is very similar. The only meaningful difference is that in-memory constructs a Python object from a memory address and filesystem must open and read a file using standard OS mechanisms before doing so.

From this data, one could draw a few conclusions:

  • Processing of the site module during Python interpreter initialization can add substantial overhead.
  • Maintaining an index of Python resources such that you can avoid discovery via filesystem I/O provides a meaningful speedup.
  • Loading Python resources from an in-memory data structure is faster than incurring explicit filesystem I/O to do so.

Ignoring site

In its default configuration, binaries produced with PyOxidizer configure the embedded Python interpreter differently from how a python is typically configured.

Notably, PyOxidizer disables the importing of the site module by default (making it roughly equivalent to python -S). The site module does a number of things, such as look for .pth files, looks for site-packages directories, etc. These activities can contribute substantial overhead, as measured through a normal python3.7 executable on macOS:

$ hyperfine -m 500 -- '/usr/local/bin/python3.7 -c 1' '/usr/local/bin/python3.7 -S -c 1'
Benchmark #1: /usr/local/bin/python3.7 -c 1
  Time (mean ± σ):      22.7 ms ±   2.0 ms    [User: 16.7 ms, System: 4.2 ms]
  Range (min … max):    18.4 ms …  32.7 ms    500 runs

Benchmark #2: /usr/local/bin/python3.7 -S -c 1
  Time (mean ± σ):      12.7 ms ±   1.1 ms    [User: 8.2 ms, System: 2.9 ms]
  Range (min … max):     9.8 ms …  16.9 ms    500 runs

  '/usr/local/bin/python3.7 -S -c 1' ran
    1.78 ± 0.22 times faster than '/usr/local/bin/python3.7 -c 1'

Shaving ~10ms off of startup overhead is not trivial!