Performance of Built Binaries¶
Binaries built with PyOxidizer tend to run faster than the same code executed
via a normal python interpreter. There are a few reasons for this.
Resources Data Compiled Into Binary¶
Traditionally, when Python needs to import a module, it traverses the entries
on sys.path and queries the filesystem to see whether a suitable file (a
.py file, a compiled extension, etc.) exists, stopping at the first match
that can provide the module's data. If you trace the system calls of a
Python process (e.g. strace -f python3 ...), you will see tons of stat(),
open(), and read() calls performing filesystem I/O.
While filesystems cache the data behind these I/O calls, every lookup still requires a context switch into the kernel and copying data back to Python. Repeated thousands of times per process - or millions of times across hundreds or thousands of process invocations - the few microseconds per call, plus the I/O cost of any cache miss, add up to significant overhead!
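This path-probing step can be observed from Python itself. A small illustration using the standard importlib machinery (the module name json is chosen arbitrarily):

```python
import importlib.machinery
import sys

# Ask Python's standard path-based finder where a stdlib module lives. Under
# the hood this probes entries on sys.path with stat()/open() calls until it
# finds a match -- the filesystem traversal described above.
spec = importlib.machinery.PathFinder.find_spec("json", sys.path)
print(spec.origin)  # a filesystem path, e.g. .../json/__init__.py
```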
When binaries are built with PyOxidizer, all available Python resources
are discovered at build time. An index of these resources along with
the raw resource data is packed - often into the executable itself -
and made available to PyOxidizer’s
custom importer. When PyOxidizer services an
import statement, looking up a module is effectively looking up a key
in a dictionary: there is no explicit filesystem I/O to discover the
location of a resource.
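PyOxidizer's importer is implemented in Rust, but the dictionary-lookup idea can be sketched in pure Python using the standard import protocol. The RESOURCES dict and MemoryFinder class below are hypothetical illustrations, not PyOxidizer APIs:

```python
import importlib.abc
import importlib.util
import sys

# Module source keyed by module name -- an in-memory stand-in for
# PyOxidizer's packed resources index.
RESOURCES = {
    "greetings": "def hello():\n    return 'hello from memory'\n",
}

class MemoryFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    def find_spec(self, fullname, path=None, target=None):
        if fullname in RESOURCES:  # a dictionary lookup, no filesystem I/O
            return importlib.util.spec_from_loader(fullname, self)
        return None

    def create_module(self, spec):
        return None  # use the default module object

    def exec_module(self, module):
        exec(RESOURCES[module.__name__], module.__dict__)

sys.meta_path.insert(0, MemoryFinder())

import greetings
print(greetings.hello())  # hello from memory
```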
PyOxidizer’s packed resources data supports storing raw resource data inline or as a reference via a filesystem path.
If inline storage is used, resources are effectively loaded from memory, often using 0-copy. There is no explicit filesystem I/O. The only filesystem I/O that can occur is indirect, as the operating system pages a memory page on first access. But this all happens in the kernel memory subsystem and is typically faster than going through a functionally equivalent system call to access the filesystem.
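The "load from memory" idea can also be sketched in pure Python: keep marshaled bytecode in an in-memory buffer and materialize a module from it without touching a file. This is an illustration only; PyOxidizer's actual packed format differs:

```python
import marshal
import types

# Compile once (at "packaging time") and keep the marshaled bytecode in a
# memory buffer instead of a .pyc file on disk.
buf = marshal.dumps(compile("ANSWER = 6 * 7", "<packed>", "exec"))

# "Import time": materialize a module straight from the buffer.
mod = types.ModuleType("packed_example")
exec(marshal.loads(buf), mod.__dict__)
print(mod.ANSWER)  # 42
```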
If filesystem paths are stored, the only filesystem I/O required is to
open() the file and
read() its file descriptor: all the
filesystem I/O to locate the backing file is skipped, along with the
overhead of any Python code performing this discovery.
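A minimal sketch of this mode (not PyOxidizer's actual code): when the index stores a filesystem path, loading a resource is a dict lookup followed by a single open()/read(), with no path discovery:

```python
import os
import tempfile

# Hypothetical index mapping resource name -> absolute path, built at
# packaging time.
path_index = {}

with tempfile.TemporaryDirectory() as d:
    mod_path = os.path.join(d, "example.py")
    with open(mod_path, "w") as f:
        f.write('VALUE = "loaded via path index"\n')
    path_index["example"] = mod_path

    # Loading skips sys.path traversal entirely: open the known path directly.
    with open(path_index["example"]) as f:
        source = f.read()
    ns = {}
    exec(compile(source, path_index["example"], "exec"), ns)
    print(ns["VALUE"])
```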
We can attempt to isolate the effect of in-memory module imports by running a Python script that attempts to import the entirety of the Python standard library. This test is a bit contrived. But it is effective at demonstrating the performance difference.
Using a stock
python3.7 executable and two
PyOxidizer executables - one
configured to load the standard library from the filesystem using Python’s
default importer and another from memory:
$ hyperfine -m 50 -- '/usr/local/bin/python3.7 -S import_stdlib.py' import-stdlib-filesystem import-stdlib-memory
Benchmark #1: /usr/local/bin/python3.7 -S import_stdlib.py
  Time (mean ± σ):     258.8 ms ±   8.9 ms    [User: 220.2 ms, System: 34.4 ms]
  Range (min … max):   247.7 ms … 310.5 ms    50 runs

Benchmark #2: import-stdlib-filesystem
  Time (mean ± σ):     249.4 ms ±   3.7 ms    [User: 216.3 ms, System: 29.8 ms]
  Range (min … max):   243.5 ms … 258.5 ms    50 runs

Benchmark #3: import-stdlib-memory
  Time (mean ± σ):     217.6 ms ±   6.4 ms    [User: 200.4 ms, System: 13.7 ms]
  Range (min … max):   207.9 ms … 243.1 ms    50 runs

Summary
  'import-stdlib-memory' ran
    1.15 ± 0.04 times faster than 'import-stdlib-filesystem'
    1.19 ± 0.05 times faster than '/usr/local/bin/python3.7 -S import_stdlib.py'
We see that the
PyOxidizer executable using the standard Python importer
has very similar performance to
python3.7. But importing from memory is clearly faster. These measurements
were obtained on macOS, and the
import_stdlib.py script imports 506 modules.
A less contrived example is running the test harness for the Mercurial version control tool. Mercurial’s test harness creates tens of thousands of new processes that start Python interpreters. So a few milliseconds of overhead starting interpreters or loading modules can translate to several seconds.
We run the full Mercurial test harness on Linux on a Ryzen 3950X CPU using the following variants:

- the hg script run with a normal python interpreter (traditional)
- an hg PyOxidizer executable using Python’s standard filesystem importer (oxidized)
- an hg PyOxidizer executable using filesystem-relative resource loading (filesystem)
- an hg PyOxidizer executable using in-memory resource loading (in-memory)
The results are quite clear:
[Chart: CPU time (s) for each variant]
These results help us isolate specific areas of speedups:
oxidized over traditional is a rough proxy for the benefits of skipping site module processing (the equivalent of python -S), although there are other factors at play that may be influencing the numbers.
filesystem over oxidized isolates the benefits of using PyOxidizer’s importer instead of Python’s default importer. The performance wins here are due to a) avoiding excessive I/O system calls to locate the paths to resources and b) functionality being implemented in Rust instead of Python.
in-memory over filesystem isolates the benefits of avoiding explicit filesystem I/O to load Python resources. The Rust code backing these two variants is very similar. The only meaningful difference is that in-memory constructs a Python object from a memory address, while filesystem must open and read a file using standard OS mechanisms before doing so.
From this data, one could draw a few conclusions:
Processing of the
site module during Python interpreter initialization can add substantial overhead.
Maintaining an index of Python resources such that you can avoid discovery via filesystem I/O provides a meaningful speedup.
Loading Python resources from an in-memory data structure is faster than incurring explicit filesystem I/O to do so.
In its default configuration, binaries produced with PyOxidizer configure
the embedded Python interpreter differently from how a
python executable configures it.
Notably, PyOxidizer disables the importing of the
site module by
default (making it roughly equivalent to
python -S). The site module
does a number of things, such as looking for
.pth files and for
site-packages directories. These activities can contribute
substantial overhead, as measured against a normal python executable:
$ hyperfine -m 500 -- '/usr/local/bin/python3.7 -c 1' '/usr/local/bin/python3.7 -S -c 1'
Benchmark #1: /usr/local/bin/python3.7 -c 1
  Time (mean ± σ):      22.7 ms ±   2.0 ms    [User: 16.7 ms, System: 4.2 ms]
  Range (min … max):    18.4 ms …  32.7 ms    500 runs

Benchmark #2: /usr/local/bin/python3.7 -S -c 1
  Time (mean ± σ):      12.7 ms ±   1.1 ms    [User: 8.2 ms, System: 2.9 ms]
  Range (min … max):     9.8 ms …  16.9 ms    500 runs

Summary
  '/usr/local/bin/python3.7 -S -c 1' ran
    1.78 ± 0.22 times faster than '/usr/local/bin/python3.7 -c 1'
Shaving ~10ms off of startup overhead is not trivial!
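A minimal check (not specific to PyOxidizer) of whether site processing was skipped in a given interpreter is to inspect sys.flags and sys.modules:

```python
import sys

# Under `python -S` (or a PyOxidizer default build) sys.flags.no_site is 1
# and `site` is not imported at startup; under a stock `python` run,
# no_site is 0 and site is already present in sys.modules.
print("no_site flag:", sys.flags.no_site)
print("site imported at startup:", "site" in sys.modules)
```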