Loading Resource Files

Many Python applications need to load resources. Resources are typically non-Python support files, such as images, config files, etc. In some cases, resources could be Python source or bytecode files. For example, many plugin systems load Python modules outside the context of the normal import mechanism and therefore treat standalone Python source/bytecode files as non-module resources.

oxidized_importer supports loading resource files, but compatibility with Python’s expected behavior may vary.

Python Resource Loading Mechanisms

Before we talk about oxidized_importer’s support for resource loading, it is important to understand how Python code in the wild can load resources.

We’ll cover them in the order they were introduced into the Python ecosystem.

The most basic and oldest mechanism to load resources is to perform raw filesystem I/O. Typically, Python code looks at __file__ to get the filename of the current module. Then, it calculates the directory name and derives paths to resource files using e.g. os.path.join(). It will usually then open() these paths directly.

Python packaging evolved over time. Packaging tools could express various metadata at build time, such as supplementary resource files. This metadata would be installed next to a package and APIs could be used to access it. One such API was pkg_resources. Using e.g. pkg_resources.resource_string("foo", "bar.txt"), you could obtain the content of the resource bar.txt in the foo package.

pkg_resources had useful functionality. And it was the recommended mechanism for loading resource files for several years. But it wasn’t part of the Python standard library and needed to be explicitly installed. So not everyone used it.

Python 3.1 added the importlib package, which is the primary home for all core functionality related to import. Python importers were now defined via interfaces. One of those interfaces is ResourceLoader. It has a single method get_data(path). Given a Python module’s loader (e.g. via the __loader__ attribute on the module), you could call get_data(path) and load a resource. e.g. import foo; foo.__loader__.get_data("bar.txt").
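As a rough illustration of the get_data() call described above, the following sketch builds a throwaway package in a temporary directory and reads a resource through the package's loader. The package name demo_pkg_getdata and the file contents are hypothetical. The sketch also assumes the stock filesystem loader, which expects a filesystem path rather than a bare resource name, so the path is derived from the package's location.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk. The name demo_pkg_getdata is made up.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg_getdata")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as fh:
    fh.write("")
with open(os.path.join(pkg_dir, "bar.txt"), "wb") as fh:
    fh.write(b"hello from bar.txt\n")

# Make the package importable and import it.
sys.path.insert(0, root)
importlib.invalidate_caches()
demo = importlib.import_module("demo_pkg_getdata")

# Assumption: the stock filesystem loader wants a filesystem path, so we
# join the package directory with the resource name.
resource_path = os.path.join(os.path.dirname(demo.__file__), "bar.txt")
data = demo.__loader__.get_data(resource_path)
print(data)
```

Other loaders may interpret the path argument differently, which is part of why this interface is awkward to use portably.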

The standard library only had ResourceLoader for several years. And ResourceLoader wasn’t exactly a convenient API to use because it was so low-level. Many Python applications continued to use pkg_resources or direct file-based I/O.

Python 3.7 introduced significant improvements to resource loading in the standard library.

At a low level, module loaders could now implement a get_resource_reader(name) method, which would return an object implementing the ResourceReader interface. This interface defined methods like open_resource(name) and contents() to open a file-like handle on a named resource and obtain a list of all available resources.

At a high level, the importlib.resources package provided a user-friendly API for interacting with ResourceReader instances. You could call e.g. importlib.resources.open_binary(package, name) to obtain a file-like handle on a specific resource within a package.

Python 3.7’s new resource APIs finally gave the Python standard library access to powerful APIs for loading resources without using a 3rd party package (like pkg_resources).

At the time of writing this in April 2020, it looks like Python 3.9 will invent yet another low-level resource loading API.

Because Python hasn’t had a robust resource loading API in the standard library for much of its history, lots of Python code in the wild does not make use of the APIs in the standard library. It is not uncommon to see code in 2020 that still uses __file__ to load resources. Furthermore, because Python 3.7 is still relatively young and code may wish to maintain compatibility with older Python versions, the newer APIs may be actively avoided.

Important

As of Python 3.8, ResourceReader and importlib.resources are the most robust mechanisms for loading resources and we recommend adopting these APIs if possible.

Support for ResourceReader

oxidized_importer implements the ResourceReader interface for loading resource files.

However, compatibility with Python’s default filesystem-based implementation can vary. Unfortunately, various ResourceReader behavior is undefined, so when the two implementations differ it isn’t clear whether CPython or oxidized_importer is the buggy one.

oxidized_importer maintains an index of known resource files. This index is logically a dict of dicts, where the outer key is the Python package name and the inner key is the resource name. Package names are fully qualified. e.g. foo or foo.bar. Resource names are effectively relative filesystem paths. e.g. resource.txt or subdir/resource.txt. The relative paths always use / as the directory separator, even on Windows.

OxidizedFinder.get_resource_reader() returns instances of OxidizedResourceReader. Each instance is bound to a specific Python package: that’s how they are defined. When an OxidizedResourceReader receives the name of a resource, it performs a simple lookup in the global resources index. If the string key is found, it is used. Otherwise, it is assumed the resource doesn’t exist.

The OxidizedResourceReader.contents() method will return a list of all keys in the internal resources index.
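The index-driven lookup described above can be modeled with a toy sketch. This is not oxidized_importer's actual implementation: RESOURCES_INDEX, ToyResourceReader, and all of the package and resource names are hypothetical and exist only to show how a dict of dicts can back a reader bound to one package.

```python
import io

# Toy model of the resources index: package name -> {resource name -> bytes}.
# All names and contents here are made up for illustration.
RESOURCES_INDEX = {
    "foo": {
        "resource.txt": b"top-level data",
        "subdir/resource.txt": b"nested data",
    },
    "foo.bar": {"resource.txt": b"bar data"},
}


class ToyResourceReader:
    """Hypothetical stand-in for a reader bound to a single package."""

    def __init__(self, package):
        self._resources = RESOURCES_INDEX.get(package, {})

    def is_resource(self, name):
        # A plain key lookup: names containing "/" are treated like any other.
        return name in self._resources

    def contents(self):
        # Keys registered for this package, not a directory listing.
        return list(self._resources)

    def open_resource(self, name):
        # Unknown keys are assumed not to exist.
        if name not in self._resources:
            raise FileNotFoundError(name)
        return io.BytesIO(self._resources[name])
```

Because every operation is a dictionary lookup, the same code path serves a resource regardless of where its content ultimately lives.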

OxidizedResourceReader works the same way for in-memory and filesystem-relative resource locations because internally both use the same index of resources to drive execution: only the location of the resource content varies.

OxidizedResourceReader’s implementation varies from the standard library filesystem-based implementation in the following ways:

  • OxidizedResourceReader.contents() will return keys from the package’s resources dictionary, not all the files in the same directory as the underlying Python package (the standard library uses os.listdir()). OxidizedResourceReader will therefore return resource names in sub-directories as long as those sub-directories aren’t themselves Python packages.

  • Resources must be explicitly registered with OxidizedFinder as such in order to be exposed via the resources API. By contrast, the filesystem-based importer - relying on os.listdir() - will expose all files in a directory as a resource. This includes .py files.

  • OxidizedResourceReader.is_resource() will return True for resource names containing a slash. Contrast with Python’s, which returns False (even though you can open a resource with ResourceReader.open_resource() for the same path). OxidizedResourceReader’s behavior is more consistent.

Support for ResourceLoader

OxidizedFinder implements the deprecated ResourceLoader interface. get_data(path) returns a bytes instance for a registered resource and raises OSError when an unregistered resource is requested.

The path passed to get_data(path) MUST be an absolute path that has the prefix of either the currently running executable file or the directory containing it.

If the resource path is prefixed with the current executable’s path, the path components after the current executable path are interpreted as the path to a resource registered for in-memory loading.

If the resource path is prefixed with the current executable’s directory, the path components after this directory are interpreted as the path to a resource registered for application-relative loading.

All other resource paths aren’t recognized and an OSError will be raised. There is no fallback to loading from the filesystem, even if a valid filesystem path pointing to an existing file is passed in.

Note

The behavior of not servicing paths that actually exist but aren’t registered with OxidizedFinder as resources may be overly opinionated and undesirable for some applications.

If this is a legitimate use case for your application, please create a GitHub issue to request this feature.

Once a path is recognized as having the prefix of the current executable or its directory, the remaining path components will be interpreted as the resource path. This resource path logically contains a package name component and a resource name component. OxidizedFinder will traverse all potential package names starting from the longest/deepest up until the top-level package looking for a known Python package. Once a known package name is encountered, its resources will be consulted. At most 1 package will be consulted for resources.

Here is a concrete example.

If the path is /usr/bin/myapp/foo/bar/resource.txt and the current executable is /usr/bin/myapp, the requested resource will be foo/bar/resource.txt. Since the path was prefixed with the executable path, only resources registered for in-memory loading will be consulted.

Our candidate package names are foo.bar and foo, in that order.

If foo.bar is a known package and resource.txt is registered for in-memory loading, that resource’s contents will be returned.

If foo.bar is a known package and resource.txt is not registered in that package, OSError is raised.

If foo.bar is not a known package, we proceed to check for package foo.

If foo is a known package and bar/resource.txt is registered for in-memory loading, its contents will be returned.

Otherwise, we’re out of possible packages, so OSError is raised.

Similar logic holds for resources registered for filesystem-relative loading. The difference here is the stripped path prefix and we are only looking for resources registered for filesystem-relative loading. Otherwise, the traversal logic is exactly the same.
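The traversal rule can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the real implementation: toy_get_data and IN_MEMORY are hypothetical names, only in-memory resources are modeled, and paths are assumed to use / as the separator.

```python
import errno

# Hypothetical registry: package name -> {resource name -> bytes}.
# Only in-memory resources are modeled here.
IN_MEMORY = {
    "foo": {"bar/resource.txt": b"from foo"},
}


def toy_get_data(path, exe_path, index):
    """Resolve path against index, trying the deepest package name first."""
    prefix = exe_path + "/"
    if not path.startswith(prefix):
        # Unrecognized prefix: no fallback to the filesystem.
        raise OSError(errno.ENOENT, "resource not found", path)

    parts = path[len(prefix):].split("/")

    # Candidate packages run from longest to shortest, e.g. foo.bar then foo.
    for i in range(len(parts) - 1, 0, -1):
        package = ".".join(parts[:i])
        if package in index:
            # At most one package is consulted, so stop at the first known one.
            name = "/".join(parts[i:])
            if name in index[package]:
                return index[package][name]
            break

    raise OSError(errno.ENOENT, "resource not found", path)
```

With the registry above, toy_get_data("/usr/bin/myapp/foo/bar/resource.txt", "/usr/bin/myapp", IN_MEMORY) skips the unknown package foo.bar and serves bar/resource.txt from foo. If foo.bar were a known package without resource.txt, the lookup would fail without ever consulting foo.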

If OSError is raised due to a missing resource, its errno is ENOENT and its filename is the passed-in path. Python should automatically translate this to a FileNotFoundError exception. But callers should catch OSError, as other OSError variants can be raised (e.g. for file permission errors).
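One way a caller might act on this advice is sketched below. The helper name read_resource_or_none is hypothetical, and it assumes only that the loader exposes get_data(path) as described above. It treats a missing resource as a non-fatal condition and lets every other OSError propagate.

```python
import errno


def read_resource_or_none(loader, path):
    """Return resource bytes, or None if the resource is missing."""
    try:
        return loader.get_data(path)
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            # Missing resource: surfaced by Python as FileNotFoundError.
            return None
        # Permission errors and other failures are not swallowed.
        raise
```

Catching OSError and inspecting errno keeps the handler correct even if a loader raises a plain OSError rather than one of its subclasses.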

Support for __file__

OxidizedFinder may or may not set the __file__ attribute on loaded modules. See __file__ and __cached__ Module Attributes for details.

Therefore, Python code relying on the presence of __file__ to derive paths to resource files may or may not work with oxidized_importer.

Code utilizing __file__ for resource loading is highly encouraged to switch to the importlib.resources API. If this is not possible, you can change packaging settings to move the resource locations from in-memory to filesystem-relative, as __file__ is set when loading modules from the filesystem.

Support for pkg_resources

oxidized_importer has support for working with pkg_resources.

oxidized_importer integration with pkg_resources is enabled by calling register_pkg_resources().

If an OxidizedFinder imports the pkg_resources module, register_pkg_resources() may be called automatically.

The pyembed crate and PyOxidizer both have this functionality enabled by default and will likely have OxidizedFinder servicing the pkg_resources import. So there are likely no additional steps needed to enable pkg_resources support in these scenarios.

If you are using oxidized_importer as a standalone extension module in the context of a regular Python interpreter, you may need to call register_pkg_resources() manually to ensure integration is enabled.

To test whether integration is enabled, look for a <class 'OxidizedFinder'>: <class 'OxidizedPkgResourcesProvider'> entry in pkg_resources._provider_factories.
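A small helper can automate that check. The sketch below is an assumption-laden convenience, not part of oxidized_importer: has_oxidized_provider is a hypothetical name, and it matches the two types by class name only so that it can run even where oxidized_importer is not importable.

```python
def has_oxidized_provider(provider_factories):
    """Return True if an OxidizedFinder -> OxidizedPkgResourcesProvider entry exists."""
    for loader_type, provider in provider_factories.items():
        # Compare by class name (an assumption made to avoid importing
        # oxidized_importer here).
        if (getattr(loader_type, "__name__", "") == "OxidizedFinder"
                and getattr(provider, "__name__", "") == "OxidizedPkgResourcesProvider"):
            return True
    return False


try:
    import pkg_resources
except ImportError:
    # pkg_resources is not installed in this environment.
    pkg_resources = None

if pkg_resources is not None:
    print(has_oxidized_provider(pkg_resources._provider_factories))
```

Note that _provider_factories is a private attribute of pkg_resources, so this check is best kept to debugging rather than application logic.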

Distribution Resolving

OxidizedPathEntryFinder is a path entry finder type that responds to paths via the sys.path_hooks mechanism.

Distribution resolution support requires OxidizedFinder.path_hook to be registered on sys.path_hooks and for register_pkg_resources() to have been called. If both these conditions are satisfied, pkg_resources should be able to find package distributions indexed by OxidizedFinder instances.

pkg_resources_find_distributions() is the callable registered with pkg_resources for resolving distributions. It respects path targeting and the flag named only, per the behavior documented by pkg_resources.

Metadata and Resource Resolving

If pkg_resources derives the provider for any module loaded with OxidizedFinder or OxidizedPathEntryFinder, it should create an instance of OxidizedPkgResourcesProvider to resolve package metadata and resource info.

There are known behavior differences with OxidizedPkgResourcesProvider that may result in runtime errors. See that type’s API documentation for more.

Porting Code to Modern Resources APIs

Say you have resources next to a Python module. Legacy code inside a module might do something like the following:

import os


def get_resource(name):
    """Return a file handle on a named resource next to this module."""
    module_dir = os.path.abspath(os.path.dirname(__file__))
    # Warning: there is a path traversal attack possible here if
    # name contains values like ../../../../../etc/passwd.
    resource_path = os.path.join(module_dir, name)

    return open(resource_path, 'rb')

Modern code targeting Python 3.7+ can use the ResourceReader API directly:

def get_resource(name):
    """Return a file handle on a named resource next to this module."""
    # get_resource_reader() may not exist or may return None, which this
    # code doesn't handle.
    reader = __loader__.get_resource_reader(__name__)
    return reader.open_resource(name)

The ResourceReader interface is quite low-level. If you want something higher level or want to access resources outside the current module, it is recommended to use the importlib.resources APIs. e.g.:

import importlib.resources

with importlib.resources.open_binary('mypackage', 'resource-name') as fh:
    data = fh.read()

The importlib.resources functions are glorified wrappers around the low-level interfaces on module loaders. But they do provide some useful functionality, such as additional error checking and automatic importing of modules, making them useful in many scenarios, especially when loading resources outside the current package/module.

Maintaining Compatibility With Python <3.7

If you want to maintain compatibility with Python <3.7, you can’t use ResourceReader or importlib.resources, as they are not available. The recommended solution here is to use a shim.

The best shim to use is importlib_resources. This is a standalone Python package that is a backport of importlib.resources to older Python versions. Essentially, you can always get the APIs from the latest Python version. This shim knows about the various APIs available on Loader instances and chooses the best available one. It should just work with oxidized_importer.
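A common way to wire in the backport is a guarded import, sketched below. The helper name read_resource is hypothetical. The sketch assumes that importlib_resources is installed wherever importlib.resources is missing, and that both expose open_binary().

```python
try:
    # Python 3.7+ ships the API in the standard library.
    import importlib.resources as resources
except ImportError:
    # Older interpreters: fall back to the importlib_resources backport,
    # which must be installed separately.
    import importlib_resources as resources


def read_resource(package, name):
    """Return the bytes of a named resource in a package."""
    with resources.open_binary(package, name) as fh:
        return fh.read()
```

Callers then use read_resource() everywhere and never need to know which of the two modules was imported.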

If you want to implement your own shim without introducing a dependency on importlib_resources, the following code can be used as a starting implementation:

import importlib
import os

try:
    import importlib.resources
    # Defeat lazy module importers.
    importlib.resources.open_binary
    HAVE_RESOURCE_READER = True
except ImportError:
    HAVE_RESOURCE_READER = False

try:
    import pkg_resources
    # Defeat lazy module importers.
    pkg_resources.resource_stream
    HAVE_PKG_RESOURCES = True
except ImportError:
    HAVE_PKG_RESOURCES = False


def get_resource(package, resource):
    """Return a file handle on a named resource in a package."""

    # Prefer ResourceReader APIs, as they are newest.
    if HAVE_RESOURCE_READER:
        # If we're in the context of a module, we could also use
        # ``__loader__.get_resource_reader(__name__).open_resource(resource)``.
        # We use open_binary() because it is simple.
        return importlib.resources.open_binary(package, resource)

    # Fall back to pkg_resources.
    if HAVE_PKG_RESOURCES:
        return pkg_resources.resource_stream(package, resource)

    # Fall back to __file__.

    # We need to first import the package so we can find its location.
    # This could raise an exception!
    mod = importlib.import_module(package)

    # A missing __file__ raises AttributeError on attribute access; a
    # __file__ of None makes os.path.dirname() raise TypeError.
    try:
        package_path = os.path.abspath(os.path.dirname(mod.__file__))
    except (AttributeError, TypeError):
        package_path = None

    if package_path is not None:
        # Warning: there is a path traversal attack possible here if
        # resource contains values like ../../../../etc/passwd. Input
        # must be trusted or sanitized before blindly opening files or
        # you may have a security vulnerability!
        resource_path = os.path.join(package_path, resource)

        return open(resource_path, 'rb')

    # Could not resolve package path from __file__.
    raise Exception('do not know how to load resource: %s:%s' % (
                    package, resource))

(The above code is dedicated to the public domain and can be used without attribution.)

This code is provided for example purposes only. It may or may not be sufficient for your needs.