Nix-gl-host's sole goal is to massage your system graphics driver in order to reuse it in your nix-built OpenGL/Cuda programs.
The "top-level" user-space graphics libraries are injected into your nix-built 3D-graphics program through `LD_LIBRARY_PATH`. By "top-level", we mean the libraries the user-space program directly dynamically links against: the GLX/EGL and Cuda user-space libraries. These "top-level" dynamic shared objects (DSOs) are dynamically loaded by the user-space application.
The actual library name resolution mechanism depends on the graphics/computing API you're using:

- GLX: libglvnd, aka `libGL.so`, acts as a layer of indirection. It queries Xorg and "asks" it which user-space driver `libGLX.so` should load. This user-space driver selection depends on the graphics hardware attached to the screen where we're trying to render the application.
- EGL: libglvnd, aka `libEGL.so`, also acts as a layer of indirection. It loads JSON configuration files that are usually generated by your package manager, or in our case, by nix-gl-host. These JSON configuration files point to the appropriate EGL user-space driver libraries for the current system.
- Cuda: the programs are linked against an SDK which loads the user-space `libcuda.so` driver.
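For the EGL case, those JSON files follow glvnd's vendor "ICD" format. Here is a minimal sketch of how such a file could be generated; the function name, cache directory layout, and file name are illustrative assumptions, not the exact code used by nix-gl-host:

```python
import json
from pathlib import Path


def write_egl_vendor_config(cache_dir: Path, vendor_library: str) -> Path:
    """Write a glvnd EGL vendor ICD file pointing at a host driver copy.

    `cache_dir` and `vendor_library` are hypothetical values used for
    illustration.
    """
    config = {
        "file_format_version": "1.0.0",
        "ICD": {
            # glvnd will dlopen this library when an EGL display is created.
            "library_path": vendor_library,
        },
    }
    config_dir = cache_dir / "egl-confs"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_file = config_dir / "10_nvidia.json"
    config_file.write_text(json.dumps(config, indent=2))
    return config_file


# The directory containing these JSON files can then be exposed through the
# __EGL_VENDOR_LIBRARY_DIRS environment variable so that glvnd picks them up
# instead of the distribution-provided ICDs.
```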
All these user-space drivers are tightly coupled to their kernel-space counterparts. They act as a translation layer between the GL/Cuda instructions and the hardware-specific kernel-space "thin" driver.
In the Nix world, injecting host DSOs into the Nix library path is a big no-no. We usually try to keep them out of the Nix closure, and for a good reason: keeping them out is what makes the closure's runtime behavior reproducible across different systems.
OpenGL, Cuda, and OpenCL are annoying cases though. The runtime closure has to depend on the system hardware to function properly. This requirement directly conflicts with the reproducible nature of Nix closures. Note: this is not an isolated exception; we run into the same issue with the Glibc NSS mechanism and PAM modules.
To add some more chaos to this sorry state, the "top-level" graphics driver DSOs (the ones dynamically linked against the user application) themselves depend on other DSOs. Some are vendor-specific, some are more generic, such as `libpthread`, `libffi`, `libexpat`, etc.
We don't want to inject all these dependencies directly into `LD_LIBRARY_PATH`: we'd potentially create some ABI incompatibilities. So instead, we only inject the "top-level" libraries and rewrite their runpath ELF sections to point to the directory containing the dependencies. That way, the "top-level" DSOs can find their dependencies through their own library path, while the wrapped program ends up with only the "top-level" DSOs in its library path.
You get the idea. This concept seems dodgy at first, and it is to some extent! However, in practice, we haven't run into any ABI incompatibility issues so far.
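To make the split concrete, here is a minimal sketch (not the actual nix-gl-host code) of a wrapper launching the program with only the "top-level" directories in `LD_LIBRARY_PATH`, while the dependencies stay reachable solely through the rewritten runpaths; the cache directory layout is the one described further down:

```python
import os
import sys


def exec_with_host_drivers(cache_dir: str, argv: list[str]) -> None:
    """Run `argv` with only the top-level driver directories injected."""
    top_level_dirs = [os.path.join(cache_dir, d) for d in ("glx", "egl", "cuda")]
    existing = [p for p in os.environ.get("LD_LIBRARY_PATH", "").split(":") if p]
    env = dict(os.environ, LD_LIBRARY_PATH=":".join(top_level_dirs + existing))
    # The `lib` directory is deliberately absent from LD_LIBRARY_PATH: the
    # top-level DSOs reach it through their rewritten runpath instead.
    os.execvpe(argv[0], argv, env)


if __name__ == "__main__":
    # Hypothetical invocation; the real cache path contains a sha256 digest.
    exec_with_host_drivers(os.path.expanduser("~/.cache/nix-gl-host/<sha256>"), sys.argv[1:])
```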
As discussed in the previous section, we have to patch the runpath section of the "top-level" DSOs. That said, we don't want to alter the host graphics stack in any way: we'd risk breaking the host GL/Cuda setup.
Instead, we identify the relevant host DSOs from the library path and copy them to a cache directory. We then patch only the copies living in this cache directory.
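A rough sketch of that copy-then-patch step could look like this, assuming `patchelf` is available on the PATH (the real tool does more bookkeeping and error handling):

```python
import shutil
import subprocess
from pathlib import Path


def copy_and_patch_dso(host_dso: Path, dest_dir: Path, runpath: str) -> Path:
    """Copy a host DSO into the cache and rewrite its runpath.

    The host file is never modified; only the copy living in the cache
    directory gets its RUNPATH ELF entry rewritten.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    cached_dso = dest_dir / host_dso.name
    shutil.copy2(host_dso, cached_dso)
    # patchelf rewrites the runpath of the copied library so it can locate
    # its dependencies relative to the cache layout.
    subprocess.run(["patchelf", "--set-rpath", runpath, str(cached_dso)], check=True)
    return cached_dso


# Hypothetical example: a top-level GLX DSO whose dependencies live in ../lib.
# copy_and_patch_dso(Path("/usr/lib/libGLX_nvidia.so.0"),
#                    Path("~/.cache/nix-gl-host/<sha256>/glx").expanduser(),
#                    "$ORIGIN/../lib")
```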
Overall, the cache directory structure looks like this:

```
cache (XDG_CACHE_HOME, often at ~/.cache/nix-gl-host)
└── <sha256 host path>
    ├── lib
    │   └── [dsos] runpath = ./.
    ├── glx   [injected into LD_LIBRARY_PATH]
    │   └── [dsos] runpath = ../lib
    ├── egl   [injected into LD_LIBRARY_PATH]
    │   └── [dsos] runpath = ../lib
    └── cuda  [injected into LD_LIBRARY_PATH]
        └── [dsos] runpath = ../lib
```
Where:

- sha256 host path: a directory coming from the host library path, e.g. `/usr/lib`, `/usr/lib/gnu-xxx/`, etc. To prevent any name conflict, we sha256-hash the absolute path of the directory and use it as a directory name.
- sha256/lib: the directory containing all the "top-level" DSOs' dependencies.
- sha256/glx: the "top-level" GLX DSOs. Their runpath points to `../lib`.
- sha256/egl: the "top-level" EGL DSOs. Their runpath points to `../lib`.
- sha256/cuda: the "top-level" Cuda DSOs. Their runpath points to `../lib`.
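The per-host-directory naming can be illustrated with a short sketch; the function name is hypothetical, but it follows the hashing scheme described above:

```python
import hashlib
from pathlib import Path


def cache_dir_for(host_lib_dir: str, cache_root: Path) -> Path:
    """Map a host library directory to its per-directory cache entry.

    Hashing the absolute path avoids name collisions between different
    host library directories that would otherwise share a basename.
    """
    digest = hashlib.sha256(host_lib_dir.encode("utf-8")).hexdigest()
    return cache_root / digest


print(cache_dir_for("/usr/lib", Path.home() / ".cache" / "nix-gl-host"))
```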
NixGLHost injects some host DSOs into Nix closures. Even though we're trying to limit the blast radius as much as possible with the runpath trick described in the previous section, if the program also uses its own (Nix-provided) copies of some of the injected DSOs, we could still run into ABI incompatibilities.
The list of available symbols is global, so which version of a symbol ends up in memory depends on the loading order. With the current design, the version of a loaded symbol depends on what loaded it first: the main program loads the Nix-provided symbols, and the graphics driver loads the host-provided symbols. The order is somewhat unpredictable; symbols are resolved on a first-come-first-served basis.
There are ways to isolate these symbols further, such as the dlmopen-based approach used by libcapsule. That said, adopting this isolation trick would force us to explicitly list the exposed ABI symbols of each and every DSO, which is far from trivial.
After this rather scary warning, it's important to mention that we haven't run into such an incompatibility so far. The NVidia proprietary driver dependencies seem to have a fairly stable ABI.