Implement GNU Make jobserver client protocol support in Ninja #2506

Open
digit-google wants to merge 4 commits into master

digit-google
Contributor

@digit-google digit-google commented Oct 1, 2024

This PR implements the GNU Make jobserver client protocol in Ninja. This implements all client modes for Posix (pipe and fifo) and Windows (semaphore).

This protocol allows several participating processes to coordinate parallel build tasks / processing work.
For example, GNU Make, and now Ninja, use it to control how many parallel commands they dispatch at any given time. The Rust compiler and linker, and some C++ compilers (e.g. Clang has -flto=jobserver), use it to control how many parallel threads they run in a single invocation. The protocol is also implemented by the cargo Rust tool (but the latter only sets the CARGO_MAKEFLAGS environment variable).

Client mode is useful when Ninja is invoked as part of a more complex build, that launches several build tasks in parallel (e.g. recursive Make or CMake invocations). In this mode, Ninja detects that MAKEFLAGS contains --jobserver-auth or --jobserver-fds options, and uses the job slot pool to control its own dispatch of parallel build commands. It also passes the MAKEFLAGS value to child processes to let them participate in the coordination protocol.
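
As a rough illustration of the detection step described above, here is a minimal Python sketch of pulling jobserver settings out of a MAKEFLAGS value. It is not the PR's C++ code: `JobserverConfig` and `parse_makeflags` are hypothetical names, whitespace splitting ignores any quoting, and the handling of negative descriptors and of Win32 semaphore names are assumptions. The last matching option wins, as the GNU Make manual recommends.

```python
import re
from typing import NamedTuple


class JobserverConfig(NamedTuple):
    """Hypothetical container for what MAKEFLAGS says about the pool."""
    mode: str = "none"   # "none", "pipe", "fifo" or "semaphore"
    read_fd: int = -1    # pipe mode only
    write_fd: int = -1   # pipe mode only
    name: str = ""       # fifo path, or Win32 semaphore name


def parse_makeflags(makeflags: str) -> JobserverConfig:
    """Return the jobserver settings found in a MAKEFLAGS value."""
    config = JobserverConfig()
    for arg in makeflags.split():
        for prefix in ("--jobserver-auth=", "--jobserver-fds="):
            if not arg.startswith(prefix):
                continue
            value = arg[len(prefix):]
            fds = re.fullmatch(r"(-?\d+),(-?\d+)", value)
            if value.startswith("fifo:"):
                config = JobserverConfig("fifo", name=value[len("fifo:"):])
            elif fds:
                read_fd, write_fd = int(fds.group(1)), int(fds.group(2))
                if read_fd >= 0 and write_fd >= 0:
                    config = JobserverConfig("pipe", read_fd, write_fd)
                else:
                    # Assumption: negative descriptors mean "no pool here".
                    config = JobserverConfig()
            else:
                # Assumption: any other value is a Win32 semaphore name.
                config = JobserverConfig("semaphore", name=value)
    return config
```

For instance, `parse_makeflags("k -j48 --jobserver-auth=4,5")` yields a pipe configuration with descriptors 4 and 5.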

This also includes a new script, misc/jobserver_pool.py, a standalone job slot pool implementation that any client can use directly for testing.
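
To give an idea of what a standalone pipe-based pool involves, here is a minimal POSIX-only sketch. It is not the contents of misc/jobserver_pool.py: `create_pipe_pool` and `run_with_pool` are hypothetical names, and the sketch assumes the GNU Make convention of one `+` byte per token, with one fewer token than job slots because the launched command holds an implicit slot.

```python
import os
import subprocess
import sys


def create_pipe_pool(job_count: int):
    """Create a pipe holding job_count - 1 tokens; return (read_fd, write_fd, makeflags)."""
    if job_count < 1:
        raise ValueError("job_count must be at least 1")
    read_fd, write_fd = os.pipe()
    os.write(write_fd, b"+" * (job_count - 1))
    # Both option spellings are emitted so that older and newer clients find the pool.
    makeflags = (f" -j{job_count} --jobserver-fds={read_fd},{write_fd}"
                 f" --jobserver-auth={read_fd},{write_fd}")
    return read_fd, write_fd, makeflags


def run_with_pool(command, job_count: int) -> int:
    """Run command with MAKEFLAGS pointing at a fresh pool; return its exit code."""
    read_fd, write_fd, makeflags = create_pipe_pool(job_count)
    env = dict(os.environ, MAKEFLAGS=makeflags)
    try:
        # pass_fds keeps both descriptors open in the child under the same numbers.
        return subprocess.run(command, env=env,
                              pass_fds=(read_fd, write_fd)).returncode
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

A call such as `run_with_pool(["ninja", "-C", "out"], 8)` would then let Ninja and its sub-commands share eight job slots; the directory name is only an example.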

This has been tested on large Fuchsia build plans, with certain build configurations that launch 24 sub-Ninja builds from a top-level Ninja build plan. With remote builders enabled, this reduces the total time from 22 minutes to 12 minutes.

This work is inspired by contributions from many other developers, including @hundeboll (see PR #2450), @mcprat (see PR #2474) and @stefanb2 (PR #1140) to name a few.

EDIT: (Removed mention of server mode as this has been pushed to a future PR).


@kalvdans kalvdans left a comment

Some minor comments on the code; I haven't actually tried your branch.

misc/jobserver_pool.py (outdated)
misc/jobserver_pool.py
src/jobserver.cc (outdated)
src/jobserver.h (outdated)
@Neustradamus

To follow this important PR.

src/jobserver.cc (outdated)
misc/jobserver_pool.py (outdated)
src/jobserver-posix.cc (outdated)
@nanonyme

Does this work together with -flto=auto such that ninja limits flto threads when working as job server?

@nekopsykose

Does this work together with -flto=auto such that ninja limits flto threads when working as job server?

for gcc, it should yes. the gcc lto wrapper basically emits a makefile and calls make on it to run the ltrans jobs, and that sub-make will take jobserver tokens from the ninja jobserver in this pr.

for clang, there is no equivalent- no part of the thinlto process or lld itself takes jobserver tokens (but maybe i missed something), that would require implementing the protocol (reading env vars, having lld and perhaps also the linker plugin used in other linkers spawn threads only based on grabbed tokens, ..)

@kaspar030

Just to report: I'm developing laze, a build system calling out to Ninja for actual building.

One of laze's core features is its native support for large build matrices of mostly similar build configurations (e.g., build something for dozens to hundreds of slightly different embedded devices).

In one project, there's a test application that calls out to cmake for building a library, and cmake calls Ninja recursively. So when building this application for 10 boards at the same time (one laze command resulting in one Ninja call), previously, this would on my laptop call 10 cmake+Ninja simultaneously, and each sub-Ninja would run up to 10 gcc instances. While there's enough RAM so this actually finishes, my laptop becomes unresponsive, typing laggy, ...

Using the Ninja binary as built by this PR's CI output and with NINJA_JOBSERVER=1, the number of gcc instances stays at or below 10. Which IMO is the expected behavior and this PR finally adds this to Ninja.

I gave this a hyperfine run to see if total build time is affected, and it seems like the jobserver version is slightly slower, within margin of error:

tests/pkg/relic on  add_laze_buildfiles took 40s371ms 
❯ hyperfine -p "rm -Rf ../../../build" "laze b"
Benchmark 1: laze b
  Time (mean ± σ):     41.419 s ±  1.242 s    [User: 241.710 s, System: 39.031 s]
  Range (min … max):   38.262 s … 42.548 s    10 runs
 
tests/pkg/relic on  add_laze_buildfiles took 6m57s586ms 
❯ NINJA_JOBSERVER=0 hyperfine -p "rm -Rf ../../../build" "laze b"
Benchmark 1: laze b
  Time (mean ± σ):     41.102 s ±  0.968 s    [User: 245.940 s, System: 38.790 s]
  Range (min … max):   38.645 s … 41.908 s    10 runs

(but, the jobserver version did not make typing here laggy ... 🙂)

@sw

sw commented Oct 17, 2024

Using --jobserver without an argument doesn't work for me.

c:\>ninja --jobserver
ninja: fatal: invalid -j parameter

c:\>ninja --jobserver=0
ninja: error: loading 'build.ninja': The system cannot find the file specified.

c:\>ninja --jobserver=1
ninja: error: loading 'build.ninja': The system cannot find the file specified.

Also, setting NINJA_JOBSERVER and then trying to limit the number of parallel builds on the command line doesn't work, but that seems to be intended ("Explicit parallelism (-j), ignoring NINJA_JOBSERVER environment variable."). Maybe that's what @thesamesam alluded to in #1139. I don't really see a reason for this - is it so the child processes don't accidentally try to be jobservers as well?

With ninja -jX --jobserver=1, it seems to work as expected. We are using Ninja alongside CMake on Windows in a project with many ExternalProjects, which up to now would cause the X*N problem. So I hope that a solution can finally be merged.

@robUx4

robUx4 commented Oct 18, 2024

I gave a test on the VLC contrib build which builds in parallel more than a hundred libraries using autotools (make), CMake (ninja) and meson (ninja).

The maximum number of threads seems to be respected on my local machine.

In the CI things are working properly as well in Debian and Ubuntu.

On one machine it logs:

ninja: Jobserver mode detected: k -j48 --jobserver-auth=4,5

On the other:

ninja: Jobserver mode detected: k -j64 -Orecurse --jobserver-auth=3,4

@nanonyme

Using --jobserver without an argument doesn't work for me.

c:\>ninja --jobserver
ninja: fatal: invalid -j parameter

c:\>ninja --jobserver=0
ninja: error: loading 'build.ninja': The system cannot find the file specified.

c:\>ninja --jobserver=1
ninja: error: loading 'build.ninja': The system cannot find the file specified.

Also, setting NINJA_JOBSERVER and then trying to limit the number of parallel builds on the command line doesn't work, but that seems to be intended ("Explicit parallelism (-j), ignoring NINJA_JOBSERVER environment variable."). Maybe that's what @thesamesam alluded to in #1139. I don't really see a reason for this - is it so the child processes don't accidentally try to be jobservers as well?

With ninja -jX --jobserver=1, it seems to work as expected. We are using Ninja alongside CMake on Windows in a project with many ExternalProjects, which up to now would cause the X*N problem. So I hope that a solution can finally be merged.

This sounds like an inconvenient limitation. Why couldn't ninja be jobserver with environment variable when explicit -jN is set?

@digit-google
Contributor Author

Thanks, the reason why --jobserver does not work on Windows for @sw is interesting. On this platform, we use our own src/getopt.c implementation which, apparently, only supports optional arguments for short options, not long ones.

Besides that, the getopt_long() manpage states that arguments for long options should be provided as --option=arg or --option arg only, but does not say how optional arguments should be processed. This means that something like ninja --jobserver <target> is ambiguous, as it would technically be interpreted as equivalent to ninja --jobserver=<target> which will likely fail.

I am going to get rid of the problem by making --jobserver a simple flag, and adding --jobserver-mode=<mode> to specify the mode instead (so --jobserver-mode=0 will be needed to disable the feature even if NINJA_JOBSERVER is defined in the environment).

Apart from that, @nanonyme is correct that this was to prevent child processes from becoming jobservers themselves by accident. However, this can be solved by ensuring that NINJA_JOBSERVER is never defined in these processes, which is simpler, so I'll change this too.

Quick question regarding behavior: Currently:

  • Using an explicit -j1 disables jobserver client mode, as well as pool mode, as this is interpreted as the client not wanting parallel dispatch.

  • Using an explicit -j0 disables jobserver pool mode, but not client mode, as this is interpreted as the client asking for "infinite parallelism", which to me seems only useful to see how badly the system reacts under heavy load.

If anyone thinks this is not reasonable or would create a problem for their workflow, let me know. I selected these conditions on a hunch since I never use these myself (well except -j1 in very rare cases).

@digit-google force-pushed the jobserver branch 2 times, most recently from 4c73cd6 to 146c55d, October 18, 2024 22:20
@kepstin

kepstin commented Oct 21, 2024

Is there any chance that jobserver mode could be made "automatic" by default? I.e. act as a jobserver client if a jobserver is available from the environment, otherwise (if -j is set) start a jobserver pool?

This would simplify the use of ninja quite a bit - you don't have to worry about remembering to set an extra ninja command-line parameter or environment variable to ensure that recursive builds, rust, gcc lto, etc. parallelize properly.

@jhasse
Collaborator

jhasse commented Oct 21, 2024

Hm ... for automatic detection of the client: Should this be done by checking MAKEFLAGS?
And whether to automatically spawn a server: Build edges could specify if they are a jobserver client and if one edge with that option is part of the build, ninja would activate the jobserver automatically.

Some more general comments about this PR:

  1. I don't like the new environment variable NINJA_JOBSERVER, we should keep them to a minimum and I don't see why it is needed.
  2. I would only implement the newer (better) fifo mode on Linux. That way the --jobserver-mode wouldn't be needed.

configure.py (outdated)
doc/manual.asciidoc (outdated)

- Dry-run (i.e. `-n` or `--dry-run`) is not enabled.

- `-j1` (no parallelism) is not used on the command line.
Collaborator

hm ... why though?

Contributor Author

Because -j1 is how you tell Ninja you do not want to launch parallel jobs at all. So using it in the current implementation disables both client and pool mode at the same time (it doesn't make sense to use a pool of one job slot).

Similarly, -j0 means "infinite parallelism", so it disables pool mode, but not client mode.

I can change that if you prefer, but I believe these are sane defaults. I clarified this behavior in the documentation though.

Collaborator

But as a client, couldn't I still be limited by the jobserver? I.e. waiting on a token to become available. But I guess that isn't supported yet, as nothing will wake up Ninja's main loop in that case?

Contributor Author

The possible behaviors are the following when a jobserver pool is in place in MAKEFLAGS:

A) -j1 disables parallel tasks and ignores the pool
B) -j2 and above (also -j0) ignore the command-line job count, and use the pool to control parallelism.
C) Without a command-line -j parameter, use the pool.

D) -j1 is ignored, Ninja uses the pool to control parallelism.
E) -j2 and above (also -j0) ignore the pool and use the command-line job count instead.

The current PR implements A) and B) because -j1 is the only way to tell Ninja that we do not want parallelism, and the only reason to do that would be for debugging a build or because the system is very constrained (e.g. not enough RAM). Hence, ignoring it if a jobserver pool is in place seems unhelpful.

The reason B) exists (explicit job counts are ignored when the pool is in place) is because many build scripts will invoke Ninja with an explicit count, oblivious to the fact that a jobserver is in place. Doing this allows everything to work transparently without modifying tons of configuration files or scripts in complex multi-build systems.

In theory, you can set up a pool with a depth of 1, but since each client receives an implicit job slot when it starts, the resulting behavior is exactly the same whether A) or D) is implemented.
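
The A) / B) / C) policy described in this comment can be summarized in a few lines of Python. This is only a sketch of the rules as stated here, not code from the PR; `parallelism_source` and its return labels are hypothetical.

```python
from typing import Optional


def parallelism_source(pool_in_makeflags: bool,
                       explicit_jobs: Optional[int]) -> str:
    """Say what limits parallel dispatch under the A) / B) / C) policy.

    explicit_jobs is the -j value from the command line, or None if absent.
    """
    if explicit_jobs == 1:
        return "serial"   # A) -j1 disables parallel tasks and ignores the pool
    if pool_in_makeflags:
        return "pool"     # B) and C): the pool wins over any other job count
    return "local"        # no pool: -jN, or Ninja's own default, applies
```

Under these rules `parallelism_source(True, 42)` returns `"pool"`, which is exactly the case some reviewers would like to see flagged with a warning.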


A) sounds pretty reasonable; but I think B) should generate a warning.

Contributor Author

FWIW, there is already an Info() message printed in this case "jobserver detected ...", unless --quiet is specified.

Collaborator

I think I favor C (if I understand it correctly that it means: with -j always ignore the pool).

  1. It's simple. Less code, less documentation.
  2. It's intuitive: When I specify a job count, I would want ninja to respect that and not override the behavior based on an environment variable.
  3. We don't change behavior with an update. ninja 1.12 uses 42 jobs if you pass -j42, ninja 1.13 should too.
  4. We do change the behavior without -j - but that currently already means "auto detect" and I think it's expectable for that auto detect mechanism to improve in an update.
  5. People might currently use -j<some low number> as a workaround to calm ninja because it doesn't respect the jobserver. We shouldn't encourage them to keep a workaround in, so encourage the removal.

src/build.h
src/jobserver-win32.cc
src/jobserver.h (outdated)
src/ninja.cc (outdated)
@digit-google force-pushed the jobserver branch 2 times, most recently from d837d97 to be139bc, October 23, 2024 18:14
@digit-google
Contributor Author

To answer @kepstin's question. Client mode is already setup automatically by inspecting the MAKEFLAGS variable. If it detects that Ninja is called within the context of another jobserver pool, it will automatically use it to control how it dispatches parallel jobs, and will also pass the variable to sub-commands as well to let them participate in the coordination protocol.

On the other hand, starting the pool in Ninja automatically seems risky to me, because it can change much more than Ninja's behavior (all sub-commands as well). For example, there are rare cases (unbalanced builds) where this will result in a slower build overall (e.g. as described here).

That's why I believe the NINJA_JOBSERVER environment variable is a good compromise here. You can set it to 1 and Ninja will automatically start a pool (if not already running in a client context). It also allows this to work without modifying build systems and scripts, or wrappers such as cmake --build, meson compile and many others.

I know @jhasse dislikes environment variables, but I believe that this is one of the rare cases where the benefits outweigh the annoyances.

@digit-google
Contributor Author

To answer @jhasse's latest comment now, client mode is already started automatically by looking at MAKEFLAGS, there is nothing to do to use it, and NINJA_JOBSERVER is only here to control when Ninja implements the pool itself.

Adding a special syntax to the Ninja build plan to indicate that specific actions support the protocol is doable, but it will mean the feature won't be usable until all Ninja generators support it, which may be a veeeeery long time. I don't think it's really useful here. That might be useful to explicitly disable it for certain commands (though on Posix one can simply start the command with MAKEFLAGS= ...., Win32 is more complicated).

For Posix, I think it is far preferable to use --jobserver-mode=pipe as the default at the moment as many multi-build systems run on older distributions that do not have GNU Make 4.4 yet (which is the one which implements fifo mode). Even my Debian 12-based Linux distribution at work is only providing GNU Make 4.3 today.

I imagine we could switch to fifo transparently in a few years when GNU Make 4.4+ is more widespread. But since there is absolutely no performance difference between the two modes, and since our client implementation supports both transparently when Ninja is not the pool implementation, pipe seems a reasonable choice for now.

@eli-schwartz

eli-schwartz commented Oct 23, 2024

That might be useful to explicitly disable it for certain commands (though on Posix one can simply start the command with MAKEFLAGS= ...., Win32 is more complicated).

This is why ninja should support a dedicated syntax to run a build rule with specific environment variables set -- trivial on POSIX, annoying on Win32 so would require invoking processes two different ways depending on whether the build plan has demanded an environment variable, iirc.

Not all ninja generators permit specifying arbitrary shell command syntax for build rules. For precisely the reason of predictable cross platform behavior.

@jhasse
Collaborator

jhasse commented Oct 24, 2024

Even my Debian 12-based Linux distribution at work is only providing GNU Make 4.3 today.

That's not because GNU Make 4.4 is brand new (it's 2 years old actually), but because Debian is such a bad distribution. If we wanted to be pragmatic we could have merged the first PR 8 years ago.

@digit-google
Contributor Author

That's not because GNU Make 4.4 is brand new (it's 2 years old actually), but because Debian is such a bad distribution. If we wanted to be pragmatic we could have merged the first PR 8 years ago.

Well, just like build systems there are only two types of Linux distributions: those that people complain about, and those that nobody uses ;-)

@htot

htot commented Oct 25, 2024

People: the world is bigger than just your distribution's packages. The demand for jobserver support becomes relevant when you are using a build system that builds many different packages using combinations of make, ninja and possibly other methods, like Yocto's bitbake (if you are unfamiliar with that just imagine the build system that builds all of debian's packages). This is why it is unrealistic to convert everything to ninja, there are 1000's of packages to be built.

Now Yocto installs its own make and ninja version, master contains make 4.4.1. To get the jobserver working in bitbake I patched it based on a suggestion from the Yocto developers, and forked one of stephan's versions. This has been working for at least a year.

I would expect a similar need for jobserver in buildroot.

The importance is that if you have a bitbake build machine that does m parallel makes/ninja's on n cores (+ ht), you are going to need about m x n x 1GB RAM (per user building world).

@nanonyme

nanonyme commented Oct 25, 2024

In any case I wouldn't expect stable distributions to pick up 1.12.0 or above as an update. The scheduling changes mean that some legacy builds break. So this PR is not going to apply to historic but future distro releases. In other words, parties who get this are highly likely to have new enough make.

@ArsenArsen

For Posix, I think it is far preferable to use --jobserver-mode=pipe as the default at the moment as many multi-build systems run on older distributions that do not have GNU Make 4.4 yet (which is the one which implements fifo mode). Even my Debian 12-based Linux distribution at work is only providing GNU Make 4.3 today.

I'm not convinced. Odds are that time will start passing, even for Debian, enough to get make 4.4 at around the same time or before the hypothetical Ninja release that'd include this patch, though.

So I'm not sure it's actually advantageous to default to pipe. I'm exporting NINJA_JOBSERVER=fifo for my testing.

@digit-google
Contributor Author

digit-google commented Oct 31, 2024

[...] then won't be able to enable fifo mode until they ensure that all protocol clients (compiler, linker, tools, whatever) also support this mode [...]

I think this will happen later rather than sooner if we make the impression that the pipe mode is an alternative implementation with equivalent support and not a deprecated old method that shouldn't be used.

I am not sure that this is a problem for Ninja to solve, or that "impression" would work at all.

On the other hand, only supporting fifo mode if Ninja implements the pool seems ok, as there exist alternatives for people who still require a pipe jobserver instead. Such as calling jobserver_pool.py manually. Which works, though is less convenient, which is probably the kind of "nudge" you are looking for?

What do you think about the following changes:

  1. Only mention --jobserver-auth=fifo: in manual.asciidoc

I think the documentation should always reflect what the code does.
Doing otherwise is going to make somebody's life miserable, inevitably.

  1. Remove unit tests for pipe mode.

This is even worse imo :( If that code is never tested, it will 100% bit-rot over time, and a new Ninja version will be released that breaks someone's workflows in painful ways (e.g. where getting to the root cause will take a lot of time).

Unit-tests are here to detect regressions as soon as possible, before the code is released.
If the code is documented and executes at runtime, it should have a unit-test (or at a minimum a regression test).

  1. Remove pipe mode from *.py files (that would be a lot more than 50 lines I think).

No, because we need it for the existing regression test for client pipe mode.

  1. Print a warning when pipe mode is used.

That on the other hand is an excellent suggestion.
We could even provide a link to a page explaining why pipe mode is deprecated in favor of fifo.

@nanonyme

How about a build-time flag on whether to enable by default?

@fogti

fogti commented Oct 31, 2024

@nanonyme That's a really bad idea, because it makes the behavior of ninja impossible to predict without looking up if that flag is enabled (such flag should then also be reflected in ninja --version or such, to prevent debugging from being a huge hassle), it's better to unconditionally either enable it by default or disable it by default, and switch that only on releases (otherwise, indicate in ninja --version in some way, but beware, some tools parse that, too...).

Like, users expect that such a basic tool as ninja (and make, etc.) act based upon which version (perhaps flavor, e.g. nmake, bmake, etc.), input files, command line arguments and environment are there. Some hidden state like that makes it much harder to find out what is going on.

@htot

htot commented Nov 1, 2024

I seem to be completely missing the "job server" point here. What exactly is that? And why would the whole tool chain need to support pipes?

In my use case bitbake starts multiple build jobs in parallel (and 1000s in series). Some are ninja, some are make. ninja and make start multiple compilers, linkers, sub-make and sub-ninja's and what not in parallel.

All I need to do is have bitbake create a named pipe (or a fifo when make did not yet support named pipes) and have bitbake start each build with the correct path to the named pipe (by making sure it is in the MAKEFLAGS env var and make/ninja call is intercepted see).

Nothing more is needed. On my 8 core + ht that will start no more than 16 compilers in parallel and restrict memory use to about 16GB.
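
The setup described above (create a named pipe, fill it with tokens, point MAKEFLAGS at it) could look roughly like the following Python sketch. This is not the bitbake patch: `create_fifo_pool` is a hypothetical name, the `+` token byte follows GNU Make's convention, and opening the fifo with `O_RDWR` so that the open does not block relies on Linux behavior.

```python
import os
import tempfile


def create_fifo_pool(job_count: int):
    """Create a fifo holding job_count - 1 tokens; return (path, fd, makeflags)."""
    directory = tempfile.mkdtemp(prefix="jobserver-")
    path = os.path.join(directory, "pool.fifo")
    os.mkfifo(path)
    # Keeping this descriptor open keeps the tokens alive between clients.
    fd = os.open(path, os.O_RDWR)
    # One slot is implicit in each launched build, hence job_count - 1 tokens.
    os.write(fd, b"+" * (job_count - 1))
    return path, fd, f"-j{job_count} --jobserver-auth=fifo:{path}"
```

With 16 job slots, as in the 8 core + ht example, every build started with the returned MAKEFLAGS value competes for the same 15 tokens.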

What is the build server here? The named pipe? bitbake that creates the named pipe?

The only thing I need for this is "the same what make does", so at this point ninja taking a token from the pipe.

All this is not theory, I use a patched ninja and bitbake for this for over a year and it builds 1000's of packages.

@ArsenArsen

what you describe bitbake doing is exactly it acting as a jobserver server (see https://www.gnu.org/software/make/manual/html_node/Job-Slots.html )

ideally, ninja should be able to do that setup independently also

And why would the whole tool chain need to support pipes?

see https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html - so that compilers etc can limit their own job pools together with whatever invoked them

@eli-schwartz

What is the build server here? The named pipe? bitbake that creates the named pipe?

The jobserver is the thing that creates the named pipe. For you, that is bitbake.

The only thing I need for this is "the same what make does", so at this point ninja taking a token from the pipe.

... but for most people, that is make! Make is the thing that creates the named pipe, because when users download a program to their home directory and cd to it and run ./configure && make, the compiler is also threaded and needs to know how many jobs it can get from make.

And the same applies to projects where you run meson setup builddir && ninja -C builddir. Or cmake -GNinja && ninja. There is no bitbake involved, so users using ninja need some way to get a jobserver too.

I'm very happy for you that as a bitbake user, you don't need a jobserver! Unfortunately, most people aren't using bitbake because there are hundreds of thousands of use cases for building software and only one of them is bitbake. The rest of us need a solution too. :)

@htot

htot commented Nov 1, 2024

Exactly! If I can create a small patch in bitbake (which is python code), to create a named pipe and fill it with the correct number of tokens, that is trivial code! Afaik make does the same. Jobserver is a very big word for that. Why remove that code? Why even spend more words on removing it than on the code itself?

As long as the jobserver is only used for the top make and automatically not for sub-makes (which it does) I'm happy.

I would expect exactly the same behavior for ninja. Unless people here prefer to have ninja called as a sub-ninja from a make jobserver.

BTW most make and ninja users likely use one of both to build possibly multiple tools in sub-projects under their own control. OTOH there are maybe just a few build systems, like bitbake, buildroot and whatever debian uses. But they build lots of packages. The importance of a single job pool to coordinate the number of parallel builds is very much higher for these machines. There's a scaling problem. If you have u users building simultaneously, m packages and n cores (+ht) you need about u x m x n x 1GB of memory. With coordination only n x 1GB. (Well actually if you have multiple linkers running doing LTO you need crazy amounts of memory per linker, I had to separately restrict that for building nodejs or otherwise it spawns five 5GB link jobs simultaneously, while bitbake does not only build the target image but also the cross-compiler, so nodejs is built twice. And as that sort of thing goes the 2 slowest things always end up running simultaneously at the end of the image build, causing hours of disk thrashing).
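
To make the scaling estimate above concrete, here is the same arithmetic as a tiny Python helper. The function name and the 1 GB per job default are illustrative only, and the coordinated case follows the comment's own figure of n x 1GB.

```python
def peak_memory_gb(users: int, packages: int, cores: int,
                   gb_per_job: float = 1.0, coordinated: bool = False) -> float:
    """Rough peak memory estimate for parallel package builds."""
    if coordinated:
        # A shared job pool caps the number of running jobs at the core count.
        return cores * gb_per_job
    # Without coordination, every package build spawns one job per core.
    return users * packages * cores * gb_per_job
```

For one user building 10 packages on 16 hardware threads, this gives 160 GB without coordination and 16 GB with it.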

@htot

htot commented Nov 1, 2024

the compiler is also threaded and needs to know how many jobs it can get from make.

I really don't know if the compiler can compile multiple sources from each thread (I don't know how many threads it has). What I see is make spawning multiple compiler processes. And this makes sense too because the compiler is run each time with a command line appropriate for that source file.

Like I said, the problem is not really due to the scheduling of the massive amount of jobs (I measured, it has negligible effect on the performance), the problem is due to the amount of memory allocated for each compiler process and if there is not enough the resulting disk thrashing.

@kepstin

kepstin commented Nov 2, 2024

I really don't know if the compiler can compile multiple sources from each thread (I don't know how many threads it has). What I see is make spawning multiple compiler processes. And this makes sense too because the compiler is run each time with a command line appropriate for that source file.

At the moment, I know of 2 compilers which support parallelism within the compiler, and also support being clients of a gnu make-compatible job server to limit parallelism. They are:

  1. GCC, but only when doing LTO builds. A portion of the optimization/compilation is deferred to the linking step - where the build system has run only one "link" command - but during the link step, GCC will start multiple sub-processes to do the compilation in parallel. Using the jobserver helps limit max memory usage when running multiple LTO link commands in parallel.
  2. rustc. It builds a whole "package" at a time with one command that can read many source files, but it can split the compilation into multiple codegen units using threads. It uses the jobserver to control how many threads to start.

@digit-google
Contributor Author

I just pushed a new version of the PR where:

  • Client mode is now disabled for dry-run mode, or if any job count is passed on the command-line.
    Note that, unlike Make, Ninja does not support -j (without a job count).

  • A warning is printed on Posix when the pipe mode is detected, encouraging the use of a fifo pool instead.

  • Manual and regression tests were updated accordingly.
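
The rules in this update can be restated as a short Python sketch. The function names are hypothetical and the warning text is invented for illustration; the PR's actual message may differ.

```python
from typing import Optional


def use_client_mode(pool_mode: str, dry_run: bool,
                    explicit_jobs: Optional[int]) -> bool:
    """Client mode needs a pool in MAKEFLAGS, no dry-run, and no -jN at all."""
    return pool_mode != "none" and not dry_run and explicit_jobs is None


def pipe_mode_warning(pool_mode: str) -> Optional[str]:
    """Return a (made-up) warning for pipe pools, or None for other modes."""
    if pool_mode == "pipe":
        return "warning: pipe-based jobserver detected, prefer a fifo pool"
    return None
```

So `use_client_mode("fifo", False, None)` is true, while passing any job count, even `-j42`, turns client mode off.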

Let me know if there is something missing to get this submitted!

@ArsenArsen

I do not see the server mode again - as mentioned above, that is very significant for many people

@digit-google
Contributor Author

As discussed in previous comments, client support will be landed first, and we'll continue discussing the best way to implement server support after that. You can look at my branch at https://github.com/digit-google/ninja/tree/jobserver-pool to see where the server code is for now.

@nanonyme

As discussed in previous comments, client support will be landed first, and we'll continue discussing the best way to implement server support after that. You can look at my branch at https://github.com/digit-google/ninja/tree/jobserver-pool to see where the server code is for now.

Since you appear to be following @jhasse's suggestion to split this into separate client and server PRs, you probably need to update the title and description to make it clear that this is part 1, which covers client mode. That makes it clearer that server mode is not abandoned, just scoped out of the initial PR.

@digit-google digit-google changed the title Implement GNU Make jobserver protocol support in Ninja (client + server modes) Implement GNU Make jobserver client protocol support in Ninja Nov 11, 2024
@digit-google
Contributor Author

@nanonyme: Good points, done!

This implements a GNU jobserver token pool that will be used
for testing the upcoming jobserver Ninja client implementation.

Note that the implementation is basic and doesn't try to deal
with broken protocol clients (which release more tokens than
they acquired). Supporting them would require something vastly
more complex that would monitor the state of the pipe/fifo
at all times.
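
For readers unfamiliar with the protocol, the token pool described in this commit message can be sketched roughly as follows. This is not the code of misc/jobserver_pool.py: the function name `make_fifo_pool` and the fifo file name are hypothetical, and the sketch assumes a Posix system where opening a fifo with `O_RDWR` does not block (this holds on Linux). A pool for N jobs holds N-1 tokens, because each client already owns one implicit slot.

```python
import os


def make_fifo_pool(job_count, directory):
    """Create a fifo-based token pool (illustrative sketch only).

    Returns (fifo_path, fd, makeflags). The caller keeps `fd` open for
    as long as the pool should live, then closes it and removes the fifo.
    """
    # Hypothetical file name, chosen for this example only.
    path = os.path.join(directory, "jobserver_fifo")
    os.mkfifo(path)
    # Opening read-write avoids blocking in open() and keeps the tokens
    # alive even when no client currently has the fifo open.
    fd = os.open(path, os.O_RDWR)
    # N jobs means N-1 tokens: every client owns one implicit slot.
    os.write(fd, b"+" * (job_count - 1))
    makeflags = f" -j{job_count} --jobserver-auth=fifo:{path}"
    return path, fd, makeflags
```

A client launched with `MAKEFLAGS` set to the returned string would then find the fifo path in `--jobserver-auth`. As the commit message notes, a pool this simple cannot detect clients that release more tokens than they acquired.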

This adds two new classes related to GNU jobserver support
and related unit-tests:

`Jobserver::Slot` models a single job slot, which includes both
the "implicit" slot value assigned to each process spawned
by Make (or the top-level pool implementation), as well as
"explicit" values that come from the Posix pipe, or Win32
semaphore decrements.

`Jobserver::Config` models the Jobserver pool implementation
to use based on the value of the `MAKEFLAGS` environment
variable.
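
To illustrate what selecting a pool implementation from `MAKEFLAGS` involves, here is a minimal sketch. It is not the `Jobserver::Config` code from this PR: `JobserverConfig` and `parse_makeflags` are hypothetical names. The sketch assumes that the last `--jobserver-auth` or `--jobserver-fds` option wins, that negative descriptors mean no usable pool, and that any value that is neither `fifo:PATH` nor `R,W` is a Win32 semaphore name.

```python
from dataclasses import dataclass


@dataclass
class JobserverConfig:
    mode: str = "none"  # one of "none", "pipe", "fifo", "semaphore"
    path: str = ""      # fifo path or semaphore name
    read_fd: int = -1
    write_fd: int = -1


def parse_makeflags(makeflags):
    """Extract the jobserver pool description from a MAKEFLAGS string."""
    config = JobserverConfig()
    for arg in makeflags.split():
        for prefix in ("--jobserver-auth=", "--jobserver-fds="):
            if not arg.startswith(prefix):
                continue
            value = arg[len(prefix):]
            if value.startswith("fifo:"):
                config = JobserverConfig("fifo", value[len("fifo:"):])
                continue
            parts = value.split(",")
            try:
                read_fd, write_fd = (int(p) for p in parts)
            except ValueError:
                # Assumption: not "R,W", so treat it as a semaphore name.
                config = JobserverConfig("semaphore", value)
                continue
            if read_fd < 0 or write_fd < 0:
                # Assumption: negative descriptors mean "no pool".
                config = JobserverConfig()
            else:
                config = JobserverConfig("pipe", "", read_fd, write_fd)
    return config
```

For example, `parse_makeflags(" -j8 --jobserver-auth=fifo:/tmp/pool")` yields a config whose `mode` is `"fifo"` and whose `path` is `"/tmp/pool"`.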

This adds a new interface class for jobserver clients,
providing a way to acquire and release job slots easily.

Creating a concrete instance takes a Jobserver::Config as
argument, which is used to pick the appropriate implementation
and initialize it.

This commit includes both Posix and Win32 implementations.
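
The acquire/release behaviour, including the difference between the implicit slot and explicit tokens, might look like the following on Posix. This is a sketch under assumptions and not the client class from this PR: `FifoClient`, `IMPLICIT`, `try_acquire` and `release` are hypothetical names, only the fifo mode is shown, and it relies on opening the fifo with `O_RDWR | O_NONBLOCK`, which works on Linux.

```python
import os

# Marker for the slot every client owns without reading a token.
IMPLICIT = "implicit"


class FifoClient:
    """Non-blocking jobserver client for a fifo pool (illustrative only)."""

    def __init__(self, fifo_path):
        self._fd = os.open(fifo_path, os.O_RDWR | os.O_NONBLOCK)
        self._implicit_free = True

    def try_acquire(self):
        """Return a slot, or None if no slot is available right now."""
        # Hand out the implicit slot first: it costs no pool token.
        if self._implicit_free:
            self._implicit_free = False
            return IMPLICIT
        try:
            token = os.read(self._fd, 1)
        except BlockingIOError:
            return None  # the pool is currently empty
        return token or None

    def release(self, slot):
        """Give a slot back once the corresponding command has finished."""
        if slot == IMPLICIT:
            self._implicit_free = True
        else:
            # Write back the same byte that was read from the pool.
            os.write(self._fd, slot)

    def close(self):
        os.close(self._fd)
```

A build loop would call `try_acquire()` before starting each command and `release()` when the command completes, so a client never runs more commands than the slots it holds.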

@digit-google digit-google force-pushed the jobserver branch 2 times, most recently from 013791c to 981bff4 Compare November 11, 2024 13:08

Detect that the environment variable MAKEFLAGS specifies a
jobserver pool to use, and automatically use it to control
build parallelism when this is the case.

This is disabled if `--dry-run` or an explicit `-j<COUNT>`
is passed on the command-line. Note that the `-l` option
used to limit dispatch based on the overall load factor
will still be in effect if used.

+ Use default member initialization for BuildConfig struct.

+ Add a new regression test suite that uses the
  misc/jobserver_pool.py script that was introduced in
  a previous commit, to verify that everything works
  properly.
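
The rules for enabling client mode stated in this thread (disabled for dry runs or when a job count is given, plus a warning for the Posix pipe mode) can be summarised in a few lines. This is only a sketch: `select_jobserver_mode` is a hypothetical function, the warning wording is invented for the example, and the real logic lives in Ninja's C++ sources.

```python
from typing import Optional, Tuple


def select_jobserver_mode(
    pool_mode: str, dry_run: bool, explicit_jobs: Optional[int]
) -> Tuple[bool, Optional[str]]:
    """Return (use_jobserver, warning) for the given invocation.

    pool_mode is "none", "pipe", "fifo" or "semaphore", as detected
    from MAKEFLAGS. explicit_jobs is the -j<COUNT> value, or None.
    """
    if pool_mode == "none" or dry_run or explicit_jobs is not None:
        return False, None
    warning = None
    if pool_mode == "pipe":
        # Illustrative wording, not the message Ninja actually prints.
        warning = "pipe-based jobserver detected; a fifo pool is recommended"
    return True, warning
```

The `-l` load limit is deliberately absent from the sketch, because the commit message says it stays in effect either way.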

@digit-google
Contributor Author

Regarding the last pushes, I had to modify the parallelism of the jobserver test from 10 to 4 parallel tasks, as it failed on CI builders, which probably run on low-powered VMs.

@nanonyme

For the record, I think even the client-mode alone will already be highly useful for the use case when there is a build orchestrator which runs multiple builds in parallel and limits overall CPU usage by mounting the fifo into multiple isolated build sandboxes. Looking forward to this landing.

@chriselrod

chriselrod commented Nov 14, 2024

> For the record, I think even the client-mode alone will already be highly useful for the use case when there is a build orchestrator which runs multiple builds in parallel and limits overall CPU usage by mounting the fifo into multiple isolated build sandboxes. Looking forward to this landing.

I'm using cmake as my build system for a library with a bunch of feature-level #ifdefs (e.g., different code paths for avx512f, avx2, generic, etc), so I use make to configure and run different builds with the different feature levels, and using gcc vs clang. When using ninja as the build system, I need to make -j1 on low core count systems. I'm looking forward to being able to make -j(nproc).

@nanonyme

nanonyme commented Nov 14, 2024

> For the record, I think even the client-mode alone will already be highly useful for the use case when there is a build orchestrator which runs multiple builds in parallel and limits overall CPU usage by mounting the fifo into multiple isolated build sandboxes. Looking forward to this landing.
>
> I'm using cmake as my build system for a library with a bunch of feature-level #ifdefs (e.g., different code paths for avx512f, avx2, generic, etc), so I use make to configure and run different builds with the different feature levels, and using gcc vs clang. When using ninja as the build system, I need to make -j1 on low core count systems. I'm looking forward to being able to make -j(nproc).

Sure. It came up here but was not stated explicitly: there are users who want to build multiple autotools, cmake and meson projects in parallel through build orchestration tooling. For this to work without overloading the system, all make and ninja processes must be in jobserver client mode.

@jhasse jhasse added this to the 1.13.0 milestone Nov 15, 2024