Implement GNU Make jobserver client protocol support in Ninja #2506
Some minor comments on the code; I haven't actually tried your branch.
Force-pushed from 2e12658 to d80c054.
To follow this important PR.
Force-pushed from d80c054 to be0ea05.
Does this work together with `-flto=auto`, such that Ninja limits the LTO threads when working as a job server?
For gcc, it should, yes. The gcc LTO wrapper basically emits a makefile and calls make on it to run the ltrans jobs, and that sub-make will take jobserver tokens from the Ninja jobserver in this PR. For clang, there is no equivalent: no part of the ThinLTO process or lld itself takes jobserver tokens (but maybe I missed something); that would require implementing the protocol (reading env vars, having lld and perhaps also the linker plugin used in other linkers spawn threads only based on grabbed tokens, ...).
Just to report: I'm developing laze, a build system calling out to Ninja for the actual building. One of laze's core features is its native support for large build matrices of mostly similar build configurations (e.g., build something for dozens to hundreds of slightly different embedded devices). In one project, there's a test application that calls out to CMake for building a library, and CMake calls Ninja recursively. So when building this application for 10 boards at the same time (one laze command resulting in one Ninja call), previously this would call 10 cmake+Ninja pairs simultaneously on my laptop, and each sub-Ninja would run up to 10 gcc instances. While there's enough RAM so this actually finishes, my laptop becomes unresponsive, typing gets laggy, ... Using the Ninja binary as built by this PR's CI output, I gave this a hyperfine run to see if total build time is affected, and it seems like the jobserver version is slightly slower, within the margin of error:
(but the jobserver version did not make typing here laggy ... 🙂)
Using
Also, setting With
I gave this a test on the VLC contrib build, which builds more than a hundred libraries in parallel using autotools (make), CMake (Ninja) and Meson (Ninja). The maximum number of threads seems to be respected on my local machine. In CI, things are working properly as well on Debian and Ubuntu. On one machine it logs:
On the other:
This sounds like an inconvenient limitation. Why couldn't ninja act as a jobserver via the environment variable when an explicit -jN is set?
Thanks, the reason why
Besides that, the
I am going to get rid of the problem by making

Apart from that, @nanonyme is correct that this was to avoid child processes becoming jobservers themselves by accident. However, this can be solved by ensuring that

Quick question regarding behavior. Currently:
If anyone thinks this is not reasonable or would create a problem for their workflow, let me know. I selected these conditions on a hunch, since I never use these myself (well, except
Force-pushed from 4c73cd6 to 146c55d.
Is there any chance that jobserver mode could be made "automatic" by default? I.e. act as a jobserver client if a jobserver is available from the environment, otherwise (if

This would simplify the use of ninja quite a bit - you don't have to worry about remembering to set an extra ninja command-line parameter or environment variable to ensure that recursive builds, rust, gcc LTO, etc. parallelize properly.
Hm ... for automatic detection of the client: Should this be done by checking MAKEFLAGS? Some more general comments about this PR:
doc/manual.asciidoc (Outdated)

- Dry-run (i.e. `-n` or `--dry-run`) is not enabled.
- `-j1` (no parallelism) is not used on the command line.
hm ... why though?
Because `-j1` is how you tell Ninja you do not want to launch parallel jobs at all. So using it in the current implementation disables both client and pool mode at the same time (it doesn't make sense to use a pool of one job slot).

Similarly, `-j0` means "infinite parallelism", so it disables pool mode, but not client mode.

I can change that if you prefer, but I believe these are sane defaults. I clarified this behavior in the documentation though.
But as a client, couldn't I still be limited by the jobserver? I.e. waiting on a token to become available. But I guess that isn't supported yet, as nothing will wake up ninja's main loop in that case?
The possible behaviors are the following when a jobserver pool is in place in `MAKEFLAGS`:

A) `-j1` disables parallel tasks and ignores the pool.
B) `-j2` and above (also `-j0`) ignore the command-line job count, and use the pool to control parallelism.
C) Without a command-line `-j` parameter, use the pool.
D) `-j1` is ignored; Ninja uses the pool to control parallelism.
E) `-j2` and above (also `-j0`) ignore the pool and use the command-line job count instead.

The current PR implements A) and B), because `-j1` is the only way to tell Ninja that we do not want parallelism, and the only reason to do that would be for debugging a build or because the system is very constrained (e.g. not enough RAM). Hence, ignoring it if a jobserver pool is in place seems unhelpful.

The reason B) exists (explicit job counts are ignored when the pool is in place) is that many build scripts invoke Ninja with an explicit count, oblivious to the fact that a jobserver is in place. Doing this allows everything to work transparently without modifying tons of configuration files or scripts in complex multi-build systems.

In theory, you can set up a pool with a depth of 1, but since each client receives an implicit job slot when it starts, `-j1` results in exactly the same behavior whether A) or D) is implemented.
A) sounds pretty reasonable; but I think B) should generate a warning.
FWIW, there is already an `Info()` message printed in this case ("jobserver detected ..."), unless `--quiet` is specified.
I think I favor C (if I understand it correctly that it means: with `-j` always ignore the pool).

- It's simple. Less code, less documentation.
- It's intuitive: When I specify a job count, I would want ninja to respect that and not override the behavior based on an environment variable.
- We don't change behavior with an update. ninja 1.12 uses 42 jobs if you pass `-j42`, ninja 1.13 should too.
- We do change the behavior without `-j` - but that currently already means "auto detect" and I think it's expectable for that auto-detect mechanism to improve in an update.
- People might currently use `-j<some low number>` as a workaround to calm ninja because it doesn't respect the jobserver. We shouldn't encourage them to keep a workaround in, so encourage the removal.
Force-pushed from d837d97 to be139bc.
To answer @kepstin's question: client mode is already set up automatically by inspecting the `MAKEFLAGS` environment variable.

On the other hand, starting the pool in Ninja automatically seems risky to me, because it can change much more than Ninja's behavior (all sub-commands as well). For example, there are rare cases (unbalanced builds) where this will result in a slower build overall (e.g. as described here). That's why I believe the

I know @jhasse dislikes environment variables, but I believe that this is one of the rare cases where the benefits outweigh the annoyances.
To answer @jhasse's latest comment now: client mode is already started automatically by looking at `MAKEFLAGS`.

Adding a special syntax to the Ninja build plan to indicate that specific actions support the protocol is doable, but it would mean the feature won't be usable until all Ninja generators support it, which may be a veeeeery long time. I don't think it's really useful here. That might be useful to explicitly disable it for certain commands (though on Posix one can simply start the command with

For Posix, I think it is far preferable to use

I imagine we could switch to
This is why ninja should support a dedicated syntax to run a build rule with specific environment variables set -- trivial on POSIX, annoying on Win32, so it would require invoking processes two different ways depending on whether the build plan has demanded an environment variable, iirc. Not all ninja generators permit specifying arbitrary shell command syntax for build rules, for precisely the reason of predictable cross-platform behavior.
That's not because GNU Make 4.4 is brand new (it's 2 years old actually), but because Debian is such a bad distribution. If we wanted to be pragmatic we could have merged the first PR 8 years ago.
Well, just like build systems, there are only two types of Linux distributions: those that people complain about, and those that nobody uses ;-)
People: the world is bigger than just your distribution's packages. The demand for jobserver support becomes relevant when you are using a build system that builds many different packages using combinations of make, ninja and possibly other methods, like Yocto's

Now Yocto installs its own

I would expect a similar need for a jobserver in

The importance is that if you have a
In any case, I wouldn't expect stable distributions to pick up 1.12.0 or above as an update. The scheduling changes mean that some legacy builds break. So this PR is not going to apply to historic distro releases, but to future ones. In other words, parties who get this are highly likely to have a new enough make.
Odds are that time will start passing, even for Debian, enough

So I'm not sure it's actually advantageous to default to pipe. I'm exporting `NINJA_JOBSERVER=fifo` for my testing.
I am not sure that this is a problem for Ninja to solve, or that "impression" would work at all. On the other hand, only supporting
I think the documentation should always reflect what the code does.
This is even worse imo :( If that code is never tested, it will 100% bit-rot over time, and a new Ninja version will be released that breaks someone's workflows in painful ways (e.g. where getting to the root cause will take a lot of time). Unit-tests are here to detect regressions as soon as possible, before the code is released.
No, because we need it for the existing regression test for client
That on the other hand is an excellent suggestion.
How about a build-time flag on whether to enable by default?
@nanonyme That's a really bad idea, because it makes the behavior of

Like, users expect that such a basic tool as
I seem to be completely missing the "job server" point here. What exactly is that? And why would the whole toolchain need to support pipes? In my use case

All I need to do is have

Nothing more is needed. On my 8 core + HT machine that will start no more than 16 compilers in parallel and restrict memory use to about 16 GB. What is the build server here? The named pipe? The only thing I need for this is "the same as what make does", so at this point

All this is not theory, I use a patched
What you describe bitbake doing is exactly it acting as a jobserver server (see https://www.gnu.org/software/make/manual/html_node/Job-Slots.html). Ideally, ninja should be able to do that setup independently as well.
See https://www.gnu.org/software/make/manual/html_node/POSIX-Jobserver.html - so that compilers etc. can limit their own job pools together with whatever invoked them.
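As an illustration of the protocol linked above, here is a minimal Posix client sketch in Python. This is not code from this PR; the fifo path and helper name are hypothetical, and it assumes a GNU Make 4.4-style fifo pool rather than inherited pipe fds.

```python
import os

# Hypothetical fifo path: a real client extracts it from the
# --jobserver-auth=fifo:<path> option found in the MAKEFLAGS environment variable.
FIFO_PATH = "/tmp/jobserver_fifo"

def run_with_token(job):
    """Acquire one job slot from the pool, run `job`, then return the slot."""
    fd = os.open(FIFO_PATH, os.O_RDWR)
    try:
        token = os.read(fd, 1)   # blocks until a token is available in the pool
        try:
            job()                 # do the parallel work while holding the slot
        finally:
            os.write(fd, token)   # always return the exact token that was taken
    finally:
        os.close(fd)
```

Note that each client also owns one implicit job slot for its first job, so tokens only need to be acquired for the second and subsequent parallel jobs.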
The jobserver is the thing that creates the named pipe. For you, that is bitbake.
... but for most people, that is

And the same applies to projects where you run

I'm very happy for you that as a bitbake user, you don't need a jobserver! Unfortunately, most people aren't using bitbake, because there are hundreds of thousands of use cases for building software and only one of them is bitbake. The rest of us need a solution too. :)
Exactly! If I can create a small patch in

As long as the jobserver is only used for the top

I would expect exactly the same behavior for

BTW most
I really don't know if the compiler can compile multiple sources from each thread (I don't know how many threads it has). What I see is

Like I said, the problem is not really due to the scheduling of the massive amount of jobs (I measured, it has a negligible effect on performance); the problem is due to the amount of memory allocated for each compiler process and, if there is not enough, the resulting disk thrashing.
At the moment, I know of 2 compilers which support parallelism within the compiler, and also support being clients of a GNU Make-compatible jobserver to limit parallelism. They are:
Force-pushed from 34398e7 to 1b327ac.
I just pushed a new version of the PR where:
Let me know if there is something missing to get this submitted!
I do not see the server mode again - as mentioned above, that is very significant for many people.
As discussed in previous comments, client support will be landed first, and we'll continue discussing the best way to implement server support after that. You can look at my branch at https://github.com/digit-google/ninja/tree/jobserver-pool to see where the server code is for now.
Since you appear to be following @jhasse's suggestion to split into separate client and server PRs, you probably need to update the title and description to make it clear that this is part 1, which is client mode. Then it's clearer that server mode is not completely abandoned but just scoped out of the initial PR.
@nanonyme: Good points, done!
This implements a GNU jobserver token pool that will be used for testing the upcoming jobserver Ninja client implementation. Note that the implementation is basic and doesn't try to deal with broken protocol clients (which release more tokens than they acquired). Supporting them would require something vastly more complex that would monitor the state of the pipe/fifo at all times.
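To make the overall shape of such a standalone pool concrete, here is a simplified fifo-based sketch in Python. It is not the actual misc/jobserver_pool.py code; the helper name and the exact MAKEFLAGS formatting are illustrative only.

```python
import os
import subprocess
import sys
import tempfile

def run_under_pool(num_slots, command):
    """Create a fifo pre-filled with num_slots - 1 tokens and run `command` under it."""
    fifo_path = os.path.join(tempfile.mkdtemp(), "jobserver")
    os.mkfifo(fifo_path)
    # Open read+write so the fifo never blocks or reports EOF while idle.
    fd = os.open(fifo_path, os.O_RDWR)
    try:
        # One slot is implicit in every client, so only N-1 tokens go into the pipe.
        os.write(fd, b"+" * (num_slots - 1))
        env = dict(os.environ)
        # Advertise the pool to the child process (and its own children).
        env["MAKEFLAGS"] = f"-j{num_slots} --jobserver-auth=fifo:{fifo_path}"
        return subprocess.call(command, env=env)
    finally:
        os.close(fd)
        os.unlink(fifo_path)

if __name__ == "__main__":
    # Example: python3 pool_sketch.py 8 ninja -C out/default
    sys.exit(run_under_pool(int(sys.argv[1]), sys.argv[2:]))
```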
This adds two new classes related to GNU jobserver support and related unit-tests: `Jobserver::Slot` models a single job slot, which includes both the "implicit" slot value assigned to each process spawned by Make (or the top-level pool implementation), as well as "explicit" values that come from the Posix pipe, or Win32 semaphore decrements. `Jobserver::Config` models the Jobserver pool implementation to use based on the value of the `MAKEFLAGS` environment variable.
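For illustration, the `MAKEFLAGS` detection described here boils down to something like the following Python sketch of the idea; the PR's actual implementation is the C++ `Jobserver::Config` class, and the return format below is purely hypothetical.

```python
import os
import re

def parse_jobserver_config(makeflags=None):
    """Return ('fifo', path), ('pipe', (read_fd, write_fd)), ('sem', name) or None."""
    if makeflags is None:
        makeflags = os.environ.get("MAKEFLAGS", "")
    # GNU Make 4.4+ style on Posix: --jobserver-auth=fifo:<path>
    m = re.search(r"--jobserver-auth=fifo:(\S+)", makeflags)
    if m:
        return ("fifo", m.group(1))
    # Older style: --jobserver-auth=<read_fd>,<write_fd> or --jobserver-fds=<r>,<w>
    m = re.search(r"--jobserver-(?:auth|fds)=(\d+),(\d+)", makeflags)
    if m:
        return ("pipe", (int(m.group(1)), int(m.group(2))))
    # Anything else after --jobserver-auth= is treated as a Win32 semaphore name.
    m = re.search(r"--jobserver-auth=(\S+)", makeflags)
    if m:
        return ("sem", m.group(1))
    return None
```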
This adds a new interface class for jobserver clients, providing a way to acquire and release job slots easily. Creating a concrete instance takes a Jobserver::Config as argument, which is used to pick the appropriate implementation and initialize it. This commit includes both Posix and Win32 implementations.
Force-pushed from 013791c to 981bff4.
Detect that the environment variable MAKEFLAGS specifies a jobserver pool to use, and automatically use it to control build parallelism when this is the case. This is disabled if `--dry-run` or an explicit `-j<COUNT>` is passed on the command line. Note that the `-l` option used to limit dispatch based on the overall load factor will still be in effect if used.

+ Use default member initialization for the BuildConfig struct.
+ Add a new regression test suite that uses the misc/jobserver_pool.py script introduced in a previous commit, to verify that everything works properly.
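Expressed as pseudo-logic, the enabling condition described in this commit message is roughly the following sketch (hypothetical helper names, not the PR's actual code):

```python
def use_jobserver_client(config, dry_run, explicit_job_count):
    """Decide whether to take job slots from an inherited jobserver pool.

    `config` is the parsed MAKEFLAGS jobserver configuration (None if no pool
    was advertised); `explicit_job_count` is the -j value, or None if absent.
    """
    if config is None:                   # no --jobserver-auth / --jobserver-fds found
        return False
    if dry_run:                          # -n / --dry-run never spawns commands
        return False
    if explicit_job_count is not None:   # an explicit -j<COUNT> overrides the pool
        return False
    return True
```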
Force-pushed from 981bff4 to 3c347f5.
Regarding the last pushes, I had to modify the parallelism of the jobserver test from 10 to 4 parallel tasks, as it failed on CI builders, which probably run on low-powered VMs.
For the record, I think even client mode alone will already be highly useful for the use case where there is a build orchestrator that runs multiple builds in parallel and limits overall CPU usage by mounting the fifo into multiple isolated build sandboxes. Looking forward to this landing.
I'm using cmake as my build system for a library with a bunch of feature-level
Sure. It just came up here, but it was not said explicitly that there are users who want to build multiple autotools, CMake and Meson projects in parallel through build orchestration tooling. For this to work without overloading the system, all make and ninja processes must be in jobserver client mode.
This PR implements the GNU Make jobserver client protocol in Ninja. This implements all client modes for Posix (`pipe` and `fifo`) and Windows (`semaphore`).

This protocol allows several participating processes to coordinate parallel build tasks / processing work. For example, GNU Make, and now Ninja, use it to control how many parallel commands they dispatch at any given time. The Rust compiler and linker, and some C++ compilers (e.g. Clang has `-flto=jobserver`), use it to control how many parallel threads to run in a single invocation. The protocol is also implemented by the `cargo` Rust tool (but the latter only sets the `CARGO_MAKEFLAGS` environment variable).

Client mode is useful when Ninja is invoked as part of a more complex build that launches several build tasks in parallel (e.g. recursive Make or CMake invocations). In this mode, Ninja detects that `MAKEFLAGS` contains `--jobserver-auth` or `--jobserver-fds` options, and uses the job slot pool to control its own dispatch of parallel build commands. It also passes the `MAKEFLAGS` value to child processes to let them participate in the coordination protocol.

This also includes a new script `misc/jobserver_pool.py` that can be used as a standalone job slot pool implementation, which any client can use directly for testing.

This has been tested on large Fuchsia build plans, with certain build configurations that launch 24 sub-Ninja builds from a top-level Ninja build plan. With remote builders enabled, this reduces the total time from 22 minutes to 12 minutes.
EDIT: (Removed mention of server mode as this has been pushed to a future PR).