Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix infinite loop in SubprocessSet::DoWork() #2484

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dendy
Copy link

@dendy dendy commented Aug 28, 2024

In case fd is invalid loop will be infinite because iterator is not incremented.

Make cycle more obvious by adding regular iterator into for() loop and move Subprocess from running_ into finished_ in separate loop.

Copy link
Contributor

@digit-google digit-google left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, this changes the order of calls to OnPipeReady() and Done() called at runtime, making them inconsistent with the !USE_PPOLL implementation. This may bite us in the future when refactoring this code.

Can you instead just add the missing increment in the if (fd < 0) continue statement (e.g. if (fd < 0) { ++i; continue }), which will have the same effect with a simpler change.

In case fd is invalid loop will be infinite because iterator is not
incremented.

Make loop consistent between both variants of SubprocessSet::DoWork()
(with and withoug USE_PPOLL) by removing 'continue' and always advancing
iterator in the end of the cycle.
@dendy dendy force-pushed the fix-infinite-loop-in-SubprocessSet-DoWork branch from ad18d33 to a6da182 Compare August 28, 2024 16:42
@digit-google
Copy link
Contributor

You are making the code more complicated for now good reason. Please simplify as suggested instead. @jhasse, wdyt?

@dendy
Copy link
Author

dendy commented Aug 28, 2024

Code should be more clear now, since there is no continue statement, and consistent between two variants of DoWork(). I prepared it in this way to simplify the subsequent patch for stderr implementation, which adds handling of stderr fd to the loops.

@digit-google
Copy link
Contributor

No. You are changing the behavior of the code by calling the Done() method in cases where it wasn't called before. This is wasteful and potentially dangerous.

This is not what this patch is supposed to do, which would be to avoid fixing a potential infinite loop, and which can be trivially done by adding a missing increment in one line of source file.

Please keep it simple.

@dendy
Copy link
Author

dendy commented Aug 28, 2024

To be honest, I do not see how this condition should be met in first place:

if (fd < 0) continue;

I believe that is a leftover from some older implementation. And correct fix would be to replace this condition with assert. As long as Subprocess in in the running_ list its fd_ must always be valid. Please correct me if I am wrong, because I cannot see where else it might be changing outside of OnPipeReady().

If the above is true, then behavior was not changed, and Done() call will return false when OnPipeReady() was not called.

The only reason I moved Done() to the end of the loop, is to prepare for the subsequent addition of the second error fd to handle. If you prefer not to do that in within this patch, then I believe the correct fix for this would be to replace the condition above with:

assert(fd >= 0)

@dendy
Copy link
Author

dendy commented Aug 28, 2024

To clarify, when second error fd will be added, Subprocess will stay in the running_ list until both fds are cleared. And the condition check above in runtime makes sense. But with a single fd the condition will always fail. That is why you never see ninja to freeze in the current implementation, because fd is always valid.

@digit-google
Copy link
Contributor

First, do not sneak unrelated behavior-changing code into a patch that is supposed to address a trivial issue. Not only this is a very easy way to introduce subtle bugs, or worse security issues, but your are also wasting time for you and reviewers, as well as lower the chances that your future contributions will be considered of appropriate quality. That being said, mistakes happen.

Second, looking at the code, it looks like that given the current implementation, the fd_ value would always be valid when these loop run. However, these details are pretty subtle, as well as the fact that the DoWork() function mixes low-level file event handling and management of the running_ and finished_ sub-process queues. It is easy to imagine an unrelated change elsewhere in this source file that would trigger the bug flakily, so fixing it has value.

assert() statements are no-ops in production code, so would not solve a potential infinite loop issue. It is better to call Fatal() if the condition is detected, with a hint explaining what's going on instead, or simply ignore such file descriptors in the loop.

If you intend to modify the behavior of this function in non-trivial way, I suggest you do that in a dedicated PR that clearly explain the rationale for it, and hopefully provides separate commits to address changes in a logical order. And don't forget unit or regression tests for the new feature / behavior.

Regarding the topic of supporting two separate event descriptors per sub-process, I would recommend first decoupling the file event wait logic from the queue management one (e.g. provide a PosixFileWatcher class that provides clients the ability to register callbacks to be invoked when an event is detected, with the ability to add / remove registrations within the callback function itself for safety). That will make the code much clearer to understand and maintain over time, while allowing unit-testing for different types of conditions that are not possible with the current code.

Apart from that, I recommend you either drop this PR completely, or just fix the actual bug by adding a missing increment.

@dendy
Copy link
Author

dendy commented Aug 28, 2024

There are no behavior changes within this patch.

I believe the biggest confusion is in title, it should rather say "Cleanup" than "Fix", because there was no runtime issue within the current code. I already put that description into the commit body, but I would also change the commit title to make it clear.

Justification for this code cleanup is to prepare the DoWork() implementation for the 2 fds handling simultaneously.

I believe current implementation of the DoWork() is fine with regards to multiple fd event handling, and extending it with 2 fds support is trivial, here is how such a change will look after this patch: f8ce671#diff-e70adfab80664f71832232d71b3055c78e054dd75fbeec69d019cb0e9dce6b2dL283

Since fd_ is supposed to be always valid inside the running_ list, assert() should be the correct way of handling it. Fatal() is for catching runtime issues, assert() is for source code bugs.

I will make appropriate changes, and will rephrase this PR description.

@mcprat
Copy link
Contributor

mcprat commented Aug 29, 2024

the explanation doesn't need to be so complex

the solution is to add instances of ++i wherever they are missing

blocking of error conditions is preferred over the acceptance of non-error conditions, because otherwise the rest of the function is subject to fallthrough conditions

Comment on lines 256 to 257
for (vector<Subprocess*>::iterator i = running_.begin();
i != running_.end(); ++i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by the way this one has the increment anyway

Comment on lines 334 to 335
if (fd >= 0 && FD_ISSET(fd, &set)) {
(*i)->OnPipeReady();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is existing already, changes below have nothing to do with the fd value...

if an infinite loop is still possible after the fd value is handled then continue should cause increment of i, like how a normal for loop is written...

are we sure that omitting the increment in the loop itself isn't the actual mistake and all you need to do is put it back?

@dendy
Copy link
Author

dendy commented Aug 29, 2024

by the way this one has the increment anyway

In that loop running_ list is immutable, thus it increments iterator with ++i on each cycle.

The code we are talking about is unreachable:
if (fd < 0) continue;

This "fixup" patch was originally created in 2015, at that time I assumed that there may be more conditions when Subprocess fd_ is closed, but Subprocess itself may be still running. That thought also matched the feature I was working on: adding second error_fd_ for tracking the error output, which could be still open.

By checking the code now I may claim that by design there is no condition when running_ list may contain Subprocess, which Done() == true. There is also a test case that validates this: SubprocessTest.SetWithMulti. Thus that check for the if (fd < 0) is unreachable and dead, and need to be removed or replaced with an assert().

@mcprat
Copy link
Contributor

mcprat commented Aug 29, 2024

by the way this one has the increment anyway

In that loop running_ list is immutable

doesn't matter, the specific loop i'm referring to only reads from running_

By checking the code now I may claim that by design there is no condition when running_ list may contain Subprocess, which Done() == true. There is also a test case that validates this: SubprocessTest.SetWithMulti. Thus that check for the if (fd < 0) is unreachable and dead, and need to be removed or replaced with an assert().

This is likely true in real-life cases, but perhaps the reason that the fd check was put there to begin with was for the crazy unrealistic scenario of the system running out of available FDs to open. Without a supercomputer it would take a patched kernel to verify that.

@dendy
Copy link
Author

dendy commented Aug 29, 2024

crazy unrealistic scenario of the system running out of available FDs to open

There is validation for that condition in code in 2 places: pipe() syscall in Subprocess::Start() and SubprocessTest.SetWithMulti. So it is still not possible to have undone Subprocess in the running_ list.

@mcprat
Copy link
Contributor

mcprat commented Aug 29, 2024

simply removing the fd checks from the loops is likely acceptable then

@jhasse
Copy link
Collaborator

jhasse commented Aug 30, 2024

puh, sorry just skimmed over most of it. When exactly is this infinite loop triggered?

@dendy
Copy link
Author

dendy commented Aug 30, 2024

When exactly is this infinite loop triggered?

It is never triggered with current implementation, and per my understanding that is by design that running_ list may contain finished Subprocess instances. Thus this is not a fix, but rather cleanup of the old dead code. I will change the description and update the commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants