Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase thread safety to fix RBatchGenerator tutorials segfaults #13302

Merged
merged 6 commits into from
Sep 8, 2023

Conversation

vepadulano
Copy link
Member

@vepadulano vepadulano commented Jul 23, 2023

Attempting to improve the situation regarding segfaults seen in the RBatchGenerator tutorials. Generally, the situation triggering the segfaults is:

  1. Main process is running Python code
  2. Spawns a C++ thread, which runs RDataFrame code
  3. Main thread will call into the interpreter via cppyy
  4. Data races happen

UPDATE:

The commits were update after discussion with @pcanal. The general strategy is to write the locks further deep in the call chain when possible (e.g. in the context of this PR, TFunction is a higher-level class than TClingMethodInfo, so the locks are better placed in the latter).

@sonatype-lift
Copy link

sonatype-lift bot commented Jul 23, 2023

Sonatype Lift is retiring

Sonatype Lift will be retiring on Sep 12, 2023, with its analysis stopping on Aug 12, 2023. We understand that this news may come as a disappointment, and Sonatype is committed to helping you transition off it seamlessly. If you’d like to retain your data, please export your issues from the web console.
We are extremely grateful and thank you for your support over the years.

📖 Read about the impacts and timeline

@vepadulano vepadulano changed the title [skip-ci] Increase thread safety to fix RBatchGenerator tutorials segfaults Increase thread safety to fix RBatchGenerator tutorials segfaults Jul 23, 2023
@vepadulano
Copy link
Member Author

@phsft-bot build with flags -DCTEST_TEST_EXCLUDE_NONE=ON

@phsft-bot
Copy link
Collaborator

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu2204/nortcxxmod, ROOT-ubuntu2004/python3, mac11/noimt, mac12arm/cxx20, windows10/cxx14 with flags -DCTEST_TEST_EXCLUDE_NONE=ON
How to customize builds

@phsft-bot
Copy link
Collaborator

Build failed on mac11/noimt.
Running on macphsft20.dyndns.cern.ch:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

Failing tests:

@phsft-bot
Copy link
Collaborator

Build failed on windows10/cxx14.
Running on null:C:\build\workspace\root-pullrequests-build
See console output.

Failing tests:

@github-actions
Copy link

github-actions bot commented Jul 23, 2023

Test Results

       10 files         10 suites   1d 17h 12m 25s ⏱️
  2 484 tests      523 ✔️   0 💤 1 961
23 775 runs  19 808 ✔️ 82 💤 3 885

For more details on these failures, see this check.

Results for commit dd3a6cc.

♻️ This comment has been updated with latest results.

@vepadulano
Copy link
Member Author

@phsft-bot build just on mac11/noimt with flags -DCTEST_TEST_EXCLUDE_NONE=ON -DCMAKE_BUILD_TYPE=Debug -DLLVM_BUILD_TYPE=Debug

@phsft-bot
Copy link
Collaborator

Starting build on mac11/noimt with flags -DCTEST_TEST_EXCLUDE_NONE=ON -DCMAKE_BUILD_TYPE=Debug -DLLVM_BUILD_TYPE=Debug
How to customize builds

@vepadulano
Copy link
Member Author

The failures in the new CI only refer to windows, other builds are "green" but the tutorials are not actually run. I need to understand how to force running also these tutorials in the new CI

@Axel-Naumann
Copy link
Member

This absolutely needs @pcanal 's review.

@phsft-bot
Copy link
Collaborator

Build failed on mac11/noimt.
Running on macphsft23.dyndns.cern.ch:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

@vepadulano
Copy link
Member Author

@phsft-bot build just on mac11/noimt with flags -DCTEST_TEST_EXCLUDE_NONE=ON -DCMAKE_BUILD_TYPE=Debug -DLLVM_BUILD_TYPE=Debug

@phsft-bot
Copy link
Collaborator

Starting build on mac11/noimt with flags -DCTEST_TEST_EXCLUDE_NONE=ON -DCMAKE_BUILD_TYPE=Debug -DLLVM_BUILD_TYPE=Debug
How to customize builds

@phsft-bot
Copy link
Collaborator

Build failed on mac11/noimt.
Running on macphsft23.dyndns.cern.ch:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

@vepadulano vepadulano changed the title Increase thread safety to fix RBatchGenerator tutorials segfaults [skip-ci] Increase thread safety to fix RBatchGenerator tutorials segfaults Jul 29, 2023
@vepadulano vepadulano changed the title [skip-ci] Increase thread safety to fix RBatchGenerator tutorials segfaults Increase thread safety to fix RBatchGenerator tutorials segfaults Jul 29, 2023
@vepadulano
Copy link
Member Author

@phsft-bot build with flags -DCTEST_TEST_EXCLUDE_NONE=ON

@phsft-bot
Copy link
Collaborator

Starting build on ROOT-debian10-i386/soversion, ROOT-performance-centos8-multicore/cxx17, ROOT-ubuntu2204/nortcxxmod, ROOT-ubuntu2004/python3, mac11/noimt, mac12arm/cxx20, windows10/cxx14 with flags -DCTEST_TEST_EXCLUDE_NONE=ON
How to customize builds

@phsft-bot
Copy link
Collaborator

Build failed on ROOT-ubuntu2004/python3.
Running on root-ubuntu-2004-1.cern.ch:/home/sftnight/build/workspace/root-pullrequests-build
See console output.

Failing tests:

@phsft-bot
Copy link
Collaborator

Build failed on mac12arm/cxx20.
Running on macphsft26.dyndns.cern.ch:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

Failing tests:

@phsft-bot
Copy link
Collaborator

Starting build on ROOT-performance-centos8-multicore/soversion, ROOT-ubuntu2204/nortcxxmod, ROOT-ubuntu2004/python3, mac11/noimt, mac12arm/cxx20, windows10/default
How to customize builds

@vepadulano
Copy link
Member Author

@pcanal After some discussion with @bellenot we decided to disable the tests on windows for now. Tests on other platforms are passing. Could you give your opinion on these changes?

@phsft-bot
Copy link
Collaborator

@phsft-bot
Copy link
Collaborator

Build failed on mac11/noimt.
Running on macphsft20.dyndns.cern.ch:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

Failing tests:

@phsft-bot
Copy link
Collaborator

Build failed on ROOT-ubuntu2004/python3.
Running on root-ubuntu-2004-3.cern.ch:/home/sftnight/build/workspace/root-pullrequests-build
See console output.

Failing tests:

Copy link
Member

@pcanal pcanal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments

core/meta/src/TFunction.cxx Show resolved Hide resolved
core/metacling/src/TCling.cxx Outdated Show resolved Hide resolved
core/metacling/src/TCling.cxx Outdated Show resolved Hide resolved
core/metacling/src/TClingMethodInfo.cxx Outdated Show resolved Hide resolved
@phsft-bot
Copy link
Collaborator

Starting build on ROOT-performance-centos8-multicore/soversion, ROOT-ubuntu2204/nortcxxmod, ROOT-ubuntu2004/python3, mac11/noimt, mac12arm/cxx20, windows10/default
How to customize builds

@phsft-bot
Copy link
Collaborator

Build failed on ROOT-ubuntu2204/nortcxxmod.
See console output.

@vepadulano
Copy link
Member Author

@phsft-bot build with flags -DCTEST_TEST_EXCLUDE_NONE=ON

@phsft-bot
Copy link
Collaborator

Starting build on ROOT-performance-centos8-multicore/soversion, ROOT-ubuntu2204/nortcxxmod, ROOT-ubuntu2004/python3, mac11/noimt, mac12arm/cxx20, windows10/default with flags -DCTEST_TEST_EXCLUDE_NONE=ON
How to customize builds

@phsft-bot
Copy link
Collaborator

Build failed on ROOT-ubuntu2204/nortcxxmod.
See console output.

vepadulano and others added 4 commits September 1, 2023 15:47
TFunction is a higher-level class, it should not lock directly. As a
general rule, thread-safety should be ensured by the interfaces used.

At the same time, if a method of TFunction needs to call multiple,
different locking functions in its body, a single call to R__LOCKGUARD
is better than multiple separate calls happening in the other functions.

The strategy to improve thread-safety adopted is to move locks present
in TFunction to lower-level interfaces in TCling or TClingMethodInfo
whenever possible. In few cases, keep a lock in the TFunction method if
it needs to call multiple locking functions. Still, leave the locks in
the called functions so that they can be thread-safe interfaces for
other callers.
Even though it would be better to place the locks in TMetaUtils
directly, TClingUtils.cxx cannot depend from TInterpreter.h (for the
declaration of gInterpreterMutex) as that would cause a circular
dependency. For now, leave the locks in TClingTypeInfo.
@phsft-bot
Copy link
Collaborator

Build failed on ROOT-ubuntu2004/python3.
Running on root-ubuntu-2004-1.cern.ch:/home/sftnight/build/workspace/root-pullrequests-build
See console output.

Failing tests:

@vepadulano vepadulano changed the title Increase thread safety to fix RBatchGenerator tutorials segfaults [skip-ci] Increase thread safety to fix RBatchGenerator tutorials segfaults Sep 4, 2023
@vepadulano vepadulano changed the title [skip-ci] Increase thread safety to fix RBatchGenerator tutorials segfaults Increase thread safety to fix RBatchGenerator tutorials segfaults Sep 4, 2023
@vepadulano
Copy link
Member Author

@phsft-bot build with flags -DCTEST_TEST_EXCLUDE_NONE=ON

@phsft-bot
Copy link
Collaborator

Starting build on ROOT-performance-centos8-multicore/soversion, ROOT-ubuntu2204/nortcxxmod, ROOT-ubuntu2004/python3, mac11/noimt, mac12arm/cxx20, windows10/default with flags -DCTEST_TEST_EXCLUDE_NONE=ON
How to customize builds

@phsft-bot
Copy link
Collaborator

Build failed on ROOT-ubuntu2204/nortcxxmod.
Running on root-ubuntu-2204-1.cern.ch:/home/sftnight/build/workspace/root-pullrequests-build
See console output.

Failing tests:

@phsft-bot
Copy link
Collaborator

Build failed on ROOT-ubuntu2004/python3.
Running on root-ubuntu-2004-1.cern.ch:/home/sftnight/build/workspace/root-pullrequests-build
See console output.

Failing tests:

@phsft-bot
Copy link
Collaborator

Build failed on windows10/default.
Running on null:C:\build\workspace\root-pullrequests-build
See console output.

Failing tests:

@vepadulano
Copy link
Member Author

Re-requesting review from @pcanal after applying suggestions. The RBatchGenerator tutorials are passing on Linux and mac nodes, other failing TMVA tests are being handled separately and are not related to this PR

Copy link
Member

@pcanal pcanal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks.

@vepadulano vepadulano merged commit 93ffe95 into root-project:master Sep 8, 2023
9 of 14 checks passed
maksgraczyk pushed a commit to maksgraczyk/root that referenced this pull request Jan 12, 2024
Original message of upstream commit by Richard Smith,
llvm/llvm-project@61c7a9140b:
---
Commit to a primary definition for a class when we load its first
member.

Previously, we wouldn't do this if the first member loaded is within a
definition that's added to a class via an update record, which happens
when template instantiation adds a class definition to a declaration
that was imported from an AST file.

This would lead to classes having member functions whose getParent
returned a class declaration that wasn't the primary definition, which
in turn caused the vtable builder to build broken vtables.

I don't yet have a reduced testcase for the wrong-code bug here, because
the setup required to get us into the broken state is very subtle, but
have confirmed that this fixes it.
---

This fixes an assertion in CodeGenFunction::EmitCXXDestructorCall():
Assertion `ThisTy->getAsCXXRecordDecl() == DtorDecl->getParent() &&
"Pointer/Object mixup"' failed.
which was already seen during the upgrade to LLVM 13 in one tutorial
on CentOS 8 and "solved" by commit ffe8679 ("Relax assertion on
generating destructor call"). Due to the nature of this problem, the
assertion failure went away with unrelated changes so I reverted the
change in 2b997ad. Now the problem comes back with the upgrade to
LLVM 16 and also in master when trying to enable the RBatchGenerator
tutorials in root-project#13302, both
on macOS this time. Luckily, the underlying cause was properly fixed
in upstream LLVM just last week, so backport that commit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants