Missing dependency or clean up in TMVA test/tutorials #16553

Open
1 task
pcanal opened this issue Sep 27, 2024 · 7 comments

@pcanal
Member

pcanal commented Sep 27, 2024

Check duplicate issues.

  • Checked for duplicates

Description

On a large node (127 cores, 128 GB), I ran:

  1. ctest -j 32
  2. ctest --rerun-failed
  3. ctest -j 32

After step 1, many tests failed due to lack of resources (running out of threads, see #16552):

347:PyMVA-Keras-Classification
348:PyMVA-Keras-Regression
349:PyMVA-Keras-Multiclass
350:gtest-tmva-pymva-test-TestRModelParserKeras
984:tutorial-tmva-TMVA_SOFIE_GNN_Application
985:tutorial-tmva-TMVA_SOFIE_Keras
986:tutorial-tmva-TMVA_SOFIE_Keras_HiggsModel
988:tutorial-tmva-TMVA_SOFIE_RDataFrame
990:tutorial-tmva-TMVA_SOFIE_RSofieReader
1238:tutorial-tmva-RBatchGenerator_PyTorch-py
1239:tutorial-tmva-RBatchGenerator_TensorFlow-py
1246:tutorial-tmva-TMVA_SOFIE_Models-py
1247:tutorial-tmva-TMVA_SOFIE_RDataFrame-py
1252:tutorial-tmva-keras-GenerateModel-py
1253:tutorial-tmva-keras-MulticlassKeras-py

However, in step 2, several tests still failed (even though resources were no longer an issue):

350:gtest-tmva-pymva-test-TestRModelParserKeras
984:tutorial-tmva-TMVA_SOFIE_GNN_Application
986:tutorial-tmva-TMVA_SOFIE_Keras_HiggsModel
988:tutorial-tmva-TMVA_SOFIE_RDataFrame
990:tutorial-tmva-TMVA_SOFIE_RSofieReader
1247:tutorial-tmva-TMVA_SOFIE_RDataFrame-py

The errors listed there included:

IncrementalExecutor::executeFunction: symbol 'saxpy_' unresolved while linking [cling interface function]!
IncrementalExecutor::executeFunction: symbol 'sgemm_' unresolved while linking [cling interface function]!
tutorials/tmva/TMVA_SOFIE_RDataFrame.C:29:10: fatal error: 'Higgs_trained_model.hxx' file not found
/tutorials/tmva/TMVA_SOFIE_GNN_Application.C:10:10: fatal error: 'encoder.hxx' file not found

From this I conclude that those tests (in particular TMVA_SOFIE_RDataFrame.C and tutorials/tmva/TMVA_SOFIE_GNN_Application.C) are missing dependencies on tests that failed in the first run.

Note that tutorial-tmva-TMVA_SOFIE_Keras_HiggsModel and tutorial-tmva-TMVA_SOFIE_RDataFrame-py do indeed need TMVA_Higgs_Classification.C to run first (it says so in the output! :) ).

tutorial-tmva-TMVA_SOFIE_RSofieReader is asking for Higgs_trained_model.h5

gtest-tmva-pymva-test-TestRModelParserKeras is missing the symbol sgemm_ (see below)
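The ordering problem described above can be sketched as a small dependency graph. This is only an illustration, not how ROOT's CMake files declare test dependencies: the two edges are the ones reported in the test output, while the ctest name of the producer (`tutorial-tmva-TMVA_Higgs_Classification`) is an assumption derived from the macro name.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical ctest name for the test that runs TMVA_Higgs_Classification.C.
PRODUCER = "tutorial-tmva-TMVA_Higgs_Classification"

# Test -> set of tests that must have run successfully before it.
DEPENDS = {
    "tutorial-tmva-TMVA_SOFIE_Keras_HiggsModel": {PRODUCER},
    "tutorial-tmva-TMVA_SOFIE_RDataFrame-py": {PRODUCER},
}


def run_order(depends):
    """Return one valid execution order with prerequisites first."""
    return list(TopologicalSorter(depends).static_order())


def blocked_by(depends, failed):
    """Return tests that cannot pass because a (transitive) prerequisite failed."""
    blocked = set()
    changed = True
    while changed:
        changed = False
        for test, prereqs in depends.items():
            if test not in blocked and prereqs & (failed | blocked):
                blocked.add(test)
                changed = True
    return blocked


order = run_order(DEPENDS)
blocked = blocked_by(DEPENDS, {PRODUCER})
print(order)
print(sorted(blocked))
```

If the producer fails in a resource-starved first run, both dependents are blocked no matter how many resources are free afterwards, which would match the `--rerun-failed` behaviour only if the producer is not itself rerun or does not leave its output files behind.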

However, when rerunning in step 3 (where this time there were somehow no resource-related failures), I still got several failures:

346:gtest-tmva-pymva-test-TestRModelParserPyTorch
350:gtest-tmva-pymva-test-TestRModelParserKeras
984:tutorial-tmva-TMVA_SOFIE_GNN_Application
988:tutorial-tmva-TMVA_SOFIE_RDataFrame
990:tutorial-tmva-TMVA_SOFIE_RSofieReader

all due to:

IncrementalExecutor::executeFunction: symbol 'sgemm_' unresolved while linking [cling interface function]!

or both

IncrementalExecutor::executeFunction: symbol 'saxpy_' unresolved while linking [cling interface function]!
IncrementalExecutor::executeFunction: symbol 'sgemm_' unresolved while linking [cling interface function]!

This may be due either to a malformed result left over from the failing run (1) or to an external package that does not have the correct version number.
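To see which failures persist across runs, the lists from steps 2 and 3 can be compared by test name. The snippet below is a small helper sketch (the function name is made up for illustration); the data is copied from the two lists in this report.

```python
def failed_names(listing):
    """Parse lines of the form '<number>:<test name>' and return the set of names."""
    names = set()
    for line in listing.splitlines():
        line = line.strip()
        if line:
            _number, _, name = line.partition(":")
            names.add(name)
    return names


STEP_2 = """
350:gtest-tmva-pymva-test-TestRModelParserKeras
984:tutorial-tmva-TMVA_SOFIE_GNN_Application
986:tutorial-tmva-TMVA_SOFIE_Keras_HiggsModel
988:tutorial-tmva-TMVA_SOFIE_RDataFrame
990:tutorial-tmva-TMVA_SOFIE_RSofieReader
1247:tutorial-tmva-TMVA_SOFIE_RDataFrame-py
"""

STEP_3 = """
346:gtest-tmva-pymva-test-TestRModelParserPyTorch
350:gtest-tmva-pymva-test-TestRModelParserKeras
984:tutorial-tmva-TMVA_SOFIE_GNN_Application
988:tutorial-tmva-TMVA_SOFIE_RDataFrame
990:tutorial-tmva-TMVA_SOFIE_RSofieReader
"""

step2 = failed_names(STEP_2)
step3 = failed_names(STEP_3)

persistent = step2 & step3   # failed in both runs
only_step2 = step2 - step3   # failed only in the --rerun-failed run
only_step3 = step3 - step2   # new in the second full run

print("persistent:", sorted(persistent))
print("only in step 2:", sorted(only_step2))
print("only in step 3:", sorted(only_step3))
```

The two tests that fail only in step 2 are exactly the ones that need TMVA_Higgs_Classification.C to run first, while the four persistent ones are the candidates for the unresolved-symbol problem.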

Reproducer

ctest -j 32 # and get lots of out of resource failures
ctest --rerun-failed
ctest -j 32

ROOT version

master

Installation method

hand build

Operating system

Alma9

Additional context

jupyter-pcanal-rootdevel:quick-devel pcanal$ bin/root-config --features
cxx17 asimage builtin_clang builtin_cling builtin_gtest builtin_llvm builtin_lz4 builtin_lzma builtin_nlohmannjson builtin_openui5 builtin_tbb builtin_vdt builtin_xxhash builtin_zlib builtin_zstd clad dataframe davix gdml http imt pyroot roofit root7 rpath runtime_cxxmodules shared sqlite ssl tmva tmva-pymva tpython spectrum vdt x11 xml xrootd
@pcanal pcanal added the bug label Sep 27, 2024
@dpiparo
Member

dpiparo commented Sep 28, 2024

Hi @pcanal, thanks for this report. Hopefully the solution will also help with fewer threads.
I am not sure, though, that the "unresolved while linking" errors are due to the high thread count. Can you confirm that you do not see these errors with 8-16 threads?

dpiparo added a commit to dpiparo/root that referenced this issue Sep 29, 2024
dpiparo added a commit that referenced this issue Sep 29, 2024
@pcanal
Member Author

pcanal commented Sep 30, 2024

I am not sure, though, that the "unresolved while linking" errors are due to the high thread count.

I think you might be right. The best way forward is to track down where those missing symbols are supposed to come from.
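One quick way to check whether a given shared library on the machine exports those symbols is to probe it with `ctypes`. This is only a rough sketch: the library names `blas` and `openblas` are guesses at what might be installed, and a positive result says nothing about whether ROOT's interpreter actually loads that library.

```python
import ctypes
import ctypes.util


def library_has_symbol(library, symbol):
    """Return True/False if `library` loads, or None if it cannot be found or loaded."""
    path = ctypes.util.find_library(library)
    if path is None:
        return None
    try:
        handle = ctypes.CDLL(path)
    except OSError:
        return None
    # CDLL raises AttributeError on lookup of a missing symbol, so hasattr works here.
    return hasattr(handle, symbol)


# Candidate library names are assumptions; adjust to what the platform provides.
for lib in ("blas", "openblas"):
    for sym in ("sgemm_", "saxpy_"):
        print(lib, sym, library_has_symbol(lib, sym))
```

A result of `None` for every candidate would suggest that no BLAS implementation is visible to the dynamic loader on that node.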

@dpiparo
Member

dpiparo commented Oct 1, 2024

Thanks for the comment. At this point this issue seems to conflate two things:

  1. The dependencies of Python tests. This should have been addressed by [cmake] Correct dependencies for tutorial-tmva-TMVA_SOFIE_RDataFrame-py #16555
  2. The missing symbols.

If 1. is confirmed to be solved, I would say that at least this issue ought to be closed and one about missing symbols opened. However, even if an issue dedicated to the missing symbols is opened, it's not clear, at least to me, how the problem can be reproduced. So far we have no indication of it in our CI: can it be due to a somewhat imprecise formulation of the Python dependencies in the requirements.txt file that affects your platform?

@dpiparo
Member

dpiparo commented Oct 7, 2024

Do we have perhaps a better understanding of this issue? I understand the dependencies are now fixed. Are the symbols also cured?

@pcanal
Member Author

pcanal commented Oct 7, 2024

For the symbols, I am waiting on input on which library they are meant to come from.

@pcanal
Member Author

pcanal commented Oct 11, 2024

So I "found" that sgemm is explicitly meant to come from a BLAS implementation, and some tests seem to rely on it and still run (even though CMakeCache.txt knows BLAS was not found).

The following 3 tests fail consistently with missing BLAS symbols:

984:tutorial-tmva-TMVA_SOFIE_GNN_Application
988:tutorial-tmva-TMVA_SOFIE_RDataFrame
990:tutorial-tmva-TMVA_SOFIE_RSofieReader

but strangely, more tests fail with missing BLAS symbols when run in parallel:

346:gtest-tmva-pymva-test-TestRModelParserPyTorch
350:gtest-tmva-pymva-test-TestRModelParserKeras
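The "CMakeCache.txt knows BLAS was not found" check can be scripted. The sketch below relies only on the general cache format (`KEY:TYPE=VALUE`, with unresolved lookups stored as `NOTFOUND` or `<name>-NOTFOUND`); the sample text and the key name in it are made up for illustration and are not copied from the affected build.

```python
def blas_cache_entries(cache_text):
    """Return {key: value} for CMake cache entries whose key mentions BLAS."""
    entries = {}
    for line in cache_text.splitlines():
        line = line.strip()
        # Skip blank lines and the two comment styles used in CMakeCache.txt.
        if not line or line.startswith(("#", "//")):
            continue
        key_and_type, sep, value = line.partition("=")
        if not sep:
            continue
        key = key_and_type.split(":", 1)[0]
        if "BLAS" in key.upper():
            entries[key] = value
    return entries


def not_found(entries):
    """Return the keys whose value marks a failed lookup."""
    return sorted(
        key for key, value in entries.items()
        if value == "NOTFOUND" or value.endswith("-NOTFOUND")
    )


# Hypothetical cache excerpt; a real one would be read from the build directory.
SAMPLE = """
// Path to a library.
BLAS_blas_LIBRARY:FILEPATH=BLAS_blas_LIBRARY-NOTFOUND
CMAKE_BUILD_TYPE:STRING=Release
"""

entries = blas_cache_entries(SAMPLE)
print(entries)
print(not_found(entries))
```

On a real build, the text would come from something like `open("CMakeCache.txt").read()` in the build directory, and any key reported by `not_found` would point at tests that should not be enabled without BLAS.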

@pcanal
Member Author

pcanal commented Oct 12, 2024

See the related failures created on the CI: https://github.com/root-project/root/pull/16664/checks?check_run_id=31435842971 where we run just the TMVA tests to increase the chance of collisions. Indeed, tutorial-tmva-TMVA_SOFIE_GNN_Application fails on most platforms with:

/github/home/ROOT-CI/src/tutorials/tmva/TMVA_SOFIE_GNN_Application.C:10:10: fatal error: 'encoder.hxx' file not found
#include "encoder.hxx"
         ^~~~~~~~~~~~~

and tutorial-tmva-TMVA_RNN_Classification-py fails (on just alma9-clang) due to timeout.
