[ntuple] Fix column ID order #16621

jblomer · 2024-10-07T12:45:12Z

Fixes the situation where the regular header uses projections and is then extended.

Fix the mirroring of the inner model in the buffered sink. On late model extension, the buffered sink forwards UpdateSchema() to the inner sink with a derived changeset. In that derived changeset, the field map for projected fields mistakenly used the outer model's source fields. It now uses the inner model's source fields.

During serialization and deserialization, issue logical column IDs for all the physical columns first and only then for the alias columns. Since the descriptor is built in "schema update steps", this means that the logical column IDs of alias columns can still change during schema construction. The physical column IDs, however, are fixed once a column has been passed to the descriptor builder.
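The ordering rule can be illustrated with a small sketch. It is not the descriptor builder's code: `ToyColumn` and `AssignLogicalIds` are hypothetical names, and the only assumption taken from the description above is that physical columns receive their IDs before any alias column does.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a column entry; not an RNTuple type.
struct ToyColumn {
   std::string fName;
   bool fIsAlias = false;
   std::uint64_t fLogicalId = 0;
};

// Assign logical IDs in two passes: physical columns first (in the order given),
// alias columns afterwards. Physical IDs therefore never depend on how many
// alias columns exist.
inline void AssignLogicalIds(std::vector<ToyColumn> &columns)
{
   std::uint64_t nextId = 0;
   for (auto &c : columns) {
      if (!c.fIsAlias)
         c.fLogicalId = nextId++;
   }
   for (auto &c : columns) {
      if (c.fIsAlias)
         c.fLogicalId = nextId++;
   }
}
```

With this scheme, appending a physical column in a later schema update step keeps all existing physical IDs stable, while every alias ID moves up by one.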

hahnjo · 2024-10-07T12:55:17Z

tree/ntuple/v7/src/RPageSinkBuf.cxx

+      fieldMap[p] = fInnerModel->FindField(projectedFields.GetSourceField(field)->GetQualifiedFieldName());
      auto targetIt = cloned->begin();
      for (auto &f : *field)
-         fieldMap[&(*targetIt++)] = projectedFields.GetSourceField(&f);
-      const_cast<RNTupleModel::RProjectedFields &>(fInnerModel->GetProjectedFields()).Add(std::move(cloned), fieldMap);
+         fieldMap[&(*targetIt++)] = fInnerModel->FindField(projectedFields.GetSourceField(&f)->GetQualifiedFieldName());
+      fInnerModel->fProjectedFields->Add(std::move(cloned), fieldMap);


Can we have unit test coverage for this? This seems like we should have caught it in the past...

It has been uncovered by the unit test modification in this PR. It's a tricky one to pin down because the previous logic is fundamentally wrong, mixing fields from different models.
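To make the "mixing fields from different models" problem concrete, here is a minimal sketch with toy types. `ToyField`, `ToyModel`, and `RemapToInner` are hypothetical and only mimic the idea of the fix: a source field taken from the outer model is translated into the inner model by looking it up under its qualified name.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical toy types; they only mimic the ownership structure of a model.
struct ToyField {
   std::string fQualifiedName;
};

struct ToyModel {
   std::vector<std::unique_ptr<ToyField>> fFields;

   ToyField *AddField(const std::string &name)
   {
      fFields.push_back(std::make_unique<ToyField>(ToyField{name}));
      return fFields.back().get();
   }

   // Look up a field by its qualified name; returns nullptr if not found.
   const ToyField *FindField(const std::string &name) const
   {
      for (const auto &f : fFields) {
         if (f->fQualifiedName == name)
            return f.get();
      }
      return nullptr;
   }

   // A clone owns its own field objects, so pointers are not interchangeable
   // between the original and the clone.
   ToyModel Clone() const
   {
      ToyModel result;
      for (const auto &f : fFields)
         result.AddField(f->fQualifiedName);
      return result;
   }
};

// Translate a source field of the outer model into the corresponding field of
// the inner model by name, instead of reusing the outer model's pointer.
inline const ToyField *RemapToInner(const ToyField &outerSource, const ToyModel &inner)
{
   return inner.FindField(outerSource.fQualifiedName);
}
```

A field map filled with the outer model's pointers would refer to objects the inner model does not own, which is the kind of mismatch the changed lines avoid.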

That said, I'm thinking that a "column hell" test would be good. A test that combines:

- A model with a nested hierarchy of fixed-size and variable-length arrays
- Projected regular fields
- Projected extended fields
- Regular extended fields
- The last two in the same and different clusters and different cluster groups
- Multiple column representations for those columns
- Extended column representations

That requires some thinking and may be for a different PR though.

hahnjo · 2024-10-07T13:25:29Z

tree/ntuple/v7/src/RNTupleDescriptor.cxx

+   std::vector<DescriptorId_t> aliasColumns;
+   aliasColumns.reserve(fDescriptor.GetNLogicalColumns() - fDescriptor.GetNPhysicalColumns());
+   for (const auto &[_, c] : fDescriptor.fColumnDescriptors) {
+      if (c.IsAliasColumn())
+         aliasColumns.emplace_back(c.GetLogicalId());
+   }
+
+   for (auto id : aliasColumns) {
+      auto c = fDescriptor.fColumnDescriptors[id].Clone();
+      fDescriptor.fColumnDescriptors.erase(id);
+      for (auto &link : fDescriptor.fFieldDescriptors[c.fFieldId].fLogicalColumnIds) {
+         if (link == c.fLogicalColumnId) {
+            link += offset;
+            break;
+         }
+      }
+      c.fLogicalColumnId += offset;
+      fDescriptor.fColumnDescriptors.emplace(c.fLogicalColumnId, std::move(c));
+   }


Initially this looked more complicated than needed, and especially the nested loop scared me a bit. Now I realize this only iterates over the columns of the projected field that the currently edited column references? So the inner loop has only very few iterations, and not thousands?

Otherwise, if I understand the new invariant correctly, alias columns will always have IDs fDescriptor.GetNPhysicalColumns() to fDescriptor.GetNLogicalColumns(). Then we should only need one loop over those IDs and adjust them accordingly.

> Initially this looked more complicated than needed, and especially the nested loop scared me a bit. Now I realize this only iterates over the columns of the projected field that the currently edited column references? So the inner loop has only very few iterations, and not thousands?

Yes

> Otherwise, if I understand the new invariant correctly, alias columns will always have IDs fDescriptor.GetNPhysicalColumns() to fDescriptor.GetNLogicalColumns(). Then we should only need one loop over those IDs and adjust them accordingly.

Good point! Done (and uncovered a bug along the way: we need to ensure that the columns are moved from top to bottom. The previous version ended up doing this by chance [unordered map iteration]).
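The "top to bottom" requirement can be sketched as follows. This is not the revised `ShiftAliasColumns()` from the PR. It is a toy version under two assumptions: the column IDs are dense, and the alias columns occupy the range from `nPhysical` up to the map size. `ToyColumn` and `ColumnMap` are hypothetical, and the update of the field-to-column links done in the hunk above is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical toy column; the real descriptor stores much more.
struct ToyColumn {
   std::string fName;
   std::uint64_t fLogicalId;
};
using ColumnMap = std::unordered_map<std::uint64_t, ToyColumn>;

// Shift all alias columns, assumed to have IDs in [nPhysical, columns.size()),
// up by `offset`. Walking from the highest ID down guarantees that the target
// key is always free, even when offset is smaller than the number of aliases.
inline void ShiftAliasColumns(ColumnMap &columns, std::uint64_t nPhysical, std::uint64_t offset)
{
   if (offset == 0)
      return;
   const std::uint64_t nLogical = columns.size();
   for (std::uint64_t id = nLogical; id > nPhysical; --id) {
      auto node = columns.extract(id - 1); // node handles require C++17
      if (node.empty())
         continue; // IDs were not dense; nothing to move for this ID
      node.key() = id - 1 + offset;
      node.mapped().fLogicalId = id - 1 + offset;
      auto res = columns.insert(std::move(node));
      assert(res.inserted); // would fail if the target ID were still occupied
      (void)res;
   }
}
```

Moving the lowest alias ID first would, for an offset of 1 and two alias columns, target an ID that the second alias column still occupies. Iterating an unordered map gives no ordering guarantee, so relying on it only works by chance.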

hahnjo · 2024-10-07T13:26:58Z

tree/ntuple/v7/src/RPageStorage.cxx

+      std::uint32_t nNewPhysicalColumns = 0;
+      for (auto f : changeset.fAddedFields) {
+         nNewPhysicalColumns += getNColumns(*f);
+         for (const auto &descendant : *f)
+            nNewPhysicalColumns += getNColumns(descendant);
+      }
+      fDescriptorBuilder.ShiftAliasColumns(nNewPhysicalColumns);


Sorry to ask about this again, I forgot the answer: Do we already support adding new column representations to existing fields? Do we want to support this eventually?

Yes, it is supported by the s11n and tested here: https://github.com/root-project/root/blob/master/tree/ntuple/v7/test/ntuple_serialize.cxx#L1402

But not yet used by the merger.
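For readers following the hunk at the top of this thread, the offset passed to the alias shift is the number of physical columns that the added fields bring in, including those of all their subfields. A toy sketch of that count follows. `ToyField`, `CountColumns`, and `CountNewPhysicalColumns` are hypothetical, and it assumes, as the variable name `descendant` suggests, that iterating a field visits all of its descendants.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical toy field: a number of directly attached physical columns plus subfields.
struct ToyField {
   std::uint32_t fNColumns = 0;
   std::vector<std::unique_ptr<ToyField>> fSubfields;
};

// Count the columns of a field and, recursively, of all its descendants.
inline std::uint32_t CountColumns(const ToyField &field)
{
   std::uint32_t n = field.fNColumns;
   for (const auto &sub : field.fSubfields)
      n += CountColumns(*sub);
   return n;
}

// Sum over all fields added by one schema update; this is the amount by which
// the alias column IDs would need to move up.
inline std::uint32_t CountNewPhysicalColumns(const std::vector<const ToyField *> &addedFields)
{
   std::uint32_t n = 0;
   for (const auto *f : addedFields)
      n += CountColumns(*f);
   return n;
}
```

New column representations added to existing fields would also add physical columns; this sketch does not cover that case.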

hahnjo · 2024-10-07T13:27:46Z

tree/ntuple/v7/test/ntuple_modelext.cxx

+      model->AddProjectedField(std::make_unique<RField<float>>("aliasE"), [](const std::string &) { return "E"; });
+
+      RNTupleWriteOptions options;
+      options.SetUseBufferedWrite(false);


This disables buffered writing, probably not to trigger the bug fixed in the first commit? We should probably extend our unit testing coverage for late schema extension...

Thanks! I forgot to remove it... done now.

hahnjo · 2024-10-07T14:00:06Z

Thinking about this again, would it maybe help to keep projected fields and alias columns separate when building the descriptor and only assign the column IDs once all physical column IDs are known?

jblomer · 2024-10-07T14:05:22Z

> Thinking about this again, would it maybe help to keep projected fields and alias columns separate when building the descriptor and only assign the column IDs once all physical column IDs are known?

For reading that may work. For writing, the problem is that we need to write out the header (including the projected fields) before we know about possible model extensions.

github-actions · 2024-10-07T17:20:29Z

Test Results

14 files 14 suites 3d 15h 52m 4s ⏱️
2 712 tests 2 711 ✅ 0 💤 1 ❌
35 706 runs 35 705 ✅ 0 💤 1 ❌

For more details on these failures, see this check.

Results for commit d98219a.

jblomer added the in:RNTuple label Oct 7, 2024

jblomer requested review from hahnjo, pcanal, silverweed and enirolf October 7, 2024 12:45

jblomer self-assigned this Oct 7, 2024

jblomer force-pushed the fix-alias-column-ids branch from 5de7199 to d42b730 Compare October 7, 2024 12:50

jblomer added 3 commits October 7, 2024 14:50

[NFC][ntuple] fix column ID order in the specification

06700aa

hahnjo reviewed Oct 7, 2024


[ntuple] fix-up in test

acdb92e

[ntuple] simplify/fix ShiftAliasColumns()

d98219a

jblomer requested a review from hahnjo October 7, 2024 15:31
