Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ntuple] Fix column ID order #16621

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

jblomer
Copy link
Contributor

@jblomer jblomer commented Oct 7, 2024

Fixes the situation where the regular header uses projections and is then extended.

Fix the mirroring of the inner model in the buffered sink. On late model
extension, the buffered sink calls forwards UpdateSchema() to the inner
sink with a derived changeset. In that derived changeset, the field map
for projected fields was mistakenly using the outer models source
fields. Fixed to using the inner models source fields.
During serialization and deserialization, issue logical column IDs for
all the physical columns first and only then for the logical IDs.

Since the descriptor is built in "schema update steps", this means that
the logical column IDs of alias columns can still change during schema
construction. The physical column IDs, however, are fix upon passing a
column to the descriptor builder.
Comment on lines +110 to +114
fieldMap[p] = fInnerModel->FindField(projectedFields.GetSourceField(field)->GetQualifiedFieldName());
auto targetIt = cloned->begin();
for (auto &f : *field)
fieldMap[&(*targetIt++)] = projectedFields.GetSourceField(&f);
const_cast<RNTupleModel::RProjectedFields &>(fInnerModel->GetProjectedFields()).Add(std::move(cloned), fieldMap);
fieldMap[&(*targetIt++)] = fInnerModel->FindField(projectedFields.GetSourceField(&f)->GetQualifiedFieldName());
fInnerModel->fProjectedFields->Add(std::move(cloned), fieldMap);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have unit test coverage for this? This seems like we should have caught it in the past...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has been uncovered by the unit test modification in this PR. It's a tricky one to pin down because the previous logic is fundamentally wrong, mixing fields from different models.

That said, I'm thinking that a "column hell" test would be good. A test that combines

  • A model with a nested hierarchy of fixed-size and variable length arrays
  • Projected regular fields
  • Projected extended fields
  • Regular extended fields
  • The last two in the same and different clusters and different cluster groups
  • Multiple column representations for those columns
  • Extended column representations

That requires some thinking and may be for a different PR though.

Comment on lines 1040 to 1058
std::vector<DescriptorId_t> aliasColumns;
aliasColumns.reserve(fDescriptor.GetNLogicalColumns() - fDescriptor.GetNPhysicalColumns());
for (const auto &[_, c] : fDescriptor.fColumnDescriptors) {
if (c.IsAliasColumn())
aliasColumns.emplace_back(c.GetLogicalId());
}

for (auto id : aliasColumns) {
auto c = fDescriptor.fColumnDescriptors[id].Clone();
fDescriptor.fColumnDescriptors.erase(id);
for (auto &link : fDescriptor.fFieldDescriptors[c.fFieldId].fLogicalColumnIds) {
if (link == c.fLogicalColumnId) {
link += offset;
break;
}
}
c.fLogicalColumnId += offset;
fDescriptor.fColumnDescriptors.emplace(c.fLogicalColumnId, std::move(c));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially this looked more complicated than needed, and especially the nested loop scared me a bit. Now I realize this only iterates over the columns of the projected field that the currently edited column references? So the inner loop has only very few iterations, and not thousands?

Otherwise, if I understand the new invariant correctly, alias columns will always have IDs fDescriptor.GetNPhysicalColumns() to fDescriptor.GetNLogicalColumns(). Then we should only need one loop over those IDs and adjust them accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially this looked more complicated than needed, and especially the nested loop scared me a bit. Now I realize this only iterates over the columns of the projected field that the currently edited column references? So the inner loop has only very few iterations, and not thousands?

Yes

Otherwise, if I understand the new invariant correctly, alias columns will always have IDs fDescriptor.GetNPhysicalColumns() to fDescriptor.GetNLogicalColumns(). Then we should only need one loop over those IDs and adjust them accordingly.

Good point! Done (and uncovered a bug along the way: we need to ensure that the columns are moved from top to bottom. The previous version ended up doing this by chance [unordered map iteration]).

Comment on lines +753 to +759
std::uint32_t nNewPhysicalColumns = 0;
for (auto f : changeset.fAddedFields) {
nNewPhysicalColumns += getNColumns(*f);
for (const auto &descendant : *f)
nNewPhysicalColumns += getNColumns(descendant);
}
fDescriptorBuilder.ShiftAliasColumns(nNewPhysicalColumns);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to ask about this again, I forgot the answer: Do we already support adding new column representations to existing fields? Do we want to support this eventually?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is supported by the s11n and tested here: https://github.com/root-project/root/blob/master/tree/ntuple/v7/test/ntuple_serialize.cxx#L1402

But not yet used by the merger.

model->AddProjectedField(std::make_unique<RField<float>>("aliasE"), [](const std::string &) { return "E"; });

RNTupleWriteOptions options;
options.SetUseBufferedWrite(false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This disables buffered writing, probably not to trigger the bug fixed in the first commit? We should probably extend our unit testing coverage for late schema extension...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I forgot to remove it... done now.

@hahnjo
Copy link
Member

hahnjo commented Oct 7, 2024

Thinking about this again, would it maybe help to keep projected fields and alias columns separate when building the descriptor and only assign the column IDs when once all physical column IDs are known?

@jblomer
Copy link
Contributor Author

jblomer commented Oct 7, 2024

Thinking about this again, would it maybe help to keep projected fields and alias columns separate when building the descriptor and only assign the column IDs when once all physical column IDs are known?

For reading that may work. For writing, the problem is that we need to write out the header (including the projected fields) before we know about possible model extensions.

@jblomer jblomer requested a review from hahnjo October 7, 2024 15:31
Copy link

github-actions bot commented Oct 7, 2024

Test Results

    14 files      14 suites   3d 15h 52m 4s ⏱️
 2 712 tests  2 711 ✅ 0 💤 1 ❌
35 706 runs  35 705 ✅ 0 💤 1 ❌

For more details on these failures, see this check.

Results for commit d98219a.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants