-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ntuple] Fix column ID order #16621
base: master
Are you sure you want to change the base?
[ntuple] Fix column ID order #16621
Conversation
5de7199
to
d42b730
Compare
Fix the mirroring of the inner model in the buffered sink. On late model extension, the buffered sink calls forwards UpdateSchema() to the inner sink with a derived changeset. In that derived changeset, the field map for projected fields was mistakenly using the outer models source fields. Fixed to using the inner models source fields.
During serialization and deserialization, issue logical column IDs for all the physical columns first and only then for the logical IDs. Since the descriptor is built in "schema update steps", this means that the logical column IDs of alias columns can still change during schema construction. The physical column IDs, however, are fix upon passing a column to the descriptor builder.
fieldMap[p] = fInnerModel->FindField(projectedFields.GetSourceField(field)->GetQualifiedFieldName()); | ||
auto targetIt = cloned->begin(); | ||
for (auto &f : *field) | ||
fieldMap[&(*targetIt++)] = projectedFields.GetSourceField(&f); | ||
const_cast<RNTupleModel::RProjectedFields &>(fInnerModel->GetProjectedFields()).Add(std::move(cloned), fieldMap); | ||
fieldMap[&(*targetIt++)] = fInnerModel->FindField(projectedFields.GetSourceField(&f)->GetQualifiedFieldName()); | ||
fInnerModel->fProjectedFields->Add(std::move(cloned), fieldMap); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have unit test coverage for this? This seems like we should have caught it in the past...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has been uncovered by the unit test modification in this PR. It's a tricky one to pin down because the previous logic is fundamentally wrong, mixing fields from different models.
That said, I'm thinking that a "column hell" test would be good. A test that combines
- A model with a nested hierarchy of fixed-size and variable length arrays
- Projected regular fields
- Projected extended fields
- Regular extended fields
- The last two in the same and different clusters and different cluster groups
- Multiple column representations for those columns
- Extended column representations
That requires some thinking and may be for a different PR though.
std::vector<DescriptorId_t> aliasColumns; | ||
aliasColumns.reserve(fDescriptor.GetNLogicalColumns() - fDescriptor.GetNPhysicalColumns()); | ||
for (const auto &[_, c] : fDescriptor.fColumnDescriptors) { | ||
if (c.IsAliasColumn()) | ||
aliasColumns.emplace_back(c.GetLogicalId()); | ||
} | ||
|
||
for (auto id : aliasColumns) { | ||
auto c = fDescriptor.fColumnDescriptors[id].Clone(); | ||
fDescriptor.fColumnDescriptors.erase(id); | ||
for (auto &link : fDescriptor.fFieldDescriptors[c.fFieldId].fLogicalColumnIds) { | ||
if (link == c.fLogicalColumnId) { | ||
link += offset; | ||
break; | ||
} | ||
} | ||
c.fLogicalColumnId += offset; | ||
fDescriptor.fColumnDescriptors.emplace(c.fLogicalColumnId, std::move(c)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially this looked more complicated than needed, and especially the nested loop scared me a bit. Now I realize this only iterates over the columns of the projected field that the currently edited column references? So the inner loop has only very few iterations, and not thousands?
Otherwise, if I understand the new invariant correctly, alias columns will always have IDs fDescriptor.GetNPhysicalColumns()
to fDescriptor.GetNLogicalColumns()
. Then we should only need one loop over those IDs and adjust them accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially this looked more complicated than needed, and especially the nested loop scared me a bit. Now I realize this only iterates over the columns of the projected field that the currently edited column references? So the inner loop has only very few iterations, and not thousands?
Yes
Otherwise, if I understand the new invariant correctly, alias columns will always have IDs
fDescriptor.GetNPhysicalColumns()
tofDescriptor.GetNLogicalColumns()
. Then we should only need one loop over those IDs and adjust them accordingly.
Good point! Done (and uncovered a bug along the way: we need to ensure that the columns are moved from top to bottom. The previous version ended up doing this by chance [unordered map iteration]).
std::uint32_t nNewPhysicalColumns = 0; | ||
for (auto f : changeset.fAddedFields) { | ||
nNewPhysicalColumns += getNColumns(*f); | ||
for (const auto &descendant : *f) | ||
nNewPhysicalColumns += getNColumns(descendant); | ||
} | ||
fDescriptorBuilder.ShiftAliasColumns(nNewPhysicalColumns); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to ask about this again, I forgot the answer: Do we already support adding new column representations to existing fields? Do we want to support this eventually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is supported by the s11n and tested here: https://github.com/root-project/root/blob/master/tree/ntuple/v7/test/ntuple_serialize.cxx#L1402
But not yet used by the merger.
model->AddProjectedField(std::make_unique<RField<float>>("aliasE"), [](const std::string &) { return "E"; }); | ||
|
||
RNTupleWriteOptions options; | ||
options.SetUseBufferedWrite(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This disables buffered writing, probably not to trigger the bug fixed in the first commit? We should probably extend our unit testing coverage for late schema extension...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I forgot to remove it... done now.
Thinking about this again, would it maybe help to keep projected fields and alias columns separate when building the descriptor and only assign the column IDs when once all physical column IDs are known? |
For reading that may work. For writing, the problem is that we need to write out the header (including the projected fields) before we know about possible model extensions. |
Test Results 14 files 14 suites 3d 15h 52m 4s ⏱️ For more details on these failures, see this check. Results for commit d98219a. |
Fixes the situation where the regular header uses projections and is then extended.