preserve structure of timezone designation list #5581

Open
wants to merge 1 commit into base: main

Conversation

petrosagg

The intention of this PR is to maintain the structure of the binary data in the DataBlock struct. The current code produces a list of timezone designations that has the same length as the list of local time records with the i-th timezone designation corresponding to the i-th local time record. In this context the idx field of the local time record does not correspond to a position in the timezone designation list.

In this alternative implementation the timezone designation list directly corresponds to the null separated list of designations in the binary data. Since the idx field of each local time record can refer anywhere in that string, even in the middle of a designation, a helper method is added so that the correct substring is returned.

This allows the DataBlock to capture the data exactly and also potentially re-encode them without loss.

@CLAassistant

CLAassistant commented Sep 23, 2024

CLA assistant check
All committers have signed the CLA.

@robertbastian robertbastian marked this pull request as ready for review September 24, 2024 08:28
@robertbastian robertbastian requested a review from a team as a code owner September 24, 2024 08:28
@robertbastian robertbastian marked this pull request as draft September 24, 2024 08:29
@petrosagg petrosagg marked this pull request as ready for review September 24, 2024 10:01
@Manishearth
Member

r? @sffc @nekevss

@sffc
Member

sffc commented Sep 24, 2024

@nordzilla Can you take a look? It is in the tzif crate.

@sffc sffc removed their request for review September 24, 2024 18:08
Member

@nordzilla nordzilla left a comment

⭐ Praise

@petrosagg, thanks for taking the time to sift through the parsing code, diagnose a potential issue, and submit a pull request.


Background

It's been over two years since I wrote this code, so I had to take a moment to get back up to speed on everything.

I see what you're wanting to do here and understand why, but I'm not sure I fully agree with the changes, at least in terms of ICU4X's use case. I also don't think the implementation here is quite what you intended it to be.

I'm going to address the following in this review comment:

  1. Explain the rationale behind the original implementation.
  2. Discuss the rationale behind your new implementation.
  3. Discuss why the current implementation is not quite correct, along with what would need to happen to make it correct.
  4. Open the floor for further discussion about what to do next.

The Original Implementation

TZif is a compressed binary format that utilizes self-referential indices for these time-zone designations.

Consider the string "LMT\0AEDT\0AEST\0".

The compressed format contains only this single string and defines indices into it; each index is read up to the next null-terminator byte b'\0'.

  • The index 0 would refer to LMT.
  • The index 4 would refer to AEDT.
  • The index 9 would refer to AEST.

The compression goes even further, by allowing for an index to start in the middle of a string.

  • The index 5 would refer to EDT.
  • The index 10 would refer to EST.
  • (Hypothetically) the index 12 would refer to T.
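
A minimal sketch of this lookup (a hypothetical helper, not code from the tzif crate): an index is resolved by reading from that byte offset up to the next b'\0'.

    /// Hypothetical helper for illustration only: read the designation that
    /// starts at byte offset `idx` in the raw, NUL-delimited designation data.
    fn designation_at(raw: &[u8], idx: usize) -> Option<&str> {
        let rest = raw.get(idx..)?;
        let end = rest.iter().position(|b| *b == b'\0')?;
        std::str::from_utf8(&rest[..end]).ok()
    }

    fn main() {
        let raw = b"LMT\0AEDT\0AEST\0";
        assert_eq!(designation_at(raw, 0), Some("LMT"));
        assert_eq!(designation_at(raw, 5), Some("EDT"));
        assert_eq!(designation_at(raw, 12), Some("T"));
    }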

The purpose of this compression in TZif is that many records can share the same string. There may be five records that all map to the substring AEDT.

Since performance, allocations, runtime, etc. are not critical concerns for the data-generation side of ICU4X, I made the choice to unroll all of these strings into a Vec that is a 1:1 mapping with the records.

Consider the following test case that corresponds to the explanation above:

    #[test]
    fn parse_time_zone_designations() {
        assert_parse_eq!(
            time_zone_designations(
                14,
                vec![
                    LocalTimeTypeRecord {
                        idx: 0,
                        ..Default::default()
                    },
                    LocalTimeTypeRecord {
                        idx: 4,
                        ..Default::default()
                    },
                    LocalTimeTypeRecord {
                        idx: 9,
                        ..Default::default()
                    },
                    LocalTimeTypeRecord {
                        idx: 5,
                        ..Default::default()
                    },
                    LocalTimeTypeRecord {
                        idx: 10,
                        ..Default::default()
                    },
                    LocalTimeTypeRecord {
                        idx: 12,
                        ..Default::default()
                    },
                    LocalTimeTypeRecord {
                        idx: 0,
                        ..Default::default()
                    },
                ]
            ),
            "LMT\0AEDT\0AEST\0",
            vec![
                String::from("LMT"),
                String::from("AEDT"),
                String::from("AEST"),
                String::from("EDT"),
                String::from("EST"),
                String::from("T"),
                String::from("LMT"),
            ],
        );
    }

In this implementation, the local time type record at index 0 of its vector (not the idx value) maps to the time zone designation at index 0 of the designation vector, and so on.

This makes it easy to zip together the two collections, pulling the offset from the record and matching it with the string.
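
For example, a sketch of that pairing (assuming the parallel vectors produced above, with LocalTimeTypeRecord being the crate's record type):

    // Sketch only: with the unrolled 1:1 layout, records and designations pair
    // up positionally, so no index arithmetic is needed.
    fn paired<'a>(
        records: &'a [LocalTimeTypeRecord],
        designations: &'a [String],
    ) -> impl Iterator<Item = (&'a LocalTimeTypeRecord, &'a str)> + 'a {
        records.iter().zip(designations.iter().map(String::as_str))
    }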

It does, however, mean that the same string may be allocated multiple times in the parsed data.


The Implementation In This PR

@petrosagg it looks like you noticed that the idx values in the parsed local time type records no longer correspond to anything meaningful: an idx of 9 is an offset into the compressed TZif string, which is no longer present in the parsed data.

It looks like what you've attempted here is to preserve the compressed TZif format, making each idx value mean something again within the source data.

The implementation, however, is stuck halfway between the two approaches.

You're still splitting the string up into a Vec<String> delimited by the null-terminator character:

    count_min_max(charcnt, charcnt, any()).map(|bytes: Vec<u8>| {
        bytes
            .split_inclusive(|b| *b == 0)
            .map(|s| String::from_utf8_lossy(&s[0..s.len() - 1]).into_owned())
            .collect()
    })

But your helper function treats the idx value, which is an index into the original byte string with multiple null terminators, instead as an index into the first string in the vector that is long enough for the index.

    impl DataBlock {
        /// Retrieves the timezone designation at index `idx`.
        pub fn time_zone_designation(&self, mut idx: usize) -> Option<&str> {
            self.time_zone_designations.iter().find_map(|d| {
                if idx <= d.len() {
                    Some(&d[idx..])
                } else {
                    idx -= d.len() + 1;
                    None
                }
            })
        }
    }

How to Make This Implementation Correct

In order to fully achieve what you're trying to do here, you would have to preserve the original string "LMT\0AEDT\0AEST\0" (not a vector of separated strings), and then the indices would truly match to something meaningful again.

A Discussion About How To Move Forward

I'm still not convinced that reverting back to a compressed string is the best path here.

Perhaps @petrosagg could share more about their use case, or the reason why they want this.

@sffc could you read this over, let me know if anything feels confusing or needs further clarification, and give your input from an ICU4X-maintainer perspective?

utils/tzif/src/data/tzif.rs (3 resolved review threads)
@sffc sffc self-requested a review September 24, 2024 21:47
@petrosagg
Author

Perhaps @petrosagg could share more about their use case, or the reason why they want this.

Yeah, definitely! I'm porting a medium-sized C datetime parsing library to Rust piece by piece. I'm currently attempting to replace the code that loads a tzif file into memory, which is what this library aims to do as well. To ensure that I'm not accidentally altering the behavior of the library, I'm writing fuzzing tests that compare the two implementations. However, for that to work correctly I want to be able to load the tzif file in an identical way.

The unrolling of timezone designations that happens in this library prevents me from doing so, since some information about how the designation list was laid out is lost during parsing. So this PR maintains the ability to determine the correct timezone designation in all the edge cases that we discussed in the other comment, while also preserving more of the original structure of the file.

An additional benefit, which I don't need right now, is that a tzif file parsed with the current code is no longer in a form that can be correctly serialized back to disk. To do so, the idx field of each local time type record would have to be rewritten and the list of designations would have to be deduplicated. This complication is avoided by keeping the original structure, as done in this PR, which turns serialization back into a tzif file into the simple process of writing out the fields one by one.
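
As a rough sketch of that re-encoding (hypothetical code, not part of this PR), the designation block can be re-emitted byte for byte and the stored idx fields reused unchanged:

    // Hypothetical sketch: rebuild the NUL-delimited designation block from the
    // preserved list. For ["LMT", "AEDT", "AEST"] this reproduces
    // "LMT\0AEDT\0AEST\0" exactly, so every record's `idx` stays valid.
    fn encode_designations(designations: &[String]) -> Vec<u8> {
        let mut out = Vec::new();
        for d in designations {
            out.extend_from_slice(d.as_bytes());
            out.push(b'\0');
        }
        out
    }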

@nordzilla
Member

nordzilla commented Sep 25, 2024

Hey @petrosagg

Thanks for all the further explanation and discussion!

I had misinterpreted your code in an attempt to finish this review quickly before the end of my day.

I apologize for that!


I do believe that the code/implementation is correct.

After thinking about it more last night, I was going to suggest that we could just remove the idx property and have the local time type records directly contain their strings after parsing.

But now it's more clear that there is value for you in being able to round-trip the parsed data back into the original format (which is not a requirement for ICU4X's use case).

I have no strong desire to keep the old code if it is more useful to the Rust community to have it structured like this. Using it for ICU4X data generation would only be slightly different.


I'm happy to accept this code if it's fine with @sffc. I haven't been an active maintainer/contributor to ICU4X for a while, so I feel better delegating that final decision.

This would be a breaking change to the tzif crate and would require a version bump.


In terms of further review, I would like to see a slightly more detailed explanation of the helper function, more than just the one-liner: /// Retrieves the timezone designation at index idx.

It would be helpful to mention what the original format looks like, that the vector of strings is preserved in the original order, and why subtracting from the idx value makes sense given that the original format was delimited with null terminators.

That will help prevent people like me from misinterpreting the function in the future 😅
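
One possible shape for that expanded comment (a sketch, not prescribed wording):

    /// Retrieves the timezone designation referred to by `idx`.
    ///
    /// In the TZif binary data, the designations are stored as a single
    /// NUL-delimited string (e.g. "LMT\0AEDT\0AEST\0"), and `idx` is a byte
    /// offset into that string, possibly landing in the middle of a
    /// designation. `time_zone_designations` keeps the designations split on
    /// the NUL terminators but in their original order, so the offset is
    /// resolved by subtracting each designation's length plus one (for the
    /// removed terminator) until it falls within a designation.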

@nordzilla nordzilla self-requested a review September 25, 2024 14:16
time_zone_designations

    count_min_max(charcnt, charcnt, any()).map(|bytes: Vec<u8>| {
        bytes
            .split_inclusive(|b| *b == 0)
Member

@nordzilla nordzilla Sep 25, 2024

🤏 nitpick: style

I would prefer to see *b == b'\0' here to preserve the relationship to the idea that this acts like a C-style null terminator character.
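
That is:

    .split_inclusive(|b| *b == b'\0')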

@robertbastian robertbastian added the waiting-on-author PRs waiting for action from the author for >7 days label Oct 11, 2024