Symbol tables are non-deterministic #7

Open
mhx opened this issue Aug 3, 2022 · 2 comments
Comments

mhx commented Aug 3, 2022

I'd appreciate it if you could take a look at pull request #6.

I've noticed that symbol tables built from the exact same input data are not deterministic, and I believe this is due to a bug in the makeSample function. I think the PR fixes this, but I'm not entirely sure it's the correct fix.

peterboncz (Collaborator) commented Nov 17, 2022

Many apologies for reacting months late. My notifications were not set up correctly, so I only saw the issue when I happened to browse here.

So this code was indeed flawed; thanks for spotting it.

The code was also not very readable, I realise. So while your fix was correct and would make the symbol tables deterministic, it did not fully address the underlying problem.

What is happening here is that we count the frequency of occurrence of single symbols, as well as of pairs of two subsequent symbols. The loop is a while(true) with the exit condition in the middle, which is not very readable: the single-symbol counting for the first symbol happens inside the loop (even though that first symbol is found outside of it), then we go on to get the next symbol and count the two-subsequent-symbols pair, and only in the next iteration do we do the single-symbol counting for the symbol we just found. This keeps the code short and avoids repetition, but it is not very clear, and it leads to bugs.
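
A minimal sketch of that loop structure, just to illustrate the flow described above (this is not the actual fsst code; findLongestSymbol, symbolLength, singleCount and pairCount are illustrative stand-ins):

```cpp
#include <cstdint>
#include <vector>

// Illustrative stand-ins, not the real fsst API:
uint16_t findLongestSymbol(const uint8_t* cur, const uint8_t* end); // longest symbol matching at cur
size_t   symbolLength(uint16_t code);                               // byte length of that symbol

void countFrequencies(const uint8_t* cur, const uint8_t* end,
                      std::vector<size_t>& singleCount,
                      std::vector<std::vector<size_t>>& pairCount) {
    uint16_t code1 = findLongestSymbol(cur, end); // first symbol, found outside the loop
    cur += symbolLength(code1);
    while (true) {
        singleCount[code1]++;                     // single-symbol count for the symbol found last
        if (cur >= end) break;                    // exit condition in the middle of the loop
        uint16_t code2 = findLongestSymbol(cur, end);
        cur += symbolLength(code2);
        pairCount[code1][code2]++;                // count the pair of two subsequent symbols
        code1 = code2;                            // its single-symbol count happens next iteration
    }
}
```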

In addition to counting single symbols and concatenations of subsequent symbols, we count more: to make FSST converge more quickly and more stably, besides always considering the next symbol (the longest symbol that matches the next bytes), we also make it consider just the next byte (i.e. a single-byte symbol). The rationale is that symbols get longer over the iterations, but may also "die", i.e. not make it into the next round. The byte pattern then fully disappears from the symbol table.

Given enough sampling iterations, such valuable byte patterns that "die" would be reconstructed by starting all over again from single bytes, concatenating these, and so on. But that takes more iterations, and since we stop at some point, we may simply have lost long symbols that never had the chance to fight their way back into the population. So, always considering these smaller symbols or smaller symbol extensions was added, and it improves the achieved compression ratio somewhat.

But we were just reading *cur to get that first byte for the extension of symbol code1, and that was plain wrong: at that point cur has already been advanced, so we are reading the wrong byte there.
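
Sticking with the hypothetical sketch above (byteCode() is another stand-in that maps a raw byte to its single-byte symbol code; this is not the actual fsst code or fix), the difference looks roughly like this:

```cpp
// Buggy: by the time this line runs, cur has already been advanced past the
// next symbol, so *cur is no longer the byte at which that symbol starts.
pairCount[code1][byteCode(*cur)]++;

// One way to get it right: read the byte while cur still points at the start
// of the next symbol, before cur is advanced.
uint8_t firstByte = *cur;                  // first byte of the next symbol
uint16_t code2 = findLongestSymbol(cur, end);
cur += symbolLength(code2);
pairCount[code1][code2]++;                 // consider extending code1 with the whole next symbol
pairCount[code1][byteCode(firstByte)]++;   // and with just its first byte
```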

mhx (Author) commented Nov 18, 2022

Thanks for the detailed response!

I was quite certain that I didn't understand what was going on, so my "fix" was really just addressing the symptoms and not the root cause.

Many thanks for your changes (and for the whole project, of course). It turns out that the latest code actually improves the generation of the symbol tables, resulting in a (very slight) improvement in overall compression for my use case (storing metadata lookup tables in a high-compression file system).
