Symbol tables are non-deterministic #7

Open
mhx opened this issue Aug 3, 2022 · 2 comments
Comments

mhx commented Aug 3, 2022

I'd appreciate it if you could take a look at pull request #6.

I've noticed that symbol tables built from the exact same input data are not deterministic, and I believe this is due to a bug in the makeSample function. I think the PR fixes this, but I'm not entirely sure it's the correct fix.

peterboncz (Collaborator) commented Nov 17, 2022

Many apologies for reacting months late. My notifications were not set up correctly, so I only saw the issue when I happened to browse here.

So this code was indeed flawed; thanks for spotting it.

The code was also not very readable, I realise. So while your fix was correct and would make the symbol tables deterministic, it did not fully address the underlying problem.

What is happening here is that we count the frequency of occurrence of single symbols, as well as of pairs of two subsequent symbols. The loop is a while(true) with the exit condition in the middle, which is not very readable: the single-symbol counting for the first symbol happens inside the loop (even though that first symbol is found outside of it), then we go on to get the next symbol and count the two-subsequent-symbols pair, and only in the next iteration do we do the single-symbol counting for the symbol we just found. This keeps the code short and avoids repetition, but it is not very clear, and it leads to bugs.
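
A minimal sketch of that loop structure, just to illustrate the flow described above (this is not the actual fsst code; findLongestSymbol, symbolLength, singleCount and pairCount are illustrative stand-ins):

```cpp
#include <cstdint>
#include <vector>

// Illustrative stand-ins, not the real fsst API:
uint16_t findLongestSymbol(const uint8_t* cur, const uint8_t* end); // longest symbol matching at cur
size_t   symbolLength(uint16_t code);                               // byte length of that symbol

void countFrequencies(const uint8_t* cur, const uint8_t* end,
                      std::vector<size_t>& singleCount,
                      std::vector<std::vector<size_t>>& pairCount) {
    uint16_t code1 = findLongestSymbol(cur, end); // first symbol, found outside the loop
    cur += symbolLength(code1);
    while (true) {
        singleCount[code1]++;                     // single-symbol count for the symbol found last
        if (cur >= end) break;                    // exit condition in the middle of the loop
        uint16_t code2 = findLongestSymbol(cur, end);
        cur += symbolLength(code2);
        pairCount[code1][code2]++;                // count the pair of two subsequent symbols
        code1 = code2;                            // its single-symbol count happens next iteration
    }
}
```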

In addition to counting single symbols and concatenations of subsequent symbols, we count more: to make FSST converge more quickly and more stably, besides always considering the next symbol (the longest symbol that matches the next bytes), we also make it consider just the next byte (i.e. a single-byte symbol). The rationale is that symbols get longer over the iterations, but may also "die", i.e. not make it into the next round. The byte pattern then fully disappears from the symbol table.

Given enough sampling iterations, such valuable byte patterns that "die" would be reconstructed by starting all over again from single bytes, concatenating these, and so on. But that takes more iterations, and since we stop at some point, we may simply have lost long symbols that never had the chance to fight their way back into the population. So, always considering these smaller symbols or smaller symbol extensions was added, and it improves the achieved compression ratio somewhat.

But we were just reading *cur to get that first byte for the extension of symbol code1, and that was plain wrong: at that point cur has already been advanced, so we are reading the wrong byte there.
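
Sticking with the hypothetical sketch above (byteCode() is another stand-in that maps a raw byte to its single-byte symbol code; this is not the actual fsst code or fix), the difference looks roughly like this:

```cpp
// Buggy: by the time this line runs, cur has already been advanced past the
// next symbol, so *cur is no longer the byte at which that symbol starts.
pairCount[code1][byteCode(*cur)]++;

// One way to get it right: read the byte while cur still points at the start
// of the next symbol, before cur is advanced.
uint8_t firstByte = *cur;                  // first byte of the next symbol
uint16_t code2 = findLongestSymbol(cur, end);
cur += symbolLength(code2);
pairCount[code1][code2]++;                 // consider extending code1 with the whole next symbol
pairCount[code1][byteCode(firstByte)]++;   // and with just its first byte
```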

mhx (Author) commented Nov 18, 2022

Thanks for the detailed response!

I was quite certain that I didn't understand what was going on, so my "fix" was really just addressing the symptoms and not the root cause.

Many thanks for your changes (and for the whole project, of course). It turns out that the latest code actually improves the generation of the symbol tables, resulting in a (very slight) improvement in overall compression for my use case (storing metadata lookup tables in a high-compression file system).
