How does Vortex compare to Lance? #1226

philippemnoel · 2024-11-06T00:52:36Z

Hey everyone! Phil here from @paradedb. We're pretty interested in a Parquet/Arrow successor, and were previously considering Lance for its fast random-access reads. Could you please share, in your own words, how Vortex compares? When should one consider one vs. the other?

gatesn · 2024-11-06T15:41:20Z

Maintainer

Hi Phil, great question. We’ve been following along with Lance for a while now, and there's some neat stuff there.

Of course, please take this answer with the caveat that my understanding of Lance isn’t as deep as my understanding of Vortex!

At a high level, you can think of Lance as covering the architectural "boxes" of Iceberg + Parquet, whereas Vortex is more of a pure replacement for Parquet or ORC.

Lance provides dataset semantics (atomicity, versioning, etc.) as well as additional features to support building vector indices against those datasets.

Internally, the Lance V1 format originally provided fast random access by simply storing uncompressed Arrow. That’s no longer the case, and their V2 format recently added FSST and our FastLanes BitPacking to compress strings and integers, respectively. That said, their compression implementation is less complete: for example, there is no float compression such as Vortex ALP, and Lance doesn't currently support cascading compression codecs.
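
To make "cascading" concrete, here is a minimal Python sketch that chains two lightweight integer codecs: frame-of-reference (subtract the minimum) followed by fixed-width bit-packing. It illustrates the general idea only. It is not the Vortex or Lance implementation and does not use the FastLanes layout; the function names (`for_encode`, `bitpack`, `compress`, `get`) are hypothetical, and it assumes a non-empty list of non-negative-range integers.

```python
def for_encode(values):
    """Frame-of-reference: store the minimum once, keep small residuals."""
    base = min(values)
    return base, [v - base for v in values]


def bitpack(values, width):
    """Pack non-negative ints into `width` bits each, little-endian."""
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    nbytes = (len(values) * width + 7) // 8
    return packed.to_bytes(nbytes, "little")


def compress(values):
    """Cascade: frame-of-reference first, then bit-pack the residuals."""
    base, residuals = for_encode(values)
    width = max(max(residuals).bit_length(), 1)
    return base, width, len(values), bitpack(residuals, width)


def get(base, width, data, i):
    """Random access: read only the bytes that hold element i."""
    start_bit = i * width
    first = start_bit // 8
    last = (start_bit + width + 7) // 8
    chunk = int.from_bytes(data[first:last], "little")
    return base + ((chunk >> (start_bit % 8)) & ((1 << width) - 1))


def decompress(base, width, count, data):
    return [get(base, width, data, i) for i in range(count)]


values = [1000, 1003, 1001, 1007]
base, width, count, data = compress(values)
roundtrip = decompress(base, width, count, data)
```

Because every element has the same bit width, any single value can be decoded without touching the rest of the buffer, which is why this family of codecs can coexist with fast random access.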

Notably, one of the things that makes Vortex special is that it defines its own in-memory format for compressed Arrays (think of it as compressed Arrow), opening up all sorts of potential compute engine optimizations.
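
As a rough illustration of why a compressed in-memory representation can help a compute engine, the sketch below runs aggregates directly on a run-length encoded array without decoding it first. This is a generic example and not Vortex's API or one of its actual encodings; the `RunLengthArray` class and its methods are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunLengthArray:
    """Hypothetical compressed array: one value and one length per run."""
    values: List[int]
    lengths: List[int]

    @classmethod
    def encode(cls, data):
        values, lengths = [], []
        for x in data:
            if values and values[-1] == x:
                lengths[-1] += 1  # extend the current run
            else:
                values.append(x)  # start a new run
                lengths.append(1)
        return cls(values, lengths)

    def decode(self):
        out = []
        for v, n in zip(self.values, self.lengths):
            out.extend([v] * n)
        return out

    def sum(self):
        # One multiply per run instead of one add per element.
        return sum(v * n for v, n in zip(self.values, self.lengths))

    def count_equal(self, needle):
        # Compare once per run, never materialising the decoded data.
        return sum(n for v, n in zip(self.values, self.lengths) if v == needle)


data = [5, 5, 5, 2, 2, 9]
rle = RunLengthArray.encode(data)
total = rle.sum()
fives = rle.count_equal(5)
```

The work done by `sum` and `count_equal` scales with the number of runs rather than the number of rows, which is the kind of optimization that becomes possible when the engine can see the encoding.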

And by drawing Vortex's architectural box this way, Vortex (or its component parts) can be reused more easily in other systems (e.g., we'd like to get Vortex support into Iceberg eventually). In the same way that DataFusion is a useful compute library for many Rust projects (e.g. Influx, and I believe yourselves), Vortex should be seen as a generally useful storage library with reusable components.

For example, vortex-dtype provides a general purpose logical type system with vortex-scalar accompanying it to provide serializable scalars. And we are also currently working on an I/O layer that will optimize itself based on real-time observations of latency and bandwidth, performing equally well for local disk, NAS, and object storage. This should be re-usable for any project looking to read byte ranges from files.

We need to clean up the benchmarks a bit before formally publishing them, but Vortex currently has ~2x the write throughput of Parquet v2 with zstd, ~2-5x the full scan throughput, and ~200x faster random access reads, while typically being approximately the same size (high variance, +/- 50%, but the median is probably ~10% bigger than Parquet). I'm not sure about the latest performance numbers for Lance, but at the very least, I would expect their storage size to be much larger (2-10x for most datasets).

In terms of which to use, I would say that if this is internal to another data-oriented project, then definitely give Vortex a go. I’m sure there will be some bumps, missing docs, and other issues common to early-stage projects, but do let us know! If you are an end user, for example working in Python and performing ML/search-oriented tasks, then Lance is your best bet for now.

1 reply

philippemnoel Nov 6, 2024
Author

Thanks for explaining, this is super useful!
