FastLanes


We think it is time for a new data format that goes beyond Parquet (and ORC). The existing data formats have been very successful and form the basis of data lakes and lakehouse architectures. Yet they are 15 years old and, for various reasons, very hard to evolve. There are two main reasons to go beyond them, and these form the motivation behind FastLanes:

  1. it is possible to provide significantly better compression and better access speeds on current workloads.
  2. new workloads have emerged, particularly data engineering pipelines for machine learning (ML).

In Data Lakes, database design plays a much smaller role, as there are no database administrators and applications often emerge only after data gets collected. This yields many situations where data ends up being stored in sub-optimal formats. A simple example is using string datatypes for data that is actually numeric or temporal (and the majority of data is string); a more complex example is redundancy in the data, e.g., due to denormalization. We think that compression ratio is one area where file formats can be improved. Furthermore, improved access speed can be obtained by letting data consumers operate on (partly) compressed data, which means the API of the data format needs to be more flexible.

ML workloads often have very wide tables with many features. These can sometimes be dense high-dimensional floating-point vectors, and at other times be very sparse, such that storing features in maps becomes attractive. Wide and sparse columns using maps and lists are becoming more common. We also think the established (Data Lake) and new (ML) workloads can leverage modern hardware better. On the CPU side, it is critical to use SIMD instructions effectively. ML pipelines very often run on GPUs, which have less memory and much less cache memory than CPUs, and GPU cores are not efficient on complex, branchy codecs such as general-purpose decompressors (LZ4, zstd). Note that GPUs and SIMD have a lot in common: both excel when there is (i) a lot of data parallelism and (ii) an absence of branching control flow.

Some key ideas in FastLanes:

  1. a highly data-parallel layout design: on Intel CPUs, FastLanes can bit-unpack 60 values per CPU cycle (per core). A scalar sketch of this idea follows the list.
  2. separation between the logical table format the application expects and the physical format in which row-groups get stored.
  3. cascading compression "expressions" that achieve very high compression ratios without having to use general-purpose codecs.
  4. specific compression schemes for nested data (lists, structs, maps).
  5. efficient data-parallel predicate pushdown.
  6. read support for compressed vectors (batches), such as FOR-vectors, RLE-vectors, FSST-vectors and DICT-vectors.

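To make the data-parallel layout idea (key idea 1 above) concrete, here is a minimal, self-contained C++ sketch. It is not the FastLanes API or its actual on-disk layout; the lane count, the interleaved word order, and the `pack`/`unpack` helpers are illustrative assumptions. What it demonstrates is that when every lane sits at the same bit position for a given slot, the inner loop over lanes is uniform and branch-free, so a compiler can auto-vectorize it with SIMD (and a GPU can run it one lane per thread).

```cpp
// Illustrative sketch only: not the FastLanes format or API.
// 1024 values are split round-robin over 16 lanes (value i -> lane i % 16).
// Each lane packs its 64 values LSB-first with bit width W; the lanes'
// 64-bit words are interleaved so that, for a given slot, every lane uses
// the same word index and shift. The inner loop over lanes is therefore
// uniform and branch-free, which is what enables SIMD auto-vectorization.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr unsigned kLanes  = 16;              // hypothetical lane count
constexpr unsigned kValues = 1024;            // values per block
constexpr unsigned kSlots  = kValues / kLanes;

// Pack 1024 values of at most W bits each (1 <= W <= 64) into W * kLanes words.
void pack(const uint64_t* in, uint64_t* out, unsigned W) {
    std::fill(out, out + W * kLanes, 0);
    for (unsigned slot = 0; slot < kSlots; ++slot) {
        const unsigned bit   = slot * W;
        const unsigned word  = bit / 64;      // same for all lanes at this slot
        const unsigned shift = bit % 64;
        for (unsigned lane = 0; lane < kLanes; ++lane) {
            const uint64_t v = in[slot * kLanes + lane];
            out[word * kLanes + lane] |= v << shift;
            if (shift + W > 64)               // value spills into the next word
                out[(word + 1) * kLanes + lane] |= v >> (64 - shift);
        }
    }
}

// Unpack: the loop over lanes has no data-dependent branches; the spill
// check depends only on the slot, not the lane, so compilers can vectorize it.
void unpack(const uint64_t* in, uint64_t* out, unsigned W) {
    const uint64_t mask = (W == 64) ? ~0ULL : ((1ULL << W) - 1);
    for (unsigned slot = 0; slot < kSlots; ++slot) {
        const unsigned bit   = slot * W;
        const unsigned word  = bit / 64;
        const unsigned shift = bit % 64;
        for (unsigned lane = 0; lane < kLanes; ++lane) {
            uint64_t v = in[word * kLanes + lane] >> shift;
            if (shift + W > 64)
                v |= in[(word + 1) * kLanes + lane] << (64 - shift);
            out[slot * kLanes + lane] = v & mask;
        }
    }
}

int main() {
    const unsigned W = 3;                     // example bit width
    std::vector<uint64_t> values(kValues), packed(W * kLanes), decoded(kValues);
    for (unsigned i = 0; i < kValues; ++i) values[i] = i & ((1u << W) - 1);
    pack(values.data(), packed.data(), W);
    unpack(packed.data(), decoded.data(), W);
    std::printf("round-trip %s\n", values == decoded ? "ok" : "FAILED");
    return 0;
}
```

The layout actually used by FastLanes is described in the publication listed below.
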
FastLanes is still in its early days, but we think we have an excellent foundation. It is open source, and we would like to create a vibrant community around it.

Join Us on Discord

Join our Discord server!


Publications:

  1. The FastLanes Compression Layout: Decoding >100 Billion Integers per Second with Scalar Code