Daft by default (without Ray) already uses all cores. Today it does require partitioning to do so most effectively, but it will already make efficient use of CPUs for tasks like parallel data loading. We are working on a new execution model that makes this even more efficient without partitioning (it will be released in about a month or so). Daft + Ray also uses all cores, but only if your data is partitioned. The main difference is that this mode also allows for out-of-core and distributed processing. Using Ray introduces some overhead, but the overhead is worth it if the data is large. In summary, because we optimized for distributed computing, Daft today relies on partitioning to efficiently utilize resources on both the local and Ray runners. The new execution engine in the works will make this easier without partitioning.
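A minimal sketch of what the two modes look like in practice (not from the thread; the file path, partition count, and Ray address are placeholders):

```python
import daft

# Ray runner: same DataFrame API, plus out-of-core and distributed execution.
# Set the runner *before* building any DataFrames. Commented out here so the
# script also runs without a Ray cluster; the address is a placeholder.
# daft.context.set_runner_ray(address="ray://head-node:10001")

# Local runner (the default): already multi-threaded, but today it parallelizes
# more effectively when the data is split into multiple partitions.
df = daft.read_parquet("data/*.parquet")  # placeholder path
df = df.repartition(8)                    # illustrative count; tune to cores/data size
print(df.count_rows())
```

On either runner, the repartition step is what lets the work fan out across cores today; the upcoming execution engine mentioned above should make that step unnecessary.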
I am running some tests in Fabric with and without a Ray cluster. Am I correct that default Daft uses a single core (just like pandas) and Daft + Ray uses all available cores by default? Like Spark, do I need to partition the data to utilize the cores effectively? (I tried with and without partitioning and saw a big improvement.)