Replies: 1 comment
-
Discussion ready to be ported to issue now as we decided that implementation of custom Daft types will be backed by Arrow extensions. Python types will remain as a purely in-memory abstraction for ease of data processing. |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
[RFC] Arrow/Py/Daft Expression Types
This document is a proposal for extending the Expression typing system in Daft.
RFC Summary
After the implementation of this RFC, Daft's typing system will look like the following:
a) Arrow native types (int64, float64, list[int64] etc)
a) Daft Arrow Extension types (image, audio, video, latlong etc)
Daft only supports serializing Arrow types for writing to disk and long-term storage. For storing Python types, users will first have to marshal data (e.g.
DaftImage.from_pil(df["pil_image"])
) into the appropriate Arrow type before leveraging Daft's tooling for saving the data.RFC Details
Motivation
Expression Types in Daft serve the following purposes:
Daft currently has 2 main types:
Problem
This typing system has the following shortcomings:
Solution
A solution needs to provide:
a) Should have a serializable in-memory format (e.g. Arrow, protobuf, flatbuffer)
b) Ability to define methods/kernels on these complex types for domain-specific functionality (e.g. image resizing)
c) Ability to marshal to/from common Python representations such as numpy, PIL etc for custom processing
d) Visualization logic for complex types during interactive development
User Experience
To a user, these types will be displayed on a DataFrame as follows:
int64
,list[int64]
etcImage
,list[Image]
etcPy[PyClass]
Here is an example of a user loading a DataFrame of videos, pulling out the first frame, and making that into a thumbnail and then saving it:
Notice that with this example:
Not included in the above example, but crucial for DaftTypes is the ability for users to move between DaftTypes and Python types. For example,
df["first_frame"].image.to_numpy()
will return aPY[np.ndarray]
column for users to then use Numpy to manipulate, and going the other direction should also be possible withDaftImage.from_numpy_column(df["some_numpy_column"])
.Beta Was this translation helpful? Give feedback.
All reactions