-
Notifications
You must be signed in to change notification settings - Fork 598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support empty arrays, improve ibis.array() API #9473
base: main
Are you sure you want to change the base?
Conversation
ACTION NEEDED Ibis follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
Picking out the array stuff from ibis-project#8666 Instead of trying to fit the two cases of 0-length and 1+ length arrays into the same op, I split them up into separate ones. By doing this, if we guarantee that all the elements of ops.Array() have the right type before construction, we don't have to do any fancy casting during compilation, all the elements will already have been casted as needed. This allows for the compilation of array<structs> on some sql backends like postgres. If we tried to cast the entire array, you end up with SQL like `cast [..] as STRUCT<...>[]`, which postgres freaks about. Instead, if we cast each individual element, such as `[cast({...} as ROW..., cast({...} as ROW...]`, then this is valid SQL. I added a Length annotation to ops.Array to verify the length is 1+. IDK, this isn't really needed, since if you ever did construct one, then the rlz.highest_precedence_dtype([]) would fail. But that might fail at a later time, and I wanted to raise an error at construction time. But, end users should never be constructing ops.Arrays directly, so this is a guardrail just for us ibis devs. So IDK, we could remove it, but I think it is a nice hint for future us.
@cpcloud OK this is finally looking pretty good and is ready for a review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still too much unnecessary munging here in my opinion.
return ir.null(type) | ||
|
||
values = tuple(values) | ||
if len(values) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use standard Python idioms for this.
if len(values) == 0: | |
if not values: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops sorry will do.
if values is None: | ||
if type is None: | ||
raise ValidationError("If values is None/NULL, type must be provided") | ||
return ir.null(type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this. It's unnecessary to have to support another way to construct a null value.
If someone wants a NULL
with an array type, they should use the null
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a thought here but I better do it at my computer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See these tests in mismo:
I need to construct an array<int>
with a variety of values:
- NULL
- []
- [1,2,3]
There are three different ibis APIs for doing this:
ibis.null(type="array<int>")
: this only works for the NULL case. I would need some if/else branching in my test setup to use this for the NULL case, but would need something else for the other cases.ibis.literal(val, type="array<int>")
. This is currently what I do, and it works for all 3 inputs here, no if/else branching required. But the docstring for literal() says that it is going to get deprecated soon for constructing complex types, and I should use array()/map()/struct() instead.ibis.array()
: If we keep the functionality the way I propose here, then I can doibis.array(val, type="array<int>")
and it works for all three input cases. If we don't allow passing in None, then I would need to do some if/else branching to use ibis.null() in that case.
I think it can be summarized as SOMEONE is gonna need to do this if values is None:
branching, and I would rather have it be down here in one place in library code, than in many places in user code.
return ops.Array(tuple(values)).to_expr() | ||
type = dt.dtype(type) if type is not None else None | ||
if type is not None and not isinstance(type, dt.Array): | ||
raise ValidationError(f"type must be an array, got {type}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just let this fall through and bubble up as it would? This function is already getting stuffed with a bunch of branching, and this doesn't seem to add much value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guarding against the footgun of someone passing in "float" instead of "array" (I literally accidentally did this as I adjusted the tests this last round), which would silently give the wrong result for some inputs. I agree not needed, I can remove, just a nicety
if isinstance(x, (ir.Value, Deferred)): | ||
return x.cast(type) if type is not None else x | ||
else: | ||
return ibis.literal(x, type=type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of literal
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do ibis.array([1, 2], type="array<float>")
, we somehow need to get those ints to floats. My original implementation stored them in the array as ints, and convert to float during compilation. But that required storing the eventual type in ops.Array. so here we pre convert all the elements to floats, and now ops.Array can just use rlz.highest_precedence_dtype(). In order to do that pre casting, we need this literal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure why this is required. Is there some API that fails to execute if you don't do these casts?
What if the elements themselves are arrays? Doesn't this then need to be recursive in the casting/call to _value
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shoot, github escaped the <>
I posted, I meant ibis.array([1, 2], type="array<float>")
not ibis.array([1, 2], type="array")
Now does the concern make sense? Two ways of storing this:
- store the inputs as is, and keep track of the dtype separately
op.Array(values=[1,2], value_type="float")
- cast the inputs early, then we don't need to store the dtype:
op.Array(values=[op.Literal(1, "float"), op.Literal(2, "float")])
I am trying to go for option 2.
I'm still not sure why this is required
Specificaly, if I remove the .literal() here, then we pass a plain python 1
into the op.Array constructor, and then using pattern matching it tries to coerce that to an ibis expression, which yields a op.Literal(1, int)
. But we wanted it to be a float, so we have to do the casting up front here.
What if the elements themselves are arrays
you mean what if x were a python list of ibis values? ie someone did array_of_arrays = ibis.array([[ibis.literal(1)], [ibis.literal(2)]])
? Yes, this would fail, good catch. not sure what to do about it yet....
I'm gonna be out of office for this week. So I hope when I get back I see that y'all will have figured it all out and everything is merged and happy 😁 |
Workaround for pola-rs/polars#17294 pl.concat_list(Iterable[T]) results in pl.List[T], EXCEPT when T is a pl.List, in which case pl.concat_list(Iterable[pl.List[T]]) results in pl.List[T]. If polars ever supports a more consistent array constructor, we should switch to that. Found this when working on #9473
@cpcloud whenever you get the chance, I responded to each of the individual comments, curious what you think. I agree it's sorta mungy, but I can't think of a cleaner way of supporting the API I want. I don't really want to cut anything from the API, but are you willing to? Eg not support ibis.array(None)? |
Perhaps the larger question here is the tension between
As is reasonable with our roles in ibis (user vs maintainer) It sounds like I care more about the 1st one, you about the 2nd haha 😂 One argument for favoring flexibility is consistency: I think we can both agree supporting list[ir.Value] is needed for ibis.array, .map,and .struct (they already do). but this sort of opens up Pandora's box: now these nested constructors are the only ibis constructor API that accepts ibis values. I want to bring the rest of the constructors in line with this. I think stopping with the scalar constructors would make sense, but also making ibis.memtable more generous with accepting ibis values could be a nice DX improvement, as I advocate for in some other issue. I'm curious what @kszucs @gforsyth @jcrist think about this tension. Should we make things easier for users, or keep our API maintainable? |
Part of the reason that pandas API is so large and complex to use and maintain (in my opinion) is that it favors flexibility over consistency. I think we should err on the side of consistency as a project, and provide flexibility where there's not a ton of additional complexity. That means that sometimes the complexity of handling multiple input types gets pushed to the user and I would argue that that is the best place to handle non-trivial input normalization since the application has much better information about how it wants those inputs to look. I am pretty against opening up all the For nested types and tables, things get really complex as now you have to decide whether to accept mixed Ibis-expression-Python-value objects, and are we going to validate all of that? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please leave all the additional complexity and decisions out here beyond addressing the empty array case and allow the user to specify a type?
if type is not None and not isinstance(type, dt.Array): | ||
raise ValidationError(f"type must be an array, got {type}") | ||
|
||
if isinstance(values, ir.Value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we're going to accept inputs like this. Can you get rid of this code path?
if isinstance(x, (ir.Value, Deferred)): | ||
return x.cast(type) if type is not None else x | ||
else: | ||
return ibis.literal(x, type=type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure why this is required. Is there some API that fails to execute if you don't do these casts?
What if the elements themselves are arrays? Doesn't this then need to be recursive in the casting/call to _value
?
Yeah I think this is probably a good idea. I'll trim it down as I can. To confirm, I am assuming that |
No, I'm saying we shouldn't support this. It opens the door to mixing ibis expressions and python types, and we'd have to unify both and it's a lot of additional complexity. |
I currently have lots of code that looks like |
@NickCrews I am still having a lot of trouble following everything here. I suspect that the PR is still trying to do too much. Can you provide a minimum list of cases showing
and avoid speculating on specific solutions on how to get the desired behavior? This problem space (complex type casting and inference) is really hairy, and it's very very difficult to review related PRs if they are trying to address a bunch of things at once. |
Yes, I agree it is really hairy, and I think that would help. Maybe I start with a PR that just adds or changes tests to precisely show what the current behavior is |
@NickCrews what's the status of this one? Is the plan to close it and start a new separate PR with a more targeted scope? |
Picking out the array stuff from #8666.
I think this is a better approach than #9458
Instead of trying to fit the two cases of 0-length and 1+ length arrays into the same op, I split them up into separate ones.
By doing this, if we guarantee that all the elements of ops.Array() have the right type before construction,
we don't have to store an explicit dtype on ops.Array, and instead can just use rlz.highest_precedence_dtype(). And we we don't have to do fancy casting during compilation, all the elements will already have been casted as needed.
This allows for the compilation of array on some sql backends like postgres.
If we tried to cast the entire array, you end up with SQL like
cast [..] as STRUCT<...>[]
,which postgres freaks about.
Instead, if we cast each individual element,
such as
[cast({...} as ROW..., cast({...} as ROW...]
, then this is valid SQL.I added a Length annotation to ops.Array to verify the length is 1+. IDK, this isn't really needed, since if you ever did construct one, then the rlz.highest_precedence_dtype([]) would fail. But that might fail at a later time,
and I wanted to raise an error at construction time. But, end users should never be constructing ops.Arrays directly,
so this is a guardrail just for us ibis devs. So IDK, we could remove it, but I think it is a nice hint for future us.