feat: support empty arrays, improve ibis.array() API #9473

NickCrews · 2024-06-29T02:59:33Z

Picking out the array stuff from #8666.
I think this is a better approach than #9458

Instead of trying to fit the two cases of 0-length and 1+ length arrays into the same op, I split them up into separate ones.
By doing this, if we guarantee that all the elements of ops.Array() have the right type before construction,
we don't have to store an explicit dtype on ops.Array, and instead can just use rlz.highest_precedence_dtype(). And we don't have to do fancy casting during compilation: all the elements will already have been cast as needed.

This allows for the compilation of array<structs> on some sql backends like postgres.
If we tried to cast the entire array, you end up with SQL like `cast [..] as STRUCT<...>[]`,
which postgres freaks out about.
Instead, if we cast each individual element,
such as `[cast({...} as ROW...), cast({...} as ROW...)]`, then this is valid SQL.

I added a Length annotation to ops.Array to verify the length is 1+. IDK, this isn't really needed, since if you ever did construct one, then the rlz.highest_precedence_dtype([]) would fail. But that might fail at a later time,
and I wanted to raise an error at construction time. But, end users should never be constructing ops.Arrays directly,
so this is a guardrail just for us ibis devs. So IDK, we could remove it, but I think it is a nice hint for future us.
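
For readers unfamiliar with the tradeoff, here is a minimal sketch of a construction-time length check. It is a toy model in plain Python, not the actual ops.Array or the Length annotation; the class name `NonEmptyArray` and the use of `ValueError` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NonEmptyArray:
    """Hypothetical stand-in for an op that only models 1+ length arrays."""

    values: tuple

    def __post_init__(self):
        # Fail at construction time instead of later, when a dtype
        # would have to be inferred from zero elements.
        if not self.values:
            raise ValueError("NonEmptyArray requires at least one element")


ok = NonEmptyArray((1, 2, 3))
```

Constructing `NonEmptyArray(())` raises immediately, which is the "nice hint for future us" behavior described above.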

github-actions · 2024-06-29T02:59:55Z

ACTION NEEDED

Ibis follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message.

Please update your PR title and description to match the specification.

Picking out the array stuff from ibis-project#8666 Instead of trying to fit the two cases of 0-length and 1+ length arrays into the same op, I split them up into separate ones. By doing this, if we guarantee that all the elements of ops.Array() have the right type before construction, we don't have to do any fancy casting during compilation, all the elements will already have been casted as needed. This allows for the compilation of array<structs> on some sql backends like postgres. If we tried to cast the entire array, you end up with SQL like `cast [..] as STRUCT<...>[]`, which postgres freaks about. Instead, if we cast each individual element, such as `[cast({...} as ROW..., cast({...} as ROW...]`, then this is valid SQL. I added a Length annotation to ops.Array to verify the length is 1+. IDK, this isn't really needed, since if you ever did construct one, then the rlz.highest_precedence_dtype([]) would fail. But that might fail at a later time, and I wanted to raise an error at construction time. But, end users should never be constructing ops.Arrays directly, so this is a guardrail just for us ibis devs. So IDK, we could remove it, but I think it is a nice hint for future us.

NickCrews · 2024-06-29T19:04:22Z

@cpcloud OK this is finally looking pretty good and is ready for a review!

cpcloud

There's still too much unnecessary munging here in my opinion.

cpcloud · 2024-06-30T12:55:22Z

ibis/expr/types/arrays.py

+        return ir.null(type)
+
+    values = tuple(values)
+    if len(values) == 0:


Please use standard Python idioms for this.

Suggested change

-    if len(values) == 0:
+    if not values:

Oops sorry will do.
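
A small illustration of the suggested idiom, as a sketch only. The function name `describe` is hypothetical and not part of the PR; the point is that once `values` has been materialized into a tuple, `not values` is equivalent to `len(values) == 0`.

```python
def describe(values):
    # Materialize first so one-shot iterables (e.g. generators) also work.
    values = tuple(values)
    if not values:  # idiomatic emptiness check for a tuple
        return "empty"
    return f"{len(values)} element(s)"


empty_result = describe([])
nonempty_result = describe(iter([1, 2]))
```

Note that this only holds after the `tuple(values)` conversion; a `None` input has to be handled before that line, as the diff does.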

cpcloud · 2024-06-30T12:56:35Z

ibis/expr/types/arrays.py

+    if values is None:
+        if type is None:
+            raise ValidationError("If values is None/NULL, type must be provided")
+        return ir.null(type)


I don't like this. It's unnecessary to have to support another way to construct a null value.

If someone wants a NULL with an array type, they should use the null function.

I have a thought here but I better do it at my computer

See these tests in mismo:

test_jaccard

test_array_min

test_array_mean_median

I need to construct an array<int> with a variety of values:

NULL

[]

[1,2,3]

There are three different ibis APIs for doing this:

ibis.null(type="array<int>"): this only works for the NULL case. I would need some if/else branching in my test setup to use this for the NULL case, but would need something else for the other cases.

ibis.literal(val, type="array<int>"). This is currently what I do, and it works for all 3 inputs here, no if/else branching required. But the docstring for literal() says that it is going to get deprecated soon for constructing complex types, and I should use array()/map()/struct() instead.

ibis.array(): If we keep the functionality the way I propose here, then I can do ibis.array(val, type="array<int>") and it works for all three input cases. If we don't allow passing in None, then I would need to do some if/else branching to use ibis.null() in that case.

I think it can be summarized as SOMEONE is gonna need to do this if values is None: branching, and I would rather have it be down here in one place in library code, than in many places in user code.
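
To make the "branch in one place" argument concrete, here is a toy sketch in plain Python. It does not call ibis; `build_array` and the tagged tuples it returns are hypothetical stand-ins for whatever the library would construct.

```python
def build_array(values, type):
    """Toy stand-in (not ibis): normalize None, [], or [1, 2, 3] in one place."""
    if values is None:
        # The NULL case: a typed null rather than an array.
        return ("null", type)
    values = tuple(values)
    if not values:
        # The 0-length case relies on the explicit type,
        # since there are no elements to infer it from.
        return ("empty_array", type)
    return ("array", type, values)


# Test setup can then loop over all three inputs without branching.
cases = [None, [], [1, 2, 3]]
results = [build_array(val, "array<int>") for val in cases]
```

If the `values is None` branch is dropped from the library function, the same `if`/`else` has to appear at every call site that may receive a NULL.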

cpcloud · 2024-06-30T12:57:34Z

ibis/expr/types/arrays.py

-    return ops.Array(tuple(values)).to_expr()
+    type = dt.dtype(type) if type is not None else None
+    if type is not None and not isinstance(type, dt.Array):
+        raise ValidationError(f"type must be an array, got {type}")


Can we just let this fall through and bubble up as it would? This function is already getting stuffed with a bunch of branching, and this doesn't seem to add much value.

I'm guarding against the footgun of someone passing in "float" instead of "array<float>" (I literally accidentally did this as I adjusted the tests this last round), which would silently give the wrong result for some inputs. I agree it's not needed and I can remove it; it's just a nicety.

cpcloud · 2024-06-30T12:58:25Z

ibis/expr/types/arrays.py

+    if isinstance(x, (ir.Value, Deferred)):
+        return x.cast(type) if type is not None else x
+    else:
+        return ibis.literal(x, type=type)


What is the purpose of literal here?

If you do ibis.array([1, 2], type="array<float>"), we somehow need to get those ints to floats. My original implementation stored them in the array as ints, and converted them to float during compilation. But that required storing the eventual type in ops.Array. So here we pre-convert all the elements to floats, and now ops.Array can just use rlz.highest_precedence_dtype(). In order to do that pre-casting, we need this literal.

I'm still not sure why this is required. Is there some API that fails to execute if you don't do these casts?

What if the elements themselves are arrays? Doesn't this then need to be recursive in the casting/call to _value?

Shoot, github escaped the <> I posted, I meant ibis.array([1, 2], type="array<float>") not ibis.array([1, 2], type="array")

Now does the concern make sense? Two ways of storing this:

store the inputs as is, and keep track of the dtype separately op.Array(values=[1,2], value_type="float")

cast the inputs early, then we don't need to store the dtype: op.Array(values=[op.Literal(1, "float"), op.Literal(2, "float")])

I am trying to go for option 2.
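
The two options can be sketched with plain dataclasses. This is a toy model, not ibis internals: the class names are hypothetical, dtypes are plain strings, and requiring all element dtypes to be identical is a simplification standing in for rlz.highest_precedence_dtype().

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: object
    dtype: str


@dataclass(frozen=True)
class ArrayWithDtype:
    """Option 1: store raw inputs and track the element dtype separately."""

    values: tuple
    value_type: str


@dataclass(frozen=True)
class ArrayOfTyped:
    """Option 2: elements are cast up front, so the dtype is derived."""

    values: tuple  # tuple of Literal

    @property
    def value_type(self):
        dtypes = {v.dtype for v in self.values}
        # Simplification: demand exactly one dtype. With zero elements
        # there is nothing to derive from, so this fails.
        if len(dtypes) != 1:
            raise ValueError(f"cannot derive a single dtype from {dtypes}")
        return next(iter(dtypes))


opt1 = ArrayWithDtype(values=(1, 2), value_type="float")
opt2 = ArrayOfTyped(values=(Literal(1.0, "float"), Literal(2.0, "float")))
```

In this sketch option 2 cannot represent a 0-length array, because the derived dtype has nothing to work from. That mirrors why the PR splits empty arrays into a separate op.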

I'm still not sure why this is required

Specifically, if I remove the .literal() here, then we pass a plain python 1 into the op.Array constructor, and then using pattern matching it tries to coerce that to an ibis expression, which yields an op.Literal(1, int). But we wanted it to be a float, so we have to do the casting up front here.

What if the elements themselves are arrays

you mean what if x were a python list of ibis values? ie someone did array_of_arrays = ibis.array([[ibis.literal(1)], [ibis.literal(2)]])? Yes, this would fail, good catch. not sure what to do about it yet....

NickCrews · 2024-07-01T05:49:51Z

I'm gonna be out of office for this week. So I hope when I get back I see that y'all will have figured it all out and everything is merged and happy 😁

Workaround for pola-rs/polars#17294 pl.concat_list(Iterable[T]) results in pl.List[T], EXCEPT when T is a pl.List, in which case pl.concat_list(Iterable[pl.List[T]]) results in pl.List[T]. If polars ever supports a more consistent array constructor, we should switch to that. Found this when working on #9473

NickCrews · 2024-07-10T15:03:05Z

@cpcloud whenever you get the chance, I responded to each of the individual comments, curious what you think. I agree it's sorta mungy, but I can't think of a cleaner way of supporting the API I want. I don't really want to cut anything from the API, but are you willing to? Eg not support ibis.array(None)?

NickCrews · 2024-07-10T15:20:05Z

Perhaps the larger question here is the tension between

provide a high level API that is very flexible in what it takes in (None, python lists of python vals, python lists of ibis exprs, ibis array exprs)
don't over guarantee anything that then we are going to have to continue supporting.

As is reasonable with our roles in ibis (user vs maintainer), it sounds like I care more about the 1st one, you about the 2nd haha 😂

One argument for favoring flexibility is consistency: I think we can both agree supporting list[ir.Value] is needed for ibis.array, .map, and .struct (they already do). But this sort of opens up Pandora's box: now these nested constructors are the only ibis constructor API that accepts ibis values. I want to bring the rest of the constructors in line with this. I think stopping with the scalar constructors would make sense, but also making ibis.memtable more generous with accepting ibis values could be a nice DX improvement, as I advocate for in some other issue.

I'm curious what @kszucs @gforsyth @jcrist think about this tension. Should we make things easier for users, or keep our API maintainable?

cpcloud · 2024-07-15T16:37:27Z

Part of the reason that pandas API is so large and complex to use and maintain (in my opinion) is that it favors flexibility over consistency. I think we should err on the side of consistency as a project, and provide flexibility where there's not a ton of additional complexity.

That means that sometimes the complexity of handling multiple input types gets pushed to the user and I would argue that that is the best place to handle non-trivial input normalization since the application has much better information about how it wants those inputs to look.

I am pretty against opening up all the ibis.* constructor APIs to accept other Ibis expressions as it adds a bunch of requirements and decision making around what's valid and how we validate the input beyond what we're already doing.

For nested types and tables, things get really complex as now you have to decide whether to accept mixed Ibis-expression-Python-value objects, and are we going to validate all of that?

cpcloud

Can we please leave all the additional complexity and decisions out here beyond addressing the empty array case and allow the user to specify a type?

cpcloud · 2024-07-15T16:39:37Z

ibis/expr/types/arrays.py

+    if type is not None and not isinstance(type, dt.Array):
+        raise ValidationError(f"type must be an array, got {type}")
+
+    if isinstance(values, ir.Value):


I don't think we're going to accept inputs like this. Can you get rid of this code path?

cpcloud · 2024-07-15T16:42:15Z

ibis/expr/types/arrays.py

+    if isinstance(x, (ir.Value, Deferred)):
+        return x.cast(type) if type is not None else x
+    else:
+        return ibis.literal(x, type=type)


I'm still not sure why this is required. Is there some API that fails to execute if you don't do these casts?

What if the elements themselves are arrays? Doesn't this then need to be recursive in the casting/call to _value?

NickCrews · 2024-07-16T01:57:33Z

Can we please leave all the additional complexity and decisions out here beyond addressing the empty array case and allow the user to specify a type?

Yeah I think this is probably a good idea. I'll trim it down as I can.

To confirm, I am assuming that array_of_arrays = ibis.array([[ibis.literal(1)], [ibis.literal(2)]]) is a valid input. You also want to be able to support this, right?

cpcloud · 2024-07-16T12:49:47Z

To confirm, I am assuming that array_of_arrays = ibis.array([[ibis.literal(1)], [ibis.literal(2)]]) is a valid input. You also want to be able to support this, right?

No, I'm saying we shouldn't support this. It opens the door to mixing ibis expressions and python types, and we'd have to unify both and it's a lot of additional complexity.

NickCrews · 2024-07-16T17:21:54Z

I currently have lots of code that looks like t.mutate(emails=ibis.array([t.email0, t.email1, t.email2])). Am I missing something here? How can I accomplish this goal?

cpcloud · 2024-07-23T13:07:35Z

@NickCrews I am still having a lot of trouble following everything here. I suspect that the PR is still trying to do too much.

Can you provide a minimum list of cases showing

the current behavior
the desired behavior

and avoid speculating on specific solutions on how to get the desired behavior? This problem space (complex type casting and inference) is really hairy, and it's very very difficult to review related PRs if they are trying to address a bunch of things at once.

NickCrews · 2024-07-23T19:40:39Z

Yes, I agree it is really hairy, and I think that would help. Maybe I start with a PR that just adds or changes tests to precisely show what the current behavior is.

ncclementi · 2024-08-28T14:47:14Z

@NickCrews what's the status of this one? Is the plan to close it and start a new separate PR with a more targeted scope?

NickCrews force-pushed the array-api2 branch from a8e6dd8 to cbd2510 Compare June 29, 2024 17:48

NickCrews force-pushed the array-api2 branch from cbd2510 to a655252 Compare June 29, 2024 18:20

NickCrews changed the title ~~WIP: try ops.EmptyArray~~ feat: support empty arrays, improve ibis.array() API Jun 29, 2024

NickCrews mentioned this pull request Jun 29, 2024

feat(api): improve ibis.array() #9458

Closed

NickCrews marked this pull request as ready for review June 29, 2024 18:45

Merge branch 'main' into array-api2

1162137

NickCrews enabled auto-merge (rebase) June 29, 2024 20:23

cpcloud requested changes Jun 30, 2024


This was referenced Jun 30, 2024

separate pl.list() and pl.concat_list pola-rs/polars#17307

Closed

Split off list constructor logic from pl.concat_list into pl.list pola-rs/polars#8510

Open

fix(polars): add workaround to compile Array<Array> correctly #9484

Merged

cpcloud requested changes Jul 15, 2024


NickCrews mentioned this pull request Jul 19, 2024

feat: Support type arg to ibis.array and ibis.map #8289

Open

1 task

cpcloud disabled auto-merge July 23, 2024 13:02
