Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: TOP support for strings #113183

Merged
merged 14 commits into from
Sep 23, 2024
Merged

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Sep 19, 2024

Adds support to the TOP aggregation for keyword and text field
types.

Closes #109849

Adds support to the `TOP` aggregation for `keyword` and `text` field
types.
Copy link
Contributor

Documentation preview:

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Sep 19, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @nik9000, I've created a changelog YAML for you.

/**
* Components common to BucketedSort implementations.
*/
class BucketedSortCommon implements Releasable {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I yanked this out because it looked like it'd be safe to share at least a little code. I didn't plug this into the X-BucketedSort classes yet. But I think it's just about the same thing.


public class TopIpAggregatorFunctionTests extends AbstractTopBytesRefAggregatorFunctionTests {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I yanked the bytes behavior to a common class. It's tiny, but feels like it saves a bit of copy and paste and the compiler will tell you the variant bits.

@nik9000
Copy link
Member Author

nik9000 commented Sep 19, 2024

buildkite run buildkite/docs-build-pr

Copy link
Contributor

@ivancea ivancea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Comment on lines 47 to 53
long endIndex(long rootIndex) {
return rootIndex + bucketSize;
}

long requiredSize(long rootIndex) {
return rootIndex + bucketSize;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we merge those?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, huh, that makes sense. Will do.

this.order = order;
this.bucketSize = bucketSize;
heapMode = new BitArray(0, bigArrays);
this.common = new BucketedSortCommon(bigArrays, order, bucketSize);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No inheritance? Shouldn't final methods be safe to use?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sure could have inherited it. I started that way because it felt easier but the ctor with the sub-types and the closing and.... for that, at least, it felt easier to compose.

if (DataType.isString(valueType) == false) {
continue;
}
suppliers.add(new TestCaseSupplier(List.of(valueType), () -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a MultiRowTestCaseSupplier.stringCases(), maybe use it with the other cases? It has a param for the expected DataType

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -100,7 +103,13 @@ $endif$
private final $Name$BucketedSort sort;

private GroupingState(BigArrays bigArrays, int limit, boolean ascending) {
$if(BytesRef)$
// TODO pass the breaker in from the DriverContext
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️✋👮 Is this TODO for this PR or for later? If for later, I foresee it will be there for a long time :hehe:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I feel like I'll either have to do it in the next follow up or, well, it'll wait.

It'll be more consistent if we do it, but we do only ever use the request breaker so it is safe enough as it.

values.set(start + i, null);
}

// TODO: Make use of heap structures to faster iterate in order instead of copying and sorting
Copy link
Contributor

@ivancea ivancea Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unrelated, but I'm thinking now: Isn't this still nlogn? Would this really be better over an in-place sort?

Saying this because we have this comment everywhere, and I'm not sure if it really can be done. Maybe I'm missing some trick

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iterating the heap is O(n), right? We aren't removing and re-heaping. We're just iterating in order.

Also it'd save a copy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iterating, yes. But sorting, it's still nlogn for a heap. Unless our heapify keeps it "sortable". But I'd say that would be slower.
To sort it, the heap tells us the min value. But then, the next candidates are the 2 children. Then, it would be 3 potential candidates (1 child + 2 grand-children), and so on. Worst case, like re-heapifying on every iteration

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, yeah, it's still n log n. I presume it's better because it can rely on the heap property being there already. But I agree, it's probably not worth a ton of time on.

@elasticsearchmachine
Copy link
Collaborator

Hi @nik9000, I've updated the changelog YAML for you.

@nik9000
Copy link
Member Author

nik9000 commented Sep 19, 2024

@ivancea, I believe I've fixed the things you mentioned. Can you think of anything else that's left for this one?

@@ -382,13 +377,10 @@ private BreakingBytesRefBuilder clearedBytesAt(long index) {

@Override
public final void close() {
Releasables.close(() -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I thought this was an interesting, safe trick. Some reason to swap to wrap()? For future cases

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly paranoia around if one fails. wrap will continue even on close.

I feel like i go to a lot of trouble to call these methods to make sure closing happens right. Partly that's paranoia - it can't fail. But partly that just so readers see it and say "the normal close code" - they see a call to wrap and stuff as "normal"

@nik9000 nik9000 added auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) auto-backport-and-merge Automatically create backport pull requests and merge when ready labels Sep 23, 2024
@elasticsearchmachine elasticsearchmachine merged commit 58021c3 into elastic:main Sep 23, 2024
15 checks passed
@nik9000 nik9000 deleted the esql_top_bytes branch September 23, 2024 17:00
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Sep 23, 2024
Adds support to the `TOP` aggregation for `keyword` and `text` field
types.

Closes elastic#109849
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

elasticsearchmachine pushed a commit that referenced this pull request Sep 25, 2024
Adds support to the `TOP` aggregation for `keyword` and `text` field
types.

Closes #109849
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-backport-and-merge Automatically create backport pull requests and merge when ready auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.16.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: Support non-numeric comparable types in aggregations: TOP, MAX, MIN...
3 participants