ESQL: TOP support for strings #113183
Adds support to the `TOP` aggregation for `keyword` and `text` field types.
Pinging @elastic/es-analytical-engine (Team:Analytics)
Hi @nik9000, I've created a changelog YAML for you.
/**
 * Components common to BucketedSort implementations.
 */
class BucketedSortCommon implements Releasable {
I yanked this out because it looked like it'd be safe to share at least a little code. I didn't plug this into the `X-BucketedSort` classes yet. But I think it's just about the same thing.

public class TopIpAggregatorFunctionTests extends AbstractTopBytesRefAggregatorFunctionTests {
I yanked the bytes behavior to a common class. It's tiny, but feels like it saves a bit of copy and paste and the compiler will tell you the variant bits.
buildkite run buildkite/docs-build-pr
Looks good!
long endIndex(long rootIndex) {
    return rootIndex + bucketSize;
}

long requiredSize(long rootIndex) {
    return rootIndex + bucketSize;
}
Should we merge those?
oh, huh, that makes sense. Will do.
this.order = order;
this.bucketSize = bucketSize;
heapMode = new BitArray(0, bigArrays);
this.common = new BucketedSortCommon(bigArrays, order, bucketSize);
No inheritance? Shouldn't `final` methods be safe to use?
I sure could have inherited it. I started that way because it felt easier but the ctor with the sub-types and the closing and.... for that, at least, it felt easier to compose.
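The composition approach described above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `SharedSortState` and `LongSortSketch` are hypothetical names, and plain arrays and `AutoCloseable` stand in for `BigArrays` and `Releasable`.

```java
import java.util.Arrays;

// Hypothetical stand-in for the shared component; the real BucketedSortCommon
// holds BigArrays-backed state that must be released.
final class SharedSortState implements AutoCloseable {
    final boolean ascending;
    final int bucketSize;
    private boolean closed;

    SharedSortState(boolean ascending, int bucketSize) {
        this.ascending = ascending;
        this.bucketSize = bucketSize;
    }

    // One past the last slot of the bucket whose first slot is rootIndex.
    long endIndex(long rootIndex) {
        return rootIndex + bucketSize;
    }

    boolean isClosed() {
        return closed;
    }

    @Override
    public void close() {
        closed = true;
    }
}

// A type-specific sort that owns the shared state instead of extending it.
final class LongSortSketch implements AutoCloseable {
    final SharedSortState common;
    private long[] values = new long[0];

    LongSortSketch(boolean ascending, int bucketSize) {
        this.common = new SharedSortState(ascending, bucketSize);
    }

    // Grow the backing array so the whole bucket starting at rootIndex fits.
    void ensureCapacity(long rootIndex) {
        int required = Math.toIntExact(common.endIndex(rootIndex));
        if (values.length < required) {
            values = Arrays.copyOf(values, required);
        }
    }

    int capacity() {
        return values.length;
    }

    // The owner decides when the shared state is released.
    @Override
    public void close() {
        common.close();
    }
}
```

With composition, the constructor and the close path of each typed sort stay explicit, which is the trade-off the reply mentions.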
if (DataType.isString(valueType) == false) {
    continue;
}
suppliers.add(new TestCaseSupplier(List.of(valueType), () -> {
There's a `MultiRowTestCaseSupplier.stringCases()`, maybe use it with the other cases? It has a param for the expected DataType
👍
@@ -100,7 +103,13 @@ $endif$
private final $Name$BucketedSort sort;

private GroupingState(BigArrays bigArrays, int limit, boolean ascending) {
$if(BytesRef)$
    // TODO pass the breaker in from the DriverContext
Yeah. I feel like I'll either have to do it in the next follow up or, well, it'll wait.
It'll be more consistent if we do it, but we do only ever use the request breaker so it is safe enough as is.
values.set(start + i, null);
}

// TODO: Make use of heap structures to faster iterate in order instead of copying and sorting
It's unrelated, but I'm thinking now: Isn't this still nlogn? Would this really be better over an in-place sort?
Saying this because we have this comment everywhere, and I'm not sure if it really can be done. Maybe I'm missing some trick
Iterating the heap is `O(n)`, right? We aren't removing and re-heaping. We're just iterating in order.
Also it'd save a copy.
Iterating, yes. But sorting, it's still nlogn for a heap. Unless our heapify keeps it "sortable". But I'd say that would be slower.
To sort it, the heap tells us the min value. But then, the next candidates are the 2 children. Then, it would be 3 potential candidates (1 child + 2 grand-children), and so on. Worst case, like re-heapifying on every iteration
Sorry, yeah, it's still n log n. I presume it's better because it can rely on the heap property being there already. But I agree, it's probably not worth spending a ton of time on.
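The exchange above separates two things: visiting every slot of a heap is linear, but producing the elements in sorted order takes n pops at O(log n) each. A minimal sketch of that difference, using a plain `int[]` binary min-heap; `HeapOrderSketch` is a hypothetical name and this is not the `BucketedSort` code:

```java
final class HeapOrderSketch {
    // Restore the min-heap property for the subtree rooted at i within heap[0, size).
    static void siftDown(int[] heap, int i, int size) {
        while (true) {
            int left = 2 * i + 1;
            int right = left + 1;
            int smallest = i;
            if (left < size && heap[left] < heap[smallest]) {
                smallest = left;
            }
            if (right < size && heap[right] < heap[smallest]) {
                smallest = right;
            }
            if (smallest == i) {
                return;
            }
            int tmp = heap[i];
            heap[i] = heap[smallest];
            heap[smallest] = tmp;
            i = smallest;
        }
    }

    // Turn an arbitrary array into a min-heap in place.
    static void heapify(int[] values) {
        for (int i = values.length / 2 - 1; i >= 0; i--) {
            siftDown(values, i, values.length);
        }
    }

    // Emit elements in ascending order: n pops, each followed by an O(log n) sift.
    // Works on a copy so the caller's heap is left untouched.
    static int[] drainInOrder(int[] heap) {
        int[] work = heap.clone();
        int[] out = new int[work.length];
        int size = work.length;
        for (int k = 0; k < out.length; k++) {
            out[k] = work[0];
            size--;
            work[0] = work[size];
            siftDown(work, 0, size);
        }
        return out;
    }
}
```

The heap array only guarantees that each parent is no larger than its children, so reading it by index does not generally yield sorted output. The in-order drain is effectively heapsort, which matches the conclusion that the bound stays at n log n.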
Hi @nik9000, I've updated the changelog YAML for you.
@ivancea, I believe I've fixed the things you mentioned. Can you think of anything else that's left for this one?
@@ -382,13 +377,10 @@ private BreakingBytesRefBuilder clearedBytesAt(long index) {

@Override
public final void close() {
    Releasables.close(() -> {
Oh, I thought this was an interesting, safe trick. Some reason to swap to wrap()? For future cases
Mostly paranoia around if one fails. `wrap` will continue even on close.
I feel like I go to a lot of trouble to call these methods to make sure closing happens right. Partly that's paranoia - it can't fail. But partly that's just so readers see it and say "the normal close code" - they see a call to `wrap` and stuff as "normal".
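The concern in this reply is that closing should keep going when one resource fails to close. A minimal sketch of that general pattern in plain Java, where `closeAll` is a hypothetical helper written for illustration and not the Elasticsearch `Releasables` implementation:

```java
import java.util.List;

final class CloseAllSketch {
    // Close every resource, even if an earlier one throws. The first failure is
    // rethrown at the end with any later failures attached as suppressed.
    static void closeAll(List<? extends AutoCloseable> resources) {
        RuntimeException first = null;
        for (AutoCloseable resource : resources) {
            try {
                if (resource != null) {
                    resource.close();
                }
            } catch (Exception e) {
                if (first == null) {
                    first = new RuntimeException("failed to close resource", e);
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }
}
```

Routing every close through one helper like this also gives readers a single, recognizable "normal close" path, which is the readability point made above.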
Adds support to the `TOP` aggregation for `keyword` and `text` field types. Closes elastic#109849
💚 Backport successful