Collect and display execution metadata for ES|QL cross cluster searches #112595

Merged
merged 70 commits into from
Sep 30, 2024
Changes from 28 commits
Commits
6ebc396
Collect and display execution metadata for ES|QL cross cluster searches
quux00 Sep 6, 2024
512fdec
Slight improvements to EsqlExecutionInfo
quux00 Sep 6, 2024
39428fa
Removed changes to EsqlQueryResponse, spending too long getting the E…
quux00 Sep 6, 2024
892cd99
Starting threading EsqlExecutionInfo into PlanExecutor and EsqlSesssi…
quux00 Sep 6, 2024
fb10109
Have the initial swap-in of cluster info into EsqlExecutionInfo in Es…
quux00 Sep 6, 2024
797cc8c
Added EsqlExecutionInfo to IndexResolver. Enrich pathway passes in nu…
quux00 Sep 6, 2024
fa7bbb0
ComputeListener updated to the version that has proper remote/local s…
quux00 Sep 9, 2024
71a33ed
Added new tests to ComputeListenerTests
quux00 Sep 9, 2024
ab347a6
Added ExecutionInfo to Result obj (used in ComputeService/EsqlSession)
quux00 Sep 9, 2024
c39111b
update ExecutionInfo with shard counts in ComputeService.lookupDataNodes
quux00 Sep 9, 2024
1a3a7f8
Migrated CrossClustersQueryIT to new setup format, but can't add exec…
quux00 Sep 9, 2024
544aaeb
Added CountDown to acquireComputeForDataNodes - that allows SUCCESSFU…
quux00 Sep 10, 2024
ca2de85
Fixed failing REST and qa tests to account for the new 'took' time in…
quux00 Sep 10, 2024
f839132
Fixed bug where CountDown in ComputeService can be initialized with 0…
quux00 Sep 10, 2024
3f3139b
More qa and bwc test fixes based on what failed in latest ci build
quux00 Sep 10, 2024
e090437
Next round of qa and bwc test fixes based on what failed in latest ci…
quux00 Sep 11, 2024
b2b2542
Fix failing test in EsqlSecurityIT
quux00 Sep 11, 2024
5e7876e
Added _cluster/details to the EsqlQueryResponse XContent for cross-cl…
quux00 Sep 11, 2024
3e16fbb
Fixed test failure in esql/ccq/MultiClustersIT
quux00 Sep 11, 2024
ed5b9db
Updated end user docs with info about top level took time and _cluste…
quux00 Sep 11, 2024
661a243
Added EsqlExecutionInfo to equals and hashCode method of EsqlQueryRes…
quux00 Sep 12, 2024
9535bdd
Removed skip_unavailable=true filter in IndexResolver - all clusters a…
quux00 Sep 12, 2024
699b16a
Moved isRemoteUnavailableException to ExceptionsHelper
quux00 Sep 12, 2024
228eed2
Added equals and hashCode to EsqlExecutionInfo.Cluster object.
quux00 Sep 13, 2024
fa9c7c4
Minor tweak to esql-across-clusters.asciidoc
quux00 Sep 13, 2024
c51719e
Improvements to esql-across-clusters.asciidoc
quux00 Sep 13, 2024
d365c37
Update docs/changelog/112595.yaml
quux00 Sep 13, 2024
8688dbf
Added questions about took time headers to EsqlResponseListener - pos…
quux00 Sep 13, 2024
5b93774
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 16, 2024
449e1a7
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 17, 2024
0083ae7
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 18, 2024
5462d6b
PR feedback with focus on end user docs fixes, removing some out-of-d…
quux00 Sep 18, 2024
5f27325
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 18, 2024
9e77c28
Additional PR feedback changes - test adjustments, remove 'set' and '…
quux00 Sep 19, 2024
fd6d3bf
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 19, 2024
6e87174
Now tracking took in nanos, not millis (but XContent still displays i…
quux00 Sep 19, 2024
b474fc7
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 19, 2024
1649962
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 19, 2024
20c9356
Changed ComputeResponse to de/serialize with read/writeOptionalTimeValue
quux00 Sep 19, 2024
e6aa92a
EsqlResponseListener now preferentially uses the took time in the Esq…
quux00 Sep 19, 2024
59d1480
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 20, 2024
8bb1b7f
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 20, 2024
6323cdf
Modified esql-across-clusters to run the new queries I added; but JSO…
quux00 Sep 20, 2024
4875a66
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 20, 2024
24e0c02
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 20, 2024
940ef22
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 20, 2024
617cbec
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 23, 2024
c45d181
Removed code that lists fully resolved indices in the _clusters/detai…
quux00 Sep 23, 2024
13a34de
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 23, 2024
9ac5746
Code cleanup - remove commented out code in IndexResolverTests
quux00 Sep 23, 2024
3afb7a1
PR feedback: Moved logic for unavailable/missing clusters to EsqlSession
quux00 Sep 23, 2024
b118406
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 24, 2024
fc53eb7
PR feedback: I removed acquireCCSCompute and acquireComputeForDatanod…
quux00 Sep 25, 2024
d50658c
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 26, 2024
dc467ac
PR feedback: Created new intf IndicesExpressionResolver and have Remo…
quux00 Sep 26, 2024
711c1f8
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 26, 2024
8e5f170
checkstyle fix
quux00 Sep 27, 2024
838e6a9
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 27, 2024
77cc107
Moved parseClusterAlias from IndexResolver to RemoteClusterAware and …
quux00 Sep 27, 2024
0e71453
Renamed IndicesExpressionResolver intf to IndicesExpressionGrouper.
quux00 Sep 27, 2024
a69f3db
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 27, 2024
a7efbec
PR feedback: Added javadoc to ComputeListener, removed leftover debug…
quux00 Sep 27, 2024
aa8bbaa
Fixed bug where SKIPPED status for unavailable clusters from field-ca…
quux00 Sep 27, 2024
e5e45b5
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 27, 2024
9a304c2
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 30, 2024
d79af98
PR feedback
quux00 Sep 30, 2024
826aab7
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 30, 2024
ec99687
Changed status to SKIPPED when no matching index found for remote clu…
quux00 Sep 30, 2024
02092e9
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 30, 2024
8569bfa
Merge remote-tracking branch 'elastic/main' into esql/ccs-execution-i…
quux00 Sep 30, 2024
6 changes: 6 additions & 0 deletions docs/changelog/112595.yaml
@@ -0,0 +1,6 @@
pr: 112595
summary: Collect and display execution metadata for ES|QL cross cluster searches
area: ES|QL
type: enhancement
issues:
- 112402
183 changes: 180 additions & 3 deletions docs/reference/esql/esql-across-clusters.asciidoc
@@ -85,7 +85,7 @@ POST /_security/role/remote1
"privileges": [ "read","read_cross_cluster" ], <4>
"clusters" : ["my_remote_cluster"] <5>
}
],
],
"remote_cluster": [ <6>
{
"privileges": [
@@ -174,6 +174,184 @@ FROM *:my-index-000001
| LIMIT 10
----

[discrete]
[[ccq-cluster-details]]
==== Cross-cluster metadata

ES|QL {ccs} responses include metadata about the search on each cluster when the response format is JSON.
The example below uses the async search endpoint. {ccs-cap} metadata is also present in responses from
the synchronous search endpoint.

[source,esql]
Member

Could you make this a console snippet instead and make the test runner happy with it with something like // TEST[setup:my_index]. That way we run it so if it ever goes out of date we fail the build.

Member

I guess we'd have to do // TEST[s/cluster_one:my-index-000001,cluster_two:my-index//] to remove the multi-cluster from the test case.....

Contributor Author

OK, I have changed the two queries I added to actually be executed, but (see comment below) I am unable to find a way to test the response JSON, so I have left that as "TEST[skip...]".

----
POST /_query/async?format=json
{
"query": """
FROM my-index-000001,cluster_one:my-index-000001,cluster_two:my-index*
| KEEP author, name, page_count
| SORT page_count DESC
| LIMIT 50
"""
}
----

Which returns:

[source,console-result]
----
{
"is_running": false,
"took": 42, <1>
"columns": [
... // not shown
],
"values": [
... // not shown
],
"_clusters": { <2>
"total": 3,
"successful": 3,
"running": 0,
"skipped": 0,
"partial": 0,
"failed": 0,
"details": { <3>
"(local)": { <4>
"status": "successful",
"indices": "blogs",
"took": 36, <5>
"_shards": { <6>
"total": 13,
"successful": 13,
"skipped": 0,
"failed": 0
}
},
"cluster_one": {
"status": "successful",
"indices": "cluster_one:my-index-000001",
"took": 38,
"_shards": {
"total": 4,
"successful": 4,
"skipped": 0,
"failed": 0
}
},
"cluster_two": {
"status": "successful",
"indices": "cluster_two:my-index-000001", <7>
"took": 41,
"_shards": {
"total": 18,
"successful": 18,
"skipped": 1,
"failed": 0
}
}
}
}
}
----
// TEST[skip: cross-cluster testing env not set up]
Member

The local cluster should be available, right? Could we remove the multi-cluster output so we get the assertion that the shape is pretty close?

Contributor Author

I spent several hours trying but unless you know of a trick to do clever multi-line matching I don't see how this is possible. Among the things I tried was adding "m" to the end of the matcher to indicate multi-line matching (as in Perl matching), but that doesn't work. Mostly I just get failed runs with no information as to what is wrong.

Plus I'm not really sure it's worth it? The whole point of this section is to show the _clusters/details section so testing against a non-CCS set up doesn't seem useful.

We probably need another ticket to enable the multi-cluster testing setup that search-across-clusters.asciidoc uses, as that was not set up for this test.

Member

👍


<1> How long the entire search (across all clusters) took, in milliseconds.
<2> This section of counters shows all possible cluster search states and how many cluster
searches are currently in each state. A cluster can have one of the following statuses: *running*,
*successful* (searches on all shards were successful), *partial* (the search succeeded on at least
one shard of the cluster and failed on at least one), *skipped* (the search
failed on a cluster marked with `skip_unavailable`=`true`) or *failed* (the search
failed on a cluster marked with `skip_unavailable`=`false`).
<3> The `_clusters/details` section shows metadata about the search on each cluster.
<4> If you included indices from the local cluster you sent the request to in your {ccs},
it is identified as "(local)".
<5> How long (in milliseconds) the search took on each cluster. This can be useful to determine
which clusters have slower response times than others.
<6> The shard details for the search on that cluster, including a count of shards that were
skipped because the can-match phase determined that they held no matching data and so did not
need to be included in the full ES|QL query.
<7> The index expression supplied by the user. If you provide a wildcard such as `my-index*`,
this section will show the resolved index name(s) here, unless no matching indices could
be found on that cluster, in which case the wildcard expression will be retained here.
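As a rough illustration of how a client might use the per-cluster `took` and `_shards` values described in the callouts above, the sketch below finds the slowest cluster and sums the skipped shards. It assumes the `_clusters.details` object has already been parsed into nested `Map`s by a JSON library of your choice. The class and method names (`ClusterDetailsSummary`, `slowestCluster`, `totalSkippedShards`) are hypothetical and not part of Elasticsearch or any client library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative helper only; not part of Elasticsearch or any client library. */
final class ClusterDetailsSummary {

    /** Returns the alias of the cluster with the largest "took" value, or null if none has one. */
    static String slowestCluster(Map<String, Map<String, Object>> details) {
        String slowest = null;
        long maxTook = -1;
        for (Map.Entry<String, Map<String, Object>> entry : details.entrySet()) {
            Object took = entry.getValue().get("took");
            // "took" is treated as optional here; entries without it are ignored.
            if (took instanceof Number && ((Number) took).longValue() > maxTook) {
                maxTook = ((Number) took).longValue();
                slowest = entry.getKey();
            }
        }
        return slowest;
    }

    /** Sums the "_shards.skipped" counters across all clusters. */
    static long totalSkippedShards(Map<String, Map<String, Object>> details) {
        long skipped = 0;
        for (Map<String, Object> cluster : details.values()) {
            Object shards = cluster.get("_shards");
            if (shards instanceof Map) {
                Object value = ((Map<?, ?>) shards).get("skipped");
                if (value instanceof Number) {
                    skipped += ((Number) value).longValue();
                }
            }
        }
        return skipped;
    }

    /** Builds one entry shaped like the objects under "_clusters.details". */
    static Map<String, Object> cluster(long took, long totalShards, long skippedShards) {
        Map<String, Object> shards = new LinkedHashMap<>();
        shards.put("total", totalShards);
        shards.put("skipped", skippedShards);
        Map<String, Object> cluster = new LinkedHashMap<>();
        cluster.put("took", took);
        cluster.put("_shards", shards);
        return cluster;
    }

    /** Mirrors the took and shard counters from the example response above. */
    static Map<String, Map<String, Object>> exampleDetails() {
        Map<String, Map<String, Object>> details = new LinkedHashMap<>();
        details.put("(local)", cluster(36, 13, 0));
        details.put("cluster_one", cluster(38, 4, 0));
        details.put("cluster_two", cluster(41, 18, 1));
        return details;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Object>> details = exampleDetails();
        System.out.println("slowest cluster: " + slowestCluster(details));
        System.out.println("skipped shards: " + totalSkippedShards(details));
    }
}
```

With the values from the example response, the slowest cluster is `cluster_two` and one shard was skipped in total.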


The cross-cluster metadata can be used to determine whether any data came back from a cluster.
For instance, in the query below, the wildcard expression for `cluster_two` did not
resolve to a concrete index (or indices) and the total number of shards searched is
zero. This indicates that no matching index was found on that cluster. Because the other
cluster did have a matching index, the search did not return an error but instead
returned all the matching data it could find.

[source,esql]
----
POST /_query/async?format=json
{
"query": """
FROM cluster_one:my-index*,cluster_two:logs*
| KEEP author, name, page_count
| SORT page_count DESC
| LIMIT 5
"""
}
----

Which returns:

[source,console-result]
----
{
"is_running": false,
"took": 55,
"columns": [
... // not shown
],
"values": [
... // not shown
],
"_clusters": {
"total": 2,
"successful": 2,
"running": 0,
"skipped": 0,
"partial": 0,
"failed": 0,
"details": {
"cluster_one": {
"status": "successful",
"indices": "cluster_one:my-index-000001",
"took": 38,
"_shards": {
"total": 4,
"successful": 4,
"skipped": 0,
"failed": 0
}
},
"cluster_two": {
"status": "successful", <1>
"indices": "cluster_two:logs*", <2>
"took": 0,
"_shards": {
"total": 0, <3>
"successful": 0,
"skipped": 0,
"failed": 0
}
}
}
}
}
----
// TEST[skip: cross-cluster testing env not set up]

<1> This search is still marked as successful, even though no data was searched.
<2> Since there were no matching indices for the wildcard pattern provided, the original
index expression provided by the user is retained here.
<3> Indicates that no shards were searched (due to not having any matching indices).
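The check described in the callouts above (a cluster reported as successful but with zero shards searched) could be automated on the client side along these lines. This is a minimal sketch that assumes the `_clusters.details` object was already parsed into nested `Map`s. `EmptyClusterCheck` and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative helper only; not part of Elasticsearch or any client library. */
final class EmptyClusterCheck {

    /** Returns the aliases of clusters whose "_shards.total" is 0, meaning no shards were searched. */
    static List<String> clustersWithNoShards(Map<String, Map<String, Object>> details) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, Object>> entry : details.entrySet()) {
            Object shards = entry.getValue().get("_shards");
            if (shards instanceof Map) {
                Object total = ((Map<?, ?>) shards).get("total");
                if (total instanceof Number && ((Number) total).longValue() == 0) {
                    result.add(entry.getKey());
                }
            }
        }
        return result;
    }

    /** Builds one entry shaped like the objects under "_clusters.details". */
    static Map<String, Object> cluster(String status, String indices, long totalShards) {
        Map<String, Object> shards = new LinkedHashMap<>();
        shards.put("total", totalShards);
        Map<String, Object> cluster = new LinkedHashMap<>();
        cluster.put("status", status);
        cluster.put("indices", indices);
        cluster.put("_shards", shards);
        return cluster;
    }

    /** Mirrors the status, indices and shard totals from the example response above. */
    static Map<String, Map<String, Object>> exampleDetails() {
        Map<String, Map<String, Object>> details = new LinkedHashMap<>();
        details.put("cluster_one", cluster("successful", "cluster_one:my-index-000001", 4));
        details.put("cluster_two", cluster("successful", "cluster_two:logs*", 0));
        return details;
    }

    public static void main(String[] args) {
        System.out.println("clusters with no shards searched: " + clustersWithNoShards(exampleDetails()));
    }
}
```

Checking the shard total rather than the `status` field matters here, because the cluster is still reported as successful in this example.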




Contributor Author

@naj-h and @tylerperk - Please review proposed end user docs changes.

[discrete]
[[ccq-enrich]]
==== Enrich across clusters
@@ -331,8 +509,7 @@ setting. As a result, if a remote cluster specified in the request is
unavailable or failed, {ccs} for {esql} queries will fail regardless of the setting.

We are actively working to align the behavior of {ccs} for {esql} with other
-{ccs} APIs. This includes providing detailed execution information for each cluster
-in the response, such as execution time, selected target indices, and shards.
+{ccs} APIs.

[discrete]
[[ccq-during-upgrade]]
5 changes: 4 additions & 1 deletion docs/reference/esql/esql-rest.asciidoc
@@ -192,6 +192,7 @@ Which returns:
[source,console-result]
----
{
"took": 28,
"columns": [
{"name": "author", "type": "text"},
{"name": "name", "type": "text"},
@@ -206,6 +207,7 @@ Which returns:
]
}
----
// TESTRESPONSE[s/"took": 28/"took": "$body.took"/]

[discrete]
[[esql-locale-param]]
@@ -384,12 +386,13 @@ GET /_query/async/FmNJRUZ1YWZCU3dHY1BIOUhaenVSRkEaaXFlZ3h4c1RTWFNocDdnY2FSaERnUT
// TEST[skip: no access to query ID - may return response values]

If the response's `is_running` value is `false`, the query has finished
-and the results are returned.
+and the results are returned, along with the `took` time for the query.

[source,console-result]
----
{
"is_running": false,
"took": 48,
"columns": ...
}
----
16 changes: 15 additions & 1 deletion docs/reference/esql/multivalued-fields.asciidoc
@@ -26,6 +26,7 @@ Multivalued fields come back as a JSON array:
[source,console-result]
----
{
"took": 28,
"columns": [
{ "name": "a", "type": "long"},
{ "name": "b", "type": "long"}
@@ -36,6 +37,8 @@ Multivalued fields come back as a JSON array:
]
}
----
// TESTRESPONSE[s/"took": 28/"took": "$body.took"/]


The relative order of values in a multivalued field is undefined. They'll frequently be in
ascending order but don't rely on that.
@@ -74,6 +77,7 @@ And {esql} sees that removal:
[source,console-result]
----
{
"took": 28,
"columns": [
{ "name": "a", "type": "long"},
{ "name": "b", "type": "keyword"}
@@ -84,6 +88,8 @@ And {esql} sees that removal:
]
}
----
// TESTRESPONSE[s/"took": 28/"took": "$body.took"/]


But other types, like `long` don't remove duplicates.

@@ -115,6 +121,7 @@ And {esql} also sees that:
[source,console-result]
----
{
"took": 28,
"columns": [
{ "name": "a", "type": "long"},
{ "name": "b", "type": "long"}
@@ -125,6 +132,8 @@ And {esql} also sees that:
]
}
----
// TESTRESPONSE[s/"took": 28/"took": "$body.took"/]


This is all at the storage layer. If you store duplicate `long`s and then
convert them to strings the duplicates will stay:
@@ -155,6 +164,7 @@ POST /_query
[source,console-result]
----
{
"took": 28,
"columns": [
{ "name": "a", "type": "long"},
{ "name": "b", "type": "keyword"}
@@ -165,6 +175,7 @@ POST /_query
]
}
----
// TESTRESPONSE[s/"took": 28/"took": "$body.took"/]

[discrete]
[[esql-multivalued-fields-functions]]
@@ -198,6 +209,7 @@ POST /_query
[source,console-result]
----
{
"took": 28,
"columns": [
{ "name": "a", "type": "long"},
{ "name": "b", "type": "long"},
@@ -210,6 +222,7 @@ POST /_query
]
}
----
// TESTRESPONSE[s/"took": 28/"took": "$body.took"/]

Work around this limitation by converting the field to single value with one of:

@@ -233,6 +246,7 @@ POST /_query
[source,console-result]
----
{
"took": 28,
"columns": [
{ "name": "a", "type": "long"},
{ "name": "b", "type": "long"},
@@ -245,4 +259,4 @@ POST /_query
]
}
----

// TESTRESPONSE[s/"took": 28/"took": "$body.took"/]
26 changes: 25 additions & 1 deletion server/src/main/java/org/elasticsearch/ExceptionsHelper.java
@@ -18,6 +18,9 @@
import org.elasticsearch.core.Nullable;
import org.elasticsearch.index.Index;
import org.elasticsearch.rest.RestStatus;
import org.elasticsearch.transport.ConnectTransportException;
import org.elasticsearch.transport.NoSeedNodeLeftException;
import org.elasticsearch.transport.NoSuchRemoteClusterException;
import org.elasticsearch.xcontent.XContentParseException;

import java.io.IOException;
@@ -471,7 +474,7 @@ public static ShardOperationFailedException[] groupBy(ShardOperationFailedExcept
}

/**
-* Utility method useful for determine whether to log an Exception or perhaps
+* Utility method useful for determining whether to log an Exception or perhaps
* avoid logging a stacktrace if the caller/logger is not interested in these
* types of node/shard issues.
*
@@ -489,6 +492,27 @@ public static boolean isNodeOrShardUnavailableTypeException(Throwable t) {
|| t instanceof org.elasticsearch.cluster.block.ClusterBlockException);
}

/**
* Checks the exception against a known list of exceptions that indicate a remote cluster
* cannot be connected to.
*
* @param e Exception to inspect
* @return true if the Exception is known to indicate that a remote cluster
* is unavailable (cannot be connected to by the transport layer)
*/
public static boolean isRemoteUnavailableException(Exception e) {
Throwable unwrap = unwrap(e, ConnectTransportException.class, NoSuchRemoteClusterException.class, NoSeedNodeLeftException.class);
if (unwrap != null) {
return true;
}
Throwable ill = unwrap(e, IllegalStateException.class, IllegalArgumentException.class);
if (ill != null && (ill.getMessage().contains("Unable to open any connections") || ill.getMessage().contains("unknown host"))) {
return true;
}
// doesn't look like any of the known remote exceptions
return false;
}

private static class GroupBy {
final String reason;
final String index;
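The cause-chain check in `isRemoteUnavailableException` can be tried out in isolation with the self-contained sketch below. It is not the Elasticsearch implementation: `ConnectFailure` is a hypothetical stand-in for the transport exception types named in the diff, and `findCause` is a simplified stand-in for `unwrap`. The sketch also guards against a null message, since `Throwable.getMessage()` returns null for exceptions created without one, whereas the diff calls `contains` on the message directly.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

/** Illustrative sketch only; the class and its members are hypothetical. */
final class RemoteUnavailableSketch {

    /** Hypothetical stand-in for the transport-layer exception types named in the diff. */
    static class ConnectFailure extends RuntimeException {
        ConnectFailure(String message) {
            super(message);
        }
    }

    /** Walks the cause chain and returns the first throwable matching one of the types, else null. */
    static Throwable findCause(Throwable t, Class<?>... types) {
        // Identity-based set so that a cyclic cause chain cannot loop forever.
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        while (t != null && seen.add(t)) {
            for (Class<?> type : types) {
                if (type.isInstance(t)) {
                    return t;
                }
            }
            t = t.getCause();
        }
        return null;
    }

    /** Returns true if the exception looks like a failure to connect to a remote cluster. */
    static boolean isRemoteUnavailable(Exception e) {
        if (findCause(e, ConnectFailure.class) != null) {
            return true;
        }
        Throwable ill = findCause(e, IllegalStateException.class, IllegalArgumentException.class);
        if (ill == null) {
            return false;
        }
        String message = ill.getMessage(); // null when the exception was created without a message
        return message != null
            && (message.contains("Unable to open any connections") || message.contains("unknown host"));
    }

    public static void main(String[] args) {
        Exception wrapped = new RuntimeException("query failed", new ConnectFailure("connection refused"));
        System.out.println(isRemoteUnavailable(wrapped));
        System.out.println(isRemoteUnavailable(new IllegalStateException()));
    }
}
```

Matching on message text is fragile by nature, so a caller relying on this kind of check may want tests that pin the exact messages it depends on.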
@@ -209,6 +209,7 @@ static TransportVersion def(int id) {
public static final TransportVersion CCS_TELEMETRY_STATS = def(8_739_00_0);
public static final TransportVersion GLOBAL_RETENTION_TELEMETRY = def(8_740_00_0);
public static final TransportVersion ROUTING_TABLE_VERSION_REMOVED = def(8_741_00_0);
public static final TransportVersion ESQL_CCS_COMPUTE_RESPONSE = def(8_742_00_0);

/*
* STOP! READ THIS FIRST! No, really,