Implement remote cluster CCS telemetry #112478

Merged: 38 commits, Sep 30, 2024

Commits
6092acf
Add remote cluster stats to _cluster/stats - phase 1
smalyshev Aug 29, 2024
c6e7c04
Implement remote cluster stats polling
smalyshev Sep 3, 2024
fc0b06a
Improve failure handling
smalyshev Sep 4, 2024
16d0526
Parallelize remotes fetching
smalyshev Sep 5, 2024
172c473
Add new action to non-operator list
smalyshev Sep 5, 2024
fb03fea
Add docs for the include_remotes part
smalyshev Sep 5, 2024
3aa0bbd
Add tests
smalyshev Sep 6, 2024
8b86ac4
re-fix the name
smalyshev Sep 10, 2024
1990c25
ws
smalyshev Sep 10, 2024
8ad91b2
Create separate class for remote request - seems cleaner this way
smalyshev Sep 11, 2024
e77eafc
Add capabilities for stats depending on the flag for now
smalyshev Sep 11, 2024
b883a0b
Split remote handler into two - HandledAction and TransportNodesAction
smalyshev Sep 11, 2024
47a7c4c
Refactoring - eliminate class split and make TransportClusterStatsAct…
smalyshev Sep 12, 2024
b83818d
Refactor TransportClusterStatsAction - should not use field for the f…
smalyshev Sep 12, 2024
75854ba
Refactor the code to eliminate blocking wait
smalyshev Sep 13, 2024
29eecfa
Refactor remote stats with using CancellableFanout
smalyshev Sep 13, 2024
cf5e2ec
Add parent task it to requests
smalyshev Sep 13, 2024
630637e
Refactor listener to simplify
smalyshev Sep 13, 2024
76866cf
Pull feedback
smalyshev Sep 16, 2024
4a5fb64
Rm it from constants
smalyshev Sep 16, 2024
24708cc
Merge branch 'main' into ccs-telemetry-remotes
elasticmachine Sep 16, 2024
83f6bb1
Update the licenses
smalyshev Sep 16, 2024
bfc7cca
Add roundtrip REST YAML test
smalyshev Sep 16, 2024
5bc72fd
Add check for remote transport
smalyshev Sep 16, 2024
ce1cf6c
Implement human option in bytes and drop nodes on remote request
smalyshev Sep 17, 2024
0b1d20f
Give it a little longer to boot up
smalyshev Sep 17, 2024
17b144f
Merge branch 'main' into ccs-telemetry-remotes
smalyshev Sep 19, 2024
a07206e
Update TransportClusterStatsAction code with new action context code
smalyshev Sep 19, 2024
98b8541
Merge branch 'main' into ccs-telemetry-remotes
elasticmachine Sep 19, 2024
fbf1ca1
Pull feedback & refactoring
smalyshev Sep 20, 2024
40090bc
Add handling of possible exception
smalyshev Sep 20, 2024
4daaaea
Revert "Add handling of possible exception"
smalyshev Sep 20, 2024
0f6c212
Add new version for RemoteClusterStatsRequest
smalyshev Sep 21, 2024
561d82e
Merge branch 'main' into ccs-telemetry-remotes
smalyshev Sep 21, 2024
470c86c
Update for docs feedback
smalyshev Sep 23, 2024
5ba37ca
Merge branch 'main' into ccs-telemetry-remotes
smalyshev Sep 26, 2024
2105c87
Do not sent RemoteClusterStatsRequest to old clusters
smalyshev Sep 26, 2024
95f9a78
Merge branch 'main' into ccs-telemetry-remotes
smalyshev Sep 30, 2024
124 changes: 114 additions & 10 deletions docs/reference/cluster/stats.asciidoc
@@ -40,6 +40,10 @@ If a node does not respond before its timeout expires, the response does not inc
However, timed out nodes are included in the response's `_nodes.failed` property.
Defaults to no timeout.

`include_remotes`::
(Optional, Boolean) If `true`, includes remote cluster information in the response.
Defaults to `false`, so no remote cluster information is returned.

[role="child_attributes"]
[[cluster-stats-api-response-body]]
==== {api-response-body-title}
@@ -183,12 +187,11 @@ This number is based on documents in Lucene segments and may include documents f
This number is based on documents in Lucene segments. {es} reclaims the disk space of deleted Lucene documents when a segment is merged.

`total_size_in_bytes`::
(integer)
Total size in bytes across all primary shards assigned to selected nodes.
(integer) Total size in bytes across all primary shards assigned to selected nodes.

`total_size`::
(string)
Total size across all primary shards assigned to selected nodes, as a human-readable string.
(string) Total size across all primary shards assigned to selected nodes, as a human-readable string.

=====

`store`::
@@ -1285,8 +1288,7 @@ They are included here for expert users, but should otherwise be ignored.
====

`repositories`::
(object) Contains statistics about the <<snapshot-restore,snapshot>> repositories defined in the cluster, broken down
by repository type.
(object) Contains statistics about the <<snapshot-restore,snapshot>> repositories defined in the cluster, broken down by repository type.
+
.Properties of `repositories`
[%collapsible%open]
@@ -1314,13 +1316,74 @@ Each repository type may also include other statistics about the repositories of
[%collapsible%open]
=====

`clusters`:::
(object) Contains remote cluster settings and metrics collected from them.
The keys are cluster names, and the values are per-cluster data.
Only present if the `include_remotes` option is set to `true`.

+
.Properties of `clusters`
[%collapsible%open]
======

`cluster_uuid`:::
(string) The UUID of the remote cluster.

`mode`:::
(string) The <<sniff-proxy-modes, connection mode>> used to communicate with the remote cluster.

`skip_unavailable`:::
(Boolean) The `skip_unavailable` <<skip-unavailable-clusters, setting>> used for this remote cluster.

`transport.compress`:::
(string) Transport compression setting used for this remote cluster.

`version`:::
(array of strings) The list of {es} versions used by the nodes on the remote cluster.

`status`:::
include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=cluster-health-status]
+
See <<cluster-health>>.

`nodes_count`:::
(integer) The total count of nodes in the remote cluster.

`shards_count`:::
(integer) The total number of shards in the remote cluster.

`indices_count`:::
(integer) The total number of indices in the remote cluster.

`indices_total_size_in_bytes`:::
(integer) Total data set size, in bytes, of all shards assigned to selected nodes.

`indices_total_size`:::
(string) Total data set size of all shards assigned to selected nodes, as a human-readable string.

`max_heap_in_bytes`:::
(integer) Maximum amount of memory, in bytes, available for use by the heap across the nodes of the remote cluster.

`max_heap`:::
(string) Maximum amount of memory available for use by the heap across the nodes of the remote cluster,
as a human-readable string.

`mem_total_in_bytes`:::
(integer) Total amount, in bytes, of physical memory across the nodes of the remote cluster.

`mem_total`:::
(string) Total amount of physical memory across the nodes of the remote cluster, as a human-readable string.

======


`_search`:::
(object) Contains the telemetry information about the <<modules-cross-cluster-search, {ccs}>> usage in the cluster.
(object) Contains the information about the <<modules-cross-cluster-search, {ccs}>> usage in the cluster.
+
.Properties of `_search`
[%collapsible%open]
======

`total`:::
(integer) The total number of {ccs} requests that have been executed by the cluster.

@@ -1336,6 +1399,7 @@ Each repository type may also include other statistics about the repositories of
.Properties of `took`
[%collapsible%open]
=======

`max`:::
(integer) The maximum time taken to execute a {ccs} request, in milliseconds.

@@ -1344,6 +1408,7 @@ Each repository type may also include other statistics about the repositories of

`p90`:::
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.

=======

`took_mrt_true`::
@@ -1361,6 +1426,7 @@ Each repository type may also include other statistics about the repositories of

`p90`:::
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.

=======

`took_mrt_false`::
@@ -1378,6 +1444,7 @@ Each repository type may also include other statistics about the repositories of

`p90`:::
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.

=======

`remotes_per_search_max`::
@@ -1391,9 +1458,10 @@ Each repository type may also include other statistics about the repositories of
The keys are the failure reason names and the values are the number of requests that failed for that reason.

`features`::
(object) Contains statistics about the features used in {ccs} requests. The keys are the names of the search feature,
and the values are the number of requests that used that feature. Single request can use more than one feature
(e.g. both `async` and `wildcard`). Known features are:
(object) Contains statistics about the features used in {ccs} requests.
The keys are the names of the search feature, and the values are the number of requests that used that feature.
A single request can use more than one feature (e.g. both `async` and `wildcard`).
Known features are:

* `async` - <<async-search, Async search>>

@@ -1427,6 +1495,7 @@ This may include requests where partial results were returned, but not requests
.Properties of `took`
[%collapsible%open]
========

`max`:::
(integer) The maximum time taken to execute a {ccs} request, in milliseconds.

@@ -1435,6 +1504,7 @@ This may include requests where partial results were returned, but not requests

`p90`:::
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.

========

=======
@@ -1812,3 +1882,37 @@ This API can be restricted to a subset of the nodes using <<cluster-nodes,node f
--------------------------------------------------
GET /_cluster/stats/nodes/node1,node*,master:false
--------------------------------------------------

This API call will return data about the remote clusters if any are configured:

[source,console]
--------------------------------------------------
GET /_cluster/stats?include_remotes=true
--------------------------------------------------

The resulting response will contain the `ccs` object with information about the remote clusters:

[source,js]
--------------------------------------------------
{
  "ccs": {
    "clusters": {
      "remote_cluster": {
        "cluster_uuid": "YjAvIhsCQ9CbjWZb2qJw3Q",
        "mode": "sniff",
        "skip_unavailable": false,
        "transport.compress": "true",
        "version": ["8.16.0"],
        "status": "green",
        "nodes_count": 10,
        "shards_count": 420,
        "indices_count": 10,
        "indices_total_size_in_bytes": 6232658362,
        "max_heap_in_bytes": 1037959168,
        "mem_total_in_bytes": 137438953472
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[skip:TODO]
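
As a rough illustration of how a client might consume the `ccs.clusters` object documented above, here is a minimal Python sketch. The helper name `summarize_remotes` and the shape of its output are hypothetical and not part of this PR; the field names and the sample payload are taken from the example response in the docs.

```python
import json

# Sample payload copied from the documentation example above.
SAMPLE_RESPONSE = json.loads("""
{
  "ccs": {
    "clusters": {
      "remote_cluster": {
        "cluster_uuid": "YjAvIhsCQ9CbjWZb2qJw3Q",
        "mode": "sniff",
        "skip_unavailable": false,
        "transport.compress": "true",
        "version": ["8.16.0"],
        "status": "green",
        "nodes_count": 10,
        "shards_count": 420,
        "indices_count": 10,
        "indices_total_size_in_bytes": 6232658362,
        "max_heap_in_bytes": 1037959168,
        "mem_total_in_bytes": 137438953472
      }
    }
  }
}
""")


def summarize_remotes(stats):
    """Return one summary row per remote cluster (hypothetical helper).

    `ccs.clusters` is only present when the request was made with
    include_remotes=true, so a missing key yields an empty list.
    """
    clusters = stats.get("ccs", {}).get("clusters", {})
    rows = []
    for name, info in sorted(clusters.items()):
        size_bytes = info.get("indices_total_size_in_bytes", 0)
        rows.append({
            "name": name,
            "status": info.get("status"),
            "versions": info.get("version", []),
            "nodes": info.get("nodes_count", 0),
            # Convert bytes to GiB for display only.
            "indices_gib": round(size_bytes / 2**30, 2),
        })
    return rows


if __name__ == "__main__":
    for row in summarize_remotes(SAMPLE_RESPONSE):
        print(row)
```

Because the remote section is opt-in, the sketch treats a missing `clusters` key as "no remote data requested" rather than as an error.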
@@ -32,9 +32,9 @@
]
},
"params":{
"flat_settings":{
"include_remotes":{
"type":"boolean",
"description":"Return settings in flat format (default: false)"
"description":"Include remote cluster data into the response (default: false)"
},
"timeout":{
"type":"time",
@@ -0,0 +1,151 @@
---
"cross-cluster search stats basic":
  - requires:
      test_runner_features: [ capabilities ]
      capabilities:
        - method: GET
          path: /_cluster/stats
          capabilities:
            - "ccs-stats"
      reason: "Capability required to run test"

  - do:
      cluster.stats: { }

  - is_true: ccs
  - is_true: ccs._search
  - is_false: ccs.clusters # no ccs clusters configured
  - exists: ccs._search.total
  - exists: ccs._search.success
  - exists: ccs._search.skipped
  - is_true: ccs._search.took
  - is_true: ccs._search.took_mrt_true
  - is_true: ccs._search.took_mrt_false
  - exists: ccs._search.remotes_per_search_max
  - exists: ccs._search.remotes_per_search_avg
  - exists: ccs._search.failure_reasons
  - exists: ccs._search.features
  - exists: ccs._search.clients
  - exists: ccs._search.clusters

---
"cross-cluster search stats search":
  - requires:
      test_runner_features: [ capabilities ]
      capabilities:
        - method: GET
          path: /_cluster/stats
          capabilities:
            - "ccs-stats"
      reason: "Capability required to run test"

  - do:
      cluster.state: {}
  - set: { master_node: master }
  - do:
      nodes.info:
        metric: [ http, transport ]
  - set: {nodes.$master.http.publish_address: host}
  - set: {nodes.$master.transport.publish_address: transport_host}

  - do:
      cluster.put_settings:
        body:
          persistent:
            cluster:
              remote:
                cluster_one:
                  seeds:
                    - "${transport_host}"
                  skip_unavailable: true
                cluster_two:
                  seeds:
                    - "${transport_host}"
                  skip_unavailable: false
  - is_true: persistent.cluster.remote.cluster_one

  - do:
      indices.create:
        index: test
        body:
          settings:
            number_of_replicas: 0

  - do:
      index:
        index: test
        id: "1"
        refresh: true
        body:
          foo: bar

  - do:
      cluster.health:
        wait_for_status: green

  - do:
      search:
        index: "*,*:*"
        body:
          query:
            match:
              foo: bar

  - do:
      cluster.stats: {}
  - is_true: ccs
  - is_true: ccs._search
  - is_false: ccs.clusters # Still no remotes since include_remotes is not set

  - do:
      cluster.stats:
        include_remotes: true
  - is_true: ccs
  - is_true: ccs._search
  - is_true: ccs.clusters # Now we have remotes
  - is_true: ccs.clusters.cluster_one
  - is_true: ccs.clusters.cluster_two
  - is_true: ccs.clusters.cluster_one.cluster_uuid
  - match: { ccs.clusters.cluster_one.mode: sniff }
  - match: { ccs.clusters.cluster_one.skip_unavailable: true }
  - match: { ccs.clusters.cluster_two.skip_unavailable: false }
  - is_true: ccs.clusters.cluster_one.version
  - match: { ccs.clusters.cluster_one.status: green }
  - match: { ccs.clusters.cluster_two.status: green }
  - is_true: ccs.clusters.cluster_one.nodes_count
  - is_true: ccs.clusters.cluster_one.shards_count
  - is_true: ccs.clusters.cluster_one.indices_count
  - is_true: ccs.clusters.cluster_one.indices_total_size_in_bytes
  - is_true: ccs.clusters.cluster_one.max_heap_in_bytes
  - is_true: ccs.clusters.cluster_one.mem_total_in_bytes
  - is_true: ccs._search.total
  - is_true: ccs._search.success
  - exists: ccs._search.skipped
  - is_true: ccs._search.took
  - is_true: ccs._search.took.max
  - is_true: ccs._search.took.avg
  - is_true: ccs._search.took.p90
  - is_true: ccs._search.took_mrt_true
  - exists: ccs._search.took_mrt_true.max
  - exists: ccs._search.took_mrt_true.avg
  - exists: ccs._search.took_mrt_true.p90
  - is_true: ccs._search.took_mrt_false
  - exists: ccs._search.took_mrt_false.max
  - exists: ccs._search.took_mrt_false.avg
  - exists: ccs._search.took_mrt_false.p90
  - match: { ccs._search.remotes_per_search_max: 2 }
  - match: { ccs._search.remotes_per_search_avg: 2.0 }
  - exists: ccs._search.failure_reasons
  - exists: ccs._search.features
  - exists: ccs._search.clients
  - is_true: ccs._search.clusters
  - is_true: ccs._search.clusters.cluster_one
  - is_true: ccs._search.clusters.cluster_two
  - gte: {ccs._search.clusters.cluster_one.total: 1}
  - gte: {ccs._search.clusters.cluster_two.total: 1}
  - exists: ccs._search.clusters.cluster_one.skipped
  - exists: ccs._search.clusters.cluster_two.skipped
  - is_true: ccs._search.clusters.cluster_one.took
  - is_true: ccs._search.clusters.cluster_one.took.max
  - is_true: ccs._search.clusters.cluster_one.took.avg
  - is_true: ccs._search.clusters.cluster_one.took.p90
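
The remote-cluster assertions in the second test can be paraphrased outside the YAML runner. The sketch below is a hypothetical Python check, not part of this PR: the function `check_remote_stats` and the constant `REQUIRED_REMOTE_FIELDS` are names invented for illustration, and the check is simplified to key presence, whereas the YAML `is_true` assertions also require a truthy value.

```python
# Per-remote fields asserted by the YAML test above.
REQUIRED_REMOTE_FIELDS = (
    "cluster_uuid", "mode", "skip_unavailable", "version", "status",
    "nodes_count", "shards_count", "indices_count",
    "indices_total_size_in_bytes", "max_heap_in_bytes", "mem_total_in_bytes",
)


def check_remote_stats(stats, expected_remotes, include_remotes):
    """Return a list of problems found in a _cluster/stats response dict.

    Mirrors the test's expectations: `ccs.clusters` is absent unless
    include_remotes was set, and otherwise has one entry per remote.
    """
    ccs = stats.get("ccs")
    if not ccs:
        return ["missing ccs section"]

    problems = []
    if "_search" not in ccs:
        problems.append("missing ccs._search")

    clusters = ccs.get("clusters")
    if not include_remotes:
        # Without the flag, no remote data should be returned.
        if clusters:
            problems.append("ccs.clusters present without include_remotes")
        return problems

    for name in expected_remotes:
        info = (clusters or {}).get(name)
        if info is None:
            problems.append(f"missing remote {name}")
            continue
        for field in REQUIRED_REMOTE_FIELDS:
            if field not in info:
                problems.append(f"{name}: missing {field}")
    return problems
```

An empty list means the response matches the expectations; for example, `check_remote_stats(response, ["cluster_one", "cluster_two"], include_remotes=True)` would correspond to the final block of assertions in the test.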