Implement remote cluster CCS telemetry (#112478)
* Add remote cluster stats to _cluster/stats
* Implement remote cluster stats polling
* Add docs for the include_remotes part
smalyshev committed Sep 30, 2024
1 parent e57fc24 commit b26d81c
Showing 15 changed files with 939 additions and 54 deletions.
124 changes: 114 additions & 10 deletions docs/reference/cluster/stats.asciidoc
@@ -40,6 +40,10 @@ If a node does not respond before its timeout expires, the response does not inc
However, timed out nodes are included in the response's `_nodes.failed` property.
Defaults to no timeout.

`include_remotes`::
(Optional, Boolean) If `true`, includes remote cluster information in the response.
Defaults to `false`, so no remote cluster information is returned.

[role="child_attributes"]
[[cluster-stats-api-response-body]]
==== {api-response-body-title}
@@ -183,12 +187,11 @@ This number is based on documents in Lucene segments and may include documents f
This number is based on documents in Lucene segments. {es} reclaims the disk space of deleted Lucene documents when a segment is merged.

`total_size_in_bytes`::
(integer)
Total size in bytes across all primary shards assigned to selected nodes.
(integer) Total size in bytes across all primary shards assigned to selected nodes.

`total_size`::
(string)
Total size across all primary shards assigned to selected nodes, as a human-readable string.
(string) Total size across all primary shards assigned to selected nodes, as a human-readable string.

=====
`store`::
@@ -1285,8 +1288,7 @@ They are included here for expert users, but should otherwise be ignored.
====
`repositories`::
(object) Contains statistics about the <<snapshot-restore,snapshot>> repositories defined in the cluster, broken down
by repository type.
(object) Contains statistics about the <<snapshot-restore,snapshot>> repositories defined in the cluster, broken down by repository type.
+
.Properties of `repositories`
[%collapsible%open]
@@ -1314,13 +1316,74 @@ Each repository type may also include other statistics about the repositories of
[%collapsible%open]
=====
`clusters`:::
(object) Contains remote cluster settings and metrics collected from them.
The keys are cluster names, and the values are per-cluster data.
Only present if the `include_remotes` option is set to `true`.
+
.Properties of `clusters`
[%collapsible%open]
======

`cluster_uuid`:::
(string) The UUID of the remote cluster.

`mode`:::
(string) The <<sniff-proxy-modes, connection mode>> used to communicate with the remote cluster.

`skip_unavailable`:::
(Boolean) The `skip_unavailable` <<skip-unavailable-clusters, setting>> used for this remote cluster.

`transport.compress`:::
(string) Transport compression setting used for this remote cluster.

`version`:::
(array of strings) The list of {es} versions used by the nodes on the remote cluster.

`status`:::
include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=cluster-health-status]
+
See <<cluster-health>>.

`nodes_count`:::
(integer) The total count of nodes in the remote cluster.

`shards_count`:::
(integer) The total number of shards in the remote cluster.

`indices_count`:::
(integer) The total number of indices in the remote cluster.

`indices_total_size_in_bytes`:::
(integer) Total data set size, in bytes, of all shards assigned to selected nodes.

`indices_total_size`:::
(string) Total data set size of all shards assigned to selected nodes, as a human-readable string.

`max_heap_in_bytes`:::
(integer) Maximum amount of memory, in bytes, available for use by the heap across the nodes of the remote cluster.

`max_heap`:::
(string) Maximum amount of memory available for use by the heap across the nodes of the remote cluster,
as a human-readable string.

`mem_total_in_bytes`:::
(integer) Total amount, in bytes, of physical memory across the nodes of the remote cluster.

`mem_total`:::
(string) Total amount of physical memory across the nodes of the remote cluster, as a human-readable string.

======
`_search`:::
(object) Contains the telemetry information about the <<modules-cross-cluster-search, {ccs}>> usage in the cluster.
(object) Contains the information about the <<modules-cross-cluster-search, {ccs}>> usage in the cluster.
+
.Properties of `_search`
[%collapsible%open]
======

`total`:::
(integer) The total number of {ccs} requests that have been executed by the cluster.

@@ -1336,6 +1399,7 @@ Each repository type may also include other statistics about the repositories of
.Properties of `took`
[%collapsible%open]
=======
`max`:::
(integer) The maximum time taken to execute a {ccs} request, in milliseconds.
@@ -1344,6 +1408,7 @@ Each repository type may also include other statistics about the repositories of
`p90`:::
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.
=======

`took_mrt_true`::
@@ -1361,6 +1426,7 @@ Each repository type may also include other statistics about the repositories of
`p90`:::
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.
=======

`took_mrt_false`::
@@ -1378,6 +1444,7 @@ Each repository type may also include other statistics about the repositories of
`p90`:::
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.
=======

`remotes_per_search_max`::
@@ -1391,9 +1458,10 @@ Each repository type may also include other statistics about the repositories of
The keys are the failure reason names and the values are the number of requests that failed for that reason.

`features`::
(object) Contains statistics about the features used in {ccs} requests. The keys are the names of the search feature,
and the values are the number of requests that used that feature. Single request can use more than one feature
(e.g. both `async` and `wildcard`). Known features are:
(object) Contains statistics about the features used in {ccs} requests.
The keys are the names of the search feature, and the values are the number of requests that used that feature.
A single request can use more than one feature (e.g. both `async` and `wildcard`).
Known features are:

* `async` - <<async-search, Async search>>

@@ -1427,6 +1495,7 @@ This may include requests where partial results were returned, but not requests
.Properties of `took`
[%collapsible%open]
========

`max`:::
(integer) The maximum time taken to execute a {ccs} request, in milliseconds.

@@ -1435,6 +1504,7 @@ This may include requests where partial results were returned, but not requests

`p90`:::
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.

========
=======
@@ -1812,3 +1882,37 @@ This API can be restricted to a subset of the nodes using <<cluster-nodes,node f
--------------------------------------------------
GET /_cluster/stats/nodes/node1,node*,master:false
--------------------------------------------------

This API call will return data about the remote clusters if any are configured:

[source,console]
--------------------------------------------------
GET /_cluster/stats?include_remotes=true
--------------------------------------------------

The resulting response will contain the `ccs` object with information about the remote clusters:

[source,js]
--------------------------------------------------
{
  "ccs": {
    "clusters": {
      "remote_cluster": {
        "cluster_uuid": "YjAvIhsCQ9CbjWZb2qJw3Q",
        "mode": "sniff",
        "skip_unavailable": false,
        "transport.compress": "true",
        "version": ["8.16.0"],
        "status": "green",
        "nodes_count": 10,
        "shards_count": 420,
        "indices_count": 10,
        "indices_total_size_in_bytes": 6232658362,
        "max_heap_in_bytes": 1037959168,
        "mem_total_in_bytes": 137438953472
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[skip:TODO]
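
A client-side view of the documented request and response might look like the following minimal Python sketch. It is not part of the commit: the helper names `cluster_stats_path` and `summarize_remotes` are hypothetical, and the sample data simply repeats the example response from the docs above. The sketch only builds the request path and reads the `ccs.clusters` object; it does not contact a cluster.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode


def cluster_stats_path(include_remotes: bool = False) -> str:
    """Build the cluster stats request path (hypothetical helper)."""
    base = "/_cluster/stats"
    if not include_remotes:
        # include_remotes defaults to false, so the parameter can be omitted.
        return base
    return base + "?" + urlencode({"include_remotes": "true"})


def summarize_remotes(stats: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one summary row per remote cluster found under ccs.clusters."""
    # ccs.clusters is only present when include_remotes=true was requested.
    clusters = stats.get("ccs", {}).get("clusters", {})
    rows = []
    for alias, info in sorted(clusters.items()):
        rows.append({
            "alias": alias,
            "status": info.get("status"),
            "nodes": info.get("nodes_count"),
            "mem_total_gib": info.get("mem_total_in_bytes", 0) / 2**30,
        })
    return rows


# Sample data copied from the example response in the docs above.
SAMPLE_RESPONSE = {
    "ccs": {
        "clusters": {
            "remote_cluster": {
                "cluster_uuid": "YjAvIhsCQ9CbjWZb2qJw3Q",
                "mode": "sniff",
                "skip_unavailable": False,
                "transport.compress": "true",
                "version": ["8.16.0"],
                "status": "green",
                "nodes_count": 10,
                "shards_count": 420,
                "indices_count": 10,
                "indices_total_size_in_bytes": 6232658362,
                "max_heap_in_bytes": 1037959168,
                "mem_total_in_bytes": 137438953472,
            }
        }
    }
}

if __name__ == "__main__":
    print(cluster_stats_path(include_remotes=True))
    for row in summarize_remotes(SAMPLE_RESPONSE):
        print(row)
```

Reading `ccs.clusters` with a default of an empty object keeps the same code usable for responses requested without `include_remotes`, where the key is absent.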
Original file line number Diff line number Diff line change
@@ -32,9 +32,9 @@
]
},
"params":{
"flat_settings":{
"include_remotes":{
"type":"boolean",
"description":"Return settings in flat format (default: false)"
"description":"Include remote cluster data in the response (default: false)"
},
"timeout":{
"type":"time",
Original file line number Diff line number Diff line change
@@ -0,0 +1,151 @@
---
"cross-cluster search stats basic":
  - requires:
      test_runner_features: [ capabilities ]
      capabilities:
        - method: GET
          path: /_cluster/stats
          capabilities:
            - "ccs-stats"
      reason: "Capability required to run test"

  - do:
      cluster.stats: { }

  - is_true: ccs
  - is_true: ccs._search
  - is_false: ccs.clusters # no ccs clusters configured
  - exists: ccs._search.total
  - exists: ccs._search.success
  - exists: ccs._search.skipped
  - is_true: ccs._search.took
  - is_true: ccs._search.took_mrt_true
  - is_true: ccs._search.took_mrt_false
  - exists: ccs._search.remotes_per_search_max
  - exists: ccs._search.remotes_per_search_avg
  - exists: ccs._search.failure_reasons
  - exists: ccs._search.features
  - exists: ccs._search.clients
  - exists: ccs._search.clusters

---
"cross-cluster search stats search":
  - requires:
      test_runner_features: [ capabilities ]
      capabilities:
        - method: GET
          path: /_cluster/stats
          capabilities:
            - "ccs-stats"
      reason: "Capability required to run test"

  - do:
      cluster.state: {}
  - set: { master_node: master }
  - do:
      nodes.info:
        metric: [ http, transport ]
  - set: {nodes.$master.http.publish_address: host}
  - set: {nodes.$master.transport.publish_address: transport_host}

  - do:
      cluster.put_settings:
        body:
          persistent:
            cluster:
              remote:
                cluster_one:
                  seeds:
                    - "${transport_host}"
                  skip_unavailable: true
                cluster_two:
                  seeds:
                    - "${transport_host}"
                  skip_unavailable: false
  - is_true: persistent.cluster.remote.cluster_one

  - do:
      indices.create:
        index: test
        body:
          settings:
            number_of_replicas: 0

  - do:
      index:
        index: test
        id: "1"
        refresh: true
        body:
          foo: bar

  - do:
      cluster.health:
        wait_for_status: green

  - do:
      search:
        index: "*,*:*"
        body:
          query:
            match:
              foo: bar

  - do:
      cluster.stats: {}
  - is_true: ccs
  - is_true: ccs._search
  - is_false: ccs.clusters # Still no remotes since include_remotes is not set

  - do:
      cluster.stats:
        include_remotes: true
  - is_true: ccs
  - is_true: ccs._search
  - is_true: ccs.clusters # Now we have remotes
  - is_true: ccs.clusters.cluster_one
  - is_true: ccs.clusters.cluster_two
  - is_true: ccs.clusters.cluster_one.cluster_uuid
  - match: { ccs.clusters.cluster_one.mode: sniff }
  - match: { ccs.clusters.cluster_one.skip_unavailable: true }
  - match: { ccs.clusters.cluster_two.skip_unavailable: false }
  - is_true: ccs.clusters.cluster_one.version
  - match: { ccs.clusters.cluster_one.status: green }
  - match: { ccs.clusters.cluster_two.status: green }
  - is_true: ccs.clusters.cluster_one.nodes_count
  - is_true: ccs.clusters.cluster_one.shards_count
  - is_true: ccs.clusters.cluster_one.indices_count
  - is_true: ccs.clusters.cluster_one.indices_total_size_in_bytes
  - is_true: ccs.clusters.cluster_one.max_heap_in_bytes
  - is_true: ccs.clusters.cluster_one.mem_total_in_bytes
  - is_true: ccs._search.total
  - is_true: ccs._search.success
  - exists: ccs._search.skipped
  - is_true: ccs._search.took
  - is_true: ccs._search.took.max
  - is_true: ccs._search.took.avg
  - is_true: ccs._search.took.p90
  - is_true: ccs._search.took_mrt_true
  - exists: ccs._search.took_mrt_true.max
  - exists: ccs._search.took_mrt_true.avg
  - exists: ccs._search.took_mrt_true.p90
  - is_true: ccs._search.took_mrt_false
  - exists: ccs._search.took_mrt_false.max
  - exists: ccs._search.took_mrt_false.avg
  - exists: ccs._search.took_mrt_false.p90
  - match: { ccs._search.remotes_per_search_max: 2 }
  - match: { ccs._search.remotes_per_search_avg: 2.0 }
  - exists: ccs._search.failure_reasons
  - exists: ccs._search.features
  - exists: ccs._search.clients
  - is_true: ccs._search.clusters
  - is_true: ccs._search.clusters.cluster_one
  - is_true: ccs._search.clusters.cluster_two
  - gte: {ccs._search.clusters.cluster_one.total: 1}
  - gte: {ccs._search.clusters.cluster_two.total: 1}
  - exists: ccs._search.clusters.cluster_one.skipped
  - exists: ccs._search.clusters.cluster_two.skipped
  - is_true: ccs._search.clusters.cluster_one.took
  - is_true: ccs._search.clusters.cluster_one.took.max
  - is_true: ccs._search.clusters.cluster_one.took.avg
  - is_true: ccs._search.clusters.cluster_one.took.p90
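
The test above mixes `exists` and `is_true`, apparently so that counters which may legitimately be zero (such as `skipped`) are only checked for presence. The Python sketch below is a simplified model of that distinction, not the YAML test runner's implementation: the helper names are hypothetical, and the falsy set (`false`, `0`, empty string, `"false"`) is an assumption based on my reading of the REST test conventions. It also splits paths on every dot, so it would not handle a key such as `transport.compress`.

```python
_MISSING = object()  # sentinel for "path not present in the response body"


def lookup(body, path):
    """Resolve a dotted path such as 'ccs._search.took.max'."""
    current = body
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def exists(body, path):
    """Model of `exists`: the key is present and not null."""
    value = lookup(body, path)
    return value is not _MISSING and value is not None


def is_true(body, path):
    """Model of `is_true`: the key is present and holds a truthy value."""
    value = lookup(body, path)
    if value is _MISSING or value is None:
        return False
    # Assumed falsy values: false, 0, the empty string, and the string "false".
    return value not in (False, 0, "", "false")


def is_false(body, path):
    """Model of `is_false`: the key is absent or holds a falsy value."""
    return not is_true(body, path)


if __name__ == "__main__":
    body = {"ccs": {"_search": {"total": 1, "skipped": 0, "took": {"max": 5}}}}
    print(exists(body, "ccs._search.skipped"))   # present, even though it is 0
    print(is_true(body, "ccs._search.skipped"))  # 0 counts as falsy in this model
    print(is_false(body, "ccs.clusters"))        # absent without include_remotes
```

Under this model, `is_true: ccs._search.skipped` would fail whenever no cluster was skipped, which is consistent with the test using `exists` for that field.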