[8.x] Implement remote cluster CCS telemetry (#112478) #113814

Merged
merged 1 commit into from
Sep 30, 2024
124 changes: 114 additions & 10 deletions docs/reference/cluster/stats.asciidoc
@@ -40,6 +40,10 @@ If a node does not respond before its timeout expires, the response does not inc
However, timed out nodes are included in the response's `_nodes.failed` property.
Defaults to no timeout.

`include_remotes`::
(Optional, Boolean) If `true`, includes remote cluster information in the response.
Defaults to `false`, so no remote cluster information is returned.

[role="child_attributes"]
[[cluster-stats-api-response-body]]
==== {api-response-body-title}
@@ -183,12 +187,11 @@ This number is based on documents in Lucene segments and may include documents f
This number is based on documents in Lucene segments. {es} reclaims the disk space of deleted Lucene documents when a segment is merged.

`total_size_in_bytes`::
(integer)
Total size in bytes across all primary shards assigned to selected nodes.
(integer) Total size in bytes across all primary shards assigned to selected nodes.

`total_size`::
(string)
Total size across all primary shards assigned to selected nodes, as a human-readable string.
(string) Total size across all primary shards assigned to selected nodes, as a human-readable string.

=====

`store`::
@@ -1285,8 +1288,7 @@ They are included here for expert users, but should otherwise be ignored.
====

`repositories`::
(object) Contains statistics about the <<snapshot-restore,snapshot>> repositories defined in the cluster, broken down
by repository type.
(object) Contains statistics about the <<snapshot-restore,snapshot>> repositories defined in the cluster, broken down by repository type.
+
.Properties of `repositories`
[%collapsible%open]
@@ -1314,13 +1316,74 @@ Each repository type may also include other statistics about the repositories of
[%collapsible%open]
=====

`clusters`:::
(object) Contains remote cluster settings and metrics collected from them.
The keys are cluster names, and the values are per-cluster data.
Only present if the `include_remotes` option is set to `true`.

+
.Properties of `clusters`
[%collapsible%open]
======

`cluster_uuid`:::
(string) The UUID of the remote cluster.

`mode`:::
(string) The <<sniff-proxy-modes, connection mode>> used to communicate with the remote cluster.

`skip_unavailable`:::
(Boolean) The `skip_unavailable` <<skip-unavailable-clusters, setting>> used for this remote cluster.

`transport.compress`:::
(string) Transport compression setting used for this remote cluster.

`version`:::
(array of strings) The list of {es} versions used by the nodes on the remote cluster.

`status`:::
include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=cluster-health-status]
+
See <<cluster-health>>.

`nodes_count`:::
(integer) The total count of nodes in the remote cluster.

`shards_count`:::
(integer) The total number of shards in the remote cluster.

`indices_count`:::
(integer) The total number of indices in the remote cluster.

`indices_total_size_in_bytes`:::
(integer) Total data set size, in bytes, of all shards assigned to selected nodes.

`indices_total_size`:::
(string) Total data set size of all shards assigned to selected nodes, as a human-readable string.

`max_heap_in_bytes`:::
(integer) Maximum amount of memory, in bytes, available for use by the heap across the nodes of the remote cluster.

`max_heap`:::
(string) Maximum amount of memory available for use by the heap across the nodes of the remote cluster,
as a human-readable string.

`mem_total_in_bytes`:::
(integer) Total amount, in bytes, of physical memory across the nodes of the remote cluster.

`mem_total`:::
(string) Total amount of physical memory across the nodes of the remote cluster, as a human-readable string.

======


`_search`:::
(object) Contains the telemetry information about the <<modules-cross-cluster-search, {ccs}>> usage in the cluster.
(object) Contains information about <<modules-cross-cluster-search, {ccs}>> usage in the cluster.
+
.Properties of `_search`
[%collapsible%open]
======

`total`:::
(integer) The total number of {ccs} requests that have been executed by the cluster.

@@ -1336,6 +1399,7 @@ Each repository type may also include other statistics about the repositories of
.Properties of `took`
[%collapsible%open]
=======

`max`:::
(integer) The maximum time taken to execute a {ccs} request, in milliseconds.

@@ -1344,6 +1408,7 @@ Each repository type may also include other statistics about the repositories of

`p90`:::
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.

=======

`took_mrt_true`::
@@ -1361,6 +1426,7 @@ Each repository type may also include other statistics about the repositories of

`p90`:::
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.

=======

`took_mrt_false`::
@@ -1378,6 +1444,7 @@ Each repository type may also include other statistics about the repositories of

`p90`:::
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.

=======

`remotes_per_search_max`::
@@ -1391,9 +1458,10 @@ Each repository type may also include other statistics about the repositories of
The keys are the failure reason names and the values are the number of requests that failed for that reason.

`features`::
(object) Contains statistics about the features used in {ccs} requests. The keys are the names of the search feature,
and the values are the number of requests that used that feature. Single request can use more than one feature
(e.g. both `async` and `wildcard`). Known features are:
(object) Contains statistics about the features used in {ccs} requests.
The keys are the names of the search features, and the values are the number of requests that used each feature.
A single request can use more than one feature (e.g. both `async` and `wildcard`).
Known features are:

* `async` - <<async-search, Async search>>

@@ -1427,6 +1495,7 @@ This may include requests where partial results were returned, but not requests
.Properties of `took`
[%collapsible%open]
========

`max`:::
(integer) The maximum time taken to execute a {ccs} request, in milliseconds.

@@ -1435,6 +1504,7 @@ This may include requests where partial results were returned, but not requests

`p90`:::
(integer) The 90th percentile of the time taken to execute {ccs} requests, in milliseconds.

========

=======
@@ -1812,3 +1882,37 @@ This API can be restricted to a subset of the nodes using <<cluster-nodes,node f
--------------------------------------------------
GET /_cluster/stats/nodes/node1,node*,master:false
--------------------------------------------------

This API call will return data about the remote clusters if any are configured:

[source,console]
--------------------------------------------------
GET /_cluster/stats?include_remotes=true
--------------------------------------------------

The resulting response will contain the `ccs` object with information about the remote clusters:

[source,js]
--------------------------------------------------
{
  "ccs": {
    "clusters": {
      "remote_cluster": {
        "cluster_uuid": "YjAvIhsCQ9CbjWZb2qJw3Q",
        "mode": "sniff",
        "skip_unavailable": false,
        "transport.compress": "true",
        "version": ["8.16.0"],
        "status": "green",
        "nodes_count": 10,
        "shards_count": 420,
        "indices_count": 10,
        "indices_total_size_in_bytes": 6232658362,
        "max_heap_in_bytes": 1037959168,
        "mem_total_in_bytes": 137438953472
      }
    }
  }
}
--------------------------------------------------
// TESTRESPONSE[skip:TODO]
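
As a rough illustration of how a caller might consume this response, the Python sketch below parses a copy of the example above and builds one summary line per remote cluster. The `summarize_remotes` helper is hypothetical and not part of any Elasticsearch client; it assumes only the field names documented for `ccs.clusters` in this change.

```python
import json

# Copy of the example response shown above.
SAMPLE = """
{
  "ccs": {
    "clusters": {
      "remote_cluster": {
        "cluster_uuid": "YjAvIhsCQ9CbjWZb2qJw3Q",
        "mode": "sniff",
        "skip_unavailable": false,
        "transport.compress": "true",
        "version": ["8.16.0"],
        "status": "green",
        "nodes_count": 10,
        "shards_count": 420,
        "indices_count": 10,
        "indices_total_size_in_bytes": 6232658362,
        "max_heap_in_bytes": 1037959168,
        "mem_total_in_bytes": 137438953472
      }
    }
  }
}
"""


def summarize_remotes(stats: dict) -> list:
    """Return one summary line per remote cluster (hypothetical helper).

    `ccs.clusters` is absent unless `include_remotes=true` was requested,
    so a missing key is treated as "no remote data".
    """
    clusters = stats.get("ccs", {}).get("clusters", {})
    lines = []
    for name, info in sorted(clusters.items()):
        # Convert the raw byte count to GiB for display.
        size_gib = info["indices_total_size_in_bytes"] / 2**30
        lines.append(
            f"{name}: status={info['status']} mode={info['mode']} "
            f"nodes={info['nodes_count']} shards={info['shards_count']} "
            f"size={size_gib:.2f}GiB"
        )
    return lines


summary = summarize_remotes(json.loads(SAMPLE))
for line in summary:
    print(line)
```

Because `clusters` is only returned when `include_remotes=true`, the helper falls back to an empty list instead of raising when the key is missing.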
@@ -32,9 +32,9 @@
]
},
"params":{
"flat_settings":{
"include_remotes":{
"type":"boolean",
"description":"Return settings in flat format (default: false)"
"description":"Include remote cluster data into the response (default: false)"
},
"timeout":{
"type":"time",
@@ -0,0 +1,151 @@
---
"cross-cluster search stats basic":
  - requires:
      test_runner_features: [ capabilities ]
      capabilities:
        - method: GET
          path: /_cluster/stats
          capabilities:
            - "ccs-stats"
      reason: "Capability required to run test"

  - do:
      cluster.stats: { }

  - is_true: ccs
  - is_true: ccs._search
  - is_false: ccs.clusters # no ccs clusters configured
  - exists: ccs._search.total
  - exists: ccs._search.success
  - exists: ccs._search.skipped
  - is_true: ccs._search.took
  - is_true: ccs._search.took_mrt_true
  - is_true: ccs._search.took_mrt_false
  - exists: ccs._search.remotes_per_search_max
  - exists: ccs._search.remotes_per_search_avg
  - exists: ccs._search.failure_reasons
  - exists: ccs._search.features
  - exists: ccs._search.clients
  - exists: ccs._search.clusters

---
"cross-cluster search stats search":
  - requires:
      test_runner_features: [ capabilities ]
      capabilities:
        - method: GET
          path: /_cluster/stats
          capabilities:
            - "ccs-stats"
      reason: "Capability required to run test"

  - do:
      cluster.state: {}
  - set: { master_node: master }
  - do:
      nodes.info:
        metric: [ http, transport ]
  - set: {nodes.$master.http.publish_address: host}
  - set: {nodes.$master.transport.publish_address: transport_host}

  - do:
      cluster.put_settings:
        body:
          persistent:
            cluster:
              remote:
                cluster_one:
                  seeds:
                    - "${transport_host}"
                  skip_unavailable: true
                cluster_two:
                  seeds:
                    - "${transport_host}"
                  skip_unavailable: false
  - is_true: persistent.cluster.remote.cluster_one

  - do:
      indices.create:
        index: test
        body:
          settings:
            number_of_replicas: 0

  - do:
      index:
        index: test
        id: "1"
        refresh: true
        body:
          foo: bar

  - do:
      cluster.health:
        wait_for_status: green

  - do:
      search:
        index: "*,*:*"
        body:
          query:
            match:
              foo: bar

  - do:
      cluster.stats: {}
  - is_true: ccs
  - is_true: ccs._search
  - is_false: ccs.clusters # Still no remotes since include_remotes is not set

  - do:
      cluster.stats:
        include_remotes: true
  - is_true: ccs
  - is_true: ccs._search
  - is_true: ccs.clusters # Now we have remotes
  - is_true: ccs.clusters.cluster_one
  - is_true: ccs.clusters.cluster_two
  - is_true: ccs.clusters.cluster_one.cluster_uuid
  - match: { ccs.clusters.cluster_one.mode: sniff }
  - match: { ccs.clusters.cluster_one.skip_unavailable: true }
  - match: { ccs.clusters.cluster_two.skip_unavailable: false }
  - is_true: ccs.clusters.cluster_one.version
  - match: { ccs.clusters.cluster_one.status: green }
  - match: { ccs.clusters.cluster_two.status: green }
  - is_true: ccs.clusters.cluster_one.nodes_count
  - is_true: ccs.clusters.cluster_one.shards_count
  - is_true: ccs.clusters.cluster_one.indices_count
  - is_true: ccs.clusters.cluster_one.indices_total_size_in_bytes
  - is_true: ccs.clusters.cluster_one.max_heap_in_bytes
  - is_true: ccs.clusters.cluster_one.mem_total_in_bytes
  - is_true: ccs._search.total
  - is_true: ccs._search.success
  - exists: ccs._search.skipped
  - is_true: ccs._search.took
  - is_true: ccs._search.took.max
  - is_true: ccs._search.took.avg
  - is_true: ccs._search.took.p90
  - is_true: ccs._search.took_mrt_true
  - exists: ccs._search.took_mrt_true.max
  - exists: ccs._search.took_mrt_true.avg
  - exists: ccs._search.took_mrt_true.p90
  - is_true: ccs._search.took_mrt_false
  - exists: ccs._search.took_mrt_false.max
  - exists: ccs._search.took_mrt_false.avg
  - exists: ccs._search.took_mrt_false.p90
  - match: { ccs._search.remotes_per_search_max: 2 }
  - match: { ccs._search.remotes_per_search_avg: 2.0 }
  - exists: ccs._search.failure_reasons
  - exists: ccs._search.features
  - exists: ccs._search.clients
  - is_true: ccs._search.clusters
  - is_true: ccs._search.clusters.cluster_one
  - is_true: ccs._search.clusters.cluster_two
  - gte: {ccs._search.clusters.cluster_one.total: 1}
  - gte: {ccs._search.clusters.cluster_two.total: 1}
  - exists: ccs._search.clusters.cluster_one.skipped
  - exists: ccs._search.clusters.cluster_two.skipped
  - is_true: ccs._search.clusters.cluster_one.took
  - is_true: ccs._search.clusters.cluster_one.took.max
  - is_true: ccs._search.clusters.cluster_one.took.avg
  - is_true: ccs._search.clusters.cluster_one.took.p90
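
The assertions in these tests address response fields by dotted path. As a minimal sketch of that idea (not the actual Elasticsearch YAML test runner), the Python below resolves a dotted path against a nested dict and mimics the `exists`, `is_true` and `is_false` checks. The helper names are hypothetical, the truthiness rules are a simplified assumption based on Python semantics, and keys that themselves contain dots (such as `transport.compress`) are not handled.

```python
from typing import Any

_MISSING = object()  # sentinel: distinguishes "absent" from a stored None


def lookup(doc: dict, path: str) -> Any:
    """Resolve a dotted path such as 'ccs._search.total' (hypothetical helper)."""
    node: Any = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def exists(doc: dict, path: str) -> bool:
    """True if the path is present, whatever its value (even 0 or empty)."""
    return lookup(doc, path) is not _MISSING


def is_true(doc: dict, path: str) -> bool:
    """Simplified assumption: present and truthy by Python rules."""
    value = lookup(doc, path)
    return value is not _MISSING and bool(value)


def is_false(doc: dict, path: str) -> bool:
    """Simplified assumption: absent, or falsy by Python rules."""
    return not is_true(doc, path)


# Illustrative responses, shaped like the ones the tests expect.
without_remotes = {"ccs": {"_search": {"total": 1, "skipped": 0}}}
with_remotes = {
    "ccs": {
        "_search": {"total": 1, "skipped": 0},
        "clusters": {"cluster_one": {"mode": "sniff", "skip_unavailable": True}},
    }
}

print(is_false(without_remotes, "ccs.clusters"))
print(is_true(with_remotes, "ccs.clusters.cluster_one"))
print(exists(with_remotes, "ccs._search.skipped"))
```

Under these simplified rules, a counter such as `skipped` can legitimately be `0`, which would fail a truthiness check; that is presumably why the tests assert it with `exists` rather than `is_true`.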