Skip to content

Commit

Permalink
Create a doc for versioning info (#113601)
Browse files Browse the repository at this point in the history
  • Loading branch information
thecoop committed Sep 30, 2024
1 parent 71c252c commit acd4f07
Show file tree
Hide file tree
Showing 2 changed files with 302 additions and 45 deletions.
50 changes: 5 additions & 45 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -660,51 +660,11 @@ node cannot continue to operate as a member of the cluster:

Errors like this should be very rare. When in doubt, prefer `WARN` to `ERROR`.

### Version numbers in the Elasticsearch codebase

Starting in 8.8.0, we have separated out the version number representations
of various aspects of Elasticsearch into their own classes, using their own
numbering scheme separate to release version. The main ones are
`TransportVersion` and `IndexVersion`, representing the version of the
inter-node binary protocol and index data + metadata respectively.

Separated version numbers are comprised of an integer number. The semantic
meaning of a version number are defined within each `*Version` class. There
is no direct mapping between separated version numbers and the release version.
The versions used by any particular instance of Elasticsearch can be obtained
by querying `/_nodes/info` on the node.

#### Using separated version numbers

Whenever a change is made to a component versioned using a separated version
number, there are a few rules that need to be followed:

1. Each version number represents a specific modification to that component,
and should not be modified once it is defined. Each version is immutable
once merged into `main`.
2. To create a new component version, add a new constant to the respective class
with a descriptive name of the change being made. Increment the integer
number according to the particular `*Version` class.

If your pull request has a conflict around your new version constant,
you need to update your PR from `main` and change your PR to use the next
available version number.

### Checking for cluster features

As part of developing a new feature or change, you might need to determine
if all nodes in a cluster have been upgraded to support your new feature.
This can be done using `FeatureService`. To define and check for a new
feature in a cluster:

1. Define a new `NodeFeature` constant with a unique id for the feature
in a class related to the change you're doing.
2. Return that constant from an instance of `FeatureSpecification.getFeatures`,
either an existing implementation or a new implementation. Make sure
the implementation is added as an SPI implementation in `module-info.java`
and `META-INF/services`.
3. To check if all nodes in the cluster support the new feature, call
`FeatureService.clusterHasFeature(ClusterState, NodeFeature)`
### Versioning Elasticsearch

There are various concepts used to identify running node versions,
and the capabilities and compatibility of those nodes. For more information,
see `docs/internal/Versioning.md`

### Creating a distribution

Expand Down
297 changes: 297 additions & 0 deletions docs/internal/Versioning.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,297 @@
Versioning Elasticsearch
========================

Elasticsearch is a complicated product, and is run in many different scenarios.
A single version number is not sufficient to cover the whole of the product,
instead we need different concepts to provide versioning capabilities
for different aspects of Elasticsearch, depending on their scope, updatability,
responsiveness, and maintenance.

## Release version

This is the version number used for published releases of Elasticsearch,
and the Elastic stack. This takes the form _major.minor.patch_,
with a corresponding version id.

Uses of this version number should be avoided, as it does not apply to
some scenarios, and use of release version will break Elasticsearch nodes.

The release version is accessible in code through `Build.current().version()`,
but it **should not** be assumed that this is a semantic version number,
it could be any arbitrary string.

## Transport protocol

The transport protocol is used to send binary data between Elasticsearch nodes;
`TransportVersion` is the version number used for this protocol.
This version number is negotiated between each pair of nodes in the cluster
on first connection, and is set as the lower of the highest transport version
understood by each node.
This version is then accessible through the `getTransportVersion` method
on `StreamInput` and `StreamOutput`, so serialization code can read/write
objects in a form that will be understood by the other node.

Every change to the transport protocol is represented by a new transport version,
higher than all previous transport versions, which then becomes the highest version
recognized by that build of Elasticsearch. The version ids are stored
as constants in the `TransportVersions` class.
Each id has a standard pattern `M_NNN_SS_P`, where:
* `M` is the major version
* `NNN` is an incrementing id
* `SS` is used in subsidiary repos amending the default transport protocol
* `P` is used for patches and backports

When you make a change to the serialization form of any object,
you need to create a new sequential constant in `TransportVersions`,
introduced in the same PR that adds the change, that increments
the `NNN` component from the previous highest version,
with other components set to zero.
For example, if the previous version number is `8_413_00_1`,
the next version number should be `8_414_00_0`.

Once you have defined your constant, you then need to use it
in serialization code. If the transport version is at or above the new id,
the modified protocol should be used:

str = in.readString();
bool = in.readBoolean();
if (in.getTransportVersion().onOrAfter(TransportVersions.NEW_CONSTANT)) {
num = in.readVInt();
}

If a transport version change needs to be reverted, a **new** version constant
should be added representing the revert, and the version id checks
adjusted appropriately to only use the modified protocol between the version id
the change was added, and the new version id used for the revert (exclusive).
The `between` method can be used for this.

Once a transport change with a new version has been merged into main or a release branch,
it **must not** be modified - this is so the meaning of that specific
transport version does not change.

_Elastic developers_ - please see corresponding documentation for Serverless
on creating transport versions for Serverless changes.

### Collapsing transport versions

As each change adds a new constant, the list of constants in `TransportVersions`
will keep growing. However, once there has been an official release of Elasticsearch,
that includes that change, that specific transport version is no longer needed,
apart from constants that happen to be used for release builds.
As part of managing transport versions, consecutive transport versions can be
periodically collapsed together into those that are only used for release builds.
This task is normally performed by Core/Infra on a semi-regular basis,
usually after each new minor release, to collapse the transport versions
for the previous minor release. An example of such an operation can be found
[here](https://github.com/elastic/elasticsearch/pull/104937).

### Minimum compatibility versions

The transport version used between two nodes is determined by the initial handshake
(see `TransportHandshaker`, where the two nodes swap their highest known transport version).
The lowest transport version that is compatible with the current node
is determined by `TransportVersions.MINIMUM_COMPATIBLE`,
and the node is prevented from joining the cluster if it is below that version.
This constant should be updated manually on a major release.

The minimum version that can be used for CCS is determined by
`TransportVersions.MINIMUM_CCS_VERSION`, but this is not actively checked
before queries are performed. Only if a query cannot be serialized at that
version is an action rejected. This constant is updated automatically
as part of performing a release.

### Mapping to release versions

For releases that do use a version number, it can be confusing to encounter
a log or exception message that references an arbitrary transport version,
where you don't know which release version that corresponds to. This is where
the `.toReleaseVersion()` method comes in. It uses metadata stored in a csv file
(`TransportVersions.csv`) to map from the transport version id to the corresponding
release version. For any transport versions it encounters without a direct map,
it performs a best guess based on the information it has. The csv file
is updated automatically as part of performing a release.

In releases that do not have a release version number, that method becomes
a no-op.

### Managing patches and backports

Backporting transport version changes to previous releases
should only be done if absolutely necessary, as it is very easy to get wrong
and break the release in a way that is very hard to recover from.

If we consider the version number as an incrementing line, what we are doing is
grafting a change that takes effect at a certain point in the line,
to additionally take effect in a fixed window earlier in the line.

To take an example, using indicative version numbers, when the latest
transport version is 52, we decide we need to backport a change done in
transport version 50 to transport version 45. We use the `P` version id component
to create version 45.1 with the backported change.
This change will apply for version ids 45.1 to 45.9 (should they exist in the future).

The serialization code in the backport needs to use the backported protocol
for all version numbers 45.1 to 45.9. The `TransportVersion.isPatchFrom` method
can be used to easily determine if this is the case: `streamVersion.isPatchFrom(45.1)`.
However, the `onOrAfter` also does what is needed on patch branches.

The serialization code in version 53 then needs to additionally check
version numbers 45.1-45.9 to use the backported protocol, also using the `isPatchFrom` method.

As an example, [this transport change](https://github.com/elastic/elasticsearch/pull/107862)
was backported from 8.15 to [8.14.0](https://github.com/elastic/elasticsearch/pull/108251)
and [8.13.4](https://github.com/elastic/elasticsearch/pull/108250) at the same time
(8.14 was a build candidate at the time).

The 8.13 PR has:

if (transportVersion.onOrAfter(8.13_backport_id))

The 8.14 PR has:

if (transportVersion.isPatchFrom(8.13_backport_id)
|| transportVersion.onOrAfter(8.14_backport_id))

The 8.15 PR has:

if (transportVersion.isPatchFrom(8.13_backport_id)
|| transportVersion.isPatchFrom(8.14_backport_id)
|| transportVersion.onOrAfter(8.15_transport_id))

In particular, if you are backporting a change to a patch release,
you also need to make sure that any subsequent released version on any branch
also has that change, and knows about the patch backport ids and what they mean.

## Index version

Index version is a single incrementing version number for the index data format,
metadata, and associated mappings. It is declared the same way as the
transport version - with the pattern `M_NNN_SS_P`, for the major version, version id,
subsidiary version id, and patch number respectively.

Index version is stored in index metadata when an index is created,
and it is used to determine the storage format and what functionality that index supports.
The index version does not change once an index is created.

In the same way as transport versions, when a change is needed to the index
data format or metadata, or new mapping types are added, create a new version constant
below the last one, incrementing the `NNN` version component.

Unlike transport version, version constants cannot be collapsed together,
as an index keeps its creation version id once it is created.
Fortunately, new index versions are only created once a month or so,
so we don’t have a large list of index versions that need managing.

Similar to transport version, index version has a `toReleaseVersion` to map
onto release versions, in appropriate situations.

## Cluster Features

Cluster features are identifiers, published by a node in cluster state,
indicating they support a particular top-level operation or set of functionality.
They are used for internal checks within Elasticsearch, and for gating tests
on certain functionality. For example, to check all nodes have upgraded
to a certain point before running a large migration operation to a new data format.
Cluster features should not be referenced by anything outside the Elasticsearch codebase.

Cluster features are indicative of top-level functionality introduced to
Elasticsearch - e.g. a new transport endpoint, or new operations.

It is also used to check nodes can join a cluster - once all nodes in a cluster
support a particular feature, no nodes can then join the cluster that do not
support that feature. This is to ensure that once a feature is supported
by a cluster, it will then always be supported in the future.

To declare a new cluster feature, add an implementation of the `FeatureSpecification` SPI,
suitably registered (or use an existing one for your code area), and add the feature
as a constant to be returned by getFeatures. To then check whether all nodes
in the cluster support that feature, use the method `clusterHasFeature` on `FeatureService`.
It is only possible to check whether all nodes in the cluster have a feature;
individual node checks should not be done.

Once a cluster feature is declared and deployed, it cannot be modified or removed,
else new nodes will not be able to join existing clusters.
If functionality represented by a cluster feature needs to be removed,
a new cluster feature should be added indicating that functionality is no longer
supported, and the code modified accordingly (bearing in mind additional BwC constraints).

The cluster features infrastructure is only designed to support a few hundred features
per major release, and once features are added to a cluster they can not be removed.
Cluster features should therefore be used sparingly.
Adding too many cluster features risks increasing cluster instability.

When we release a new major version N, we limit our backwards compatibility
to the highest minor of the previous major N-1. Therefore, any cluster formed
with the new major version is guaranteed to have all features introduced during
releases of major N-1. All such features can be deemed to be met by the cluster,
and the features themselves can be removed from cluster state over time,
and the feature checks removed from the code of major version N.

### Testing

Tests often want to check if a certain feature is implemented / available on all nodes,
particularly BwC or mixed cluster test.

Rather than introducing a production feature just for a test condition,
this can be done by adding a _test feature_ in an implementation of
`FeatureSpecification.getTestFeatures`. These features will only be set
on clusters running as part of an integration test. Even so, cluster features
should be used sparingly if possible; Capabilities is generally a better
option for test conditions.

In Java Rest tests, checking cluster features can be done using
`ESRestTestCase.clusterHasFeature(feature)`

In YAML Rest tests, conditions can be defined in the `requires` or `skip` sections
that use cluster features; see [here](https://github.com/elastic/elasticsearch/blob/main/rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/README.asciidoc#skipping-tests) for more information.

To aid with backwards compatibility tests, the test framework adds synthetic features
for each previously released Elasticsearch version, of the form `gte_v{VERSION}`
(for example `gte_v8.14.2`).
This can be used to add conditions based on previous releases. It _cannot_ be used
to check the current snapshot version; real features or capabilities should be
used instead.

## Capabilities

The Capabilities API is a REST API for external clients to check the capabilities
of an Elasticsearch cluster. As it is dynamically calculated for every query,
it is not limited in size or usage.

A capabilities query can be used to query for 3 things:
* Is this endpoint supported for this HTTP method?
* Are these parameters of this endpoint supported?
* Are these capabilities (arbitrary string ids) of this endpoint supported?

The API will return with a simple true/false, indicating if all specified aspects
of the endpoint are supported by all nodes in the cluster.
If any aspect is not supported by any one node, the API returns `false`.

The API can also return `supported: null` (indicating unknown)
if there was a problem communicating with one or more nodes in the cluster.

All registered endpoints automatically work with the endpoint existence check.
To add support for parameter and feature capability queries to your REST endpoint,
implement the `supportedQueryParameters` and `supportedCapabilities` methods in your rest handler.

To perform a capability query, perform a REST call to the `_capabilities` API,
with parameters `method`, `path`, `parameters`, `capabilities`.
The call will query every node in the cluster, and return `{supported: true}`
if all nodes support that specific combination of method, path, query parameters,
and endpoint capabilities. If any single aspect is not supported,
the query will return `{supported: false}`. If there are any problems
communicating with nodes in the cluster, the response will be `{supported: null}`
indicating support or lack thereof cannot currently be determined.
Capabilities can be checked using the clusterHasCapability method in ESRestTestCase.

Similar to cluster features, YAML tests can have skip and requires conditions
specified with capabilities like the following:

- requires:
capabilities:
- method: GET
path: /_endpoint
parameters: [param1, param2]
capabilities: [cap1, cap2]

method: GET is the default, and does not need to be explicitly specified.

0 comments on commit acd4f07

Please sign in to comment.