
[Standard] KaaS default StorageClass v2 (scs-0211-v2) #658

Merged 16 commits into main from 652-kaas-default-storageclass-v2 on Aug 29, 2024

Conversation

martinmo (Member)

Resolves: #652

@martinmo martinmo self-assigned this Jul 16, 2024
Signed-off-by: Martin Morgenstern <[email protected]>
@martinmo martinmo marked this pull request as ready for review July 17, 2024 13:30
martinmo (Member Author)

@joshmue, can you please have a look at this? @mbuechse kindly agreed to have a look at the formal aspects.

My plan is to proceed with the conformance tests once we have merged this v2 as a draft. Then, when the conformance tests are ready and merged, we can proceed to stabilize it.

joshmue (Contributor) left a comment


LGTM, except...

Review comment on Standards/scs-0211-v2-kaas-default-storage-class.md (outdated, resolved)
Co-authored-by: Joshua Mühlfort <[email protected]>
Signed-off-by: Martin Morgenstern <[email protected]>
mbuechse (Contributor) left a comment

Please add something about previous versions, as in https://docs.scs.community/standards/scs-0100-v3-flavor-naming: first, one sentence in the intro about this version, and then one section further below about previous versions and what has changed compared to those.

@martinmo martinmo requested a review from mbuechse July 19, 2024 13:42
Comment on lines 47 to 51
Previously, the backing storage of the default storage class was required to be protected
against data loss caused by a physical disk or host failure.
It also contained recommendations (MAY) with regard to redundant storage across hosts
or availability zones.
In this revision, these requirements and recommendations have been dropped.
Contributor:

This clarification actually confused me a bit, as I was under the impression that protection "against data loss caused by a physical disk or host failure" was intended to be covered by "MUST NOT be backed by local or ephemeral storage".

Regarding edge setups, you concluded in #652:

  • the requirement of "not being backed by local storage" should remain
    ...
  • there was a quick discussion whether this is too strict from a standards point of view
    • example: edge clouds probably cannot fulfill this -> but this is a very special case and might be solved by another hypothetical certificate scope for edge environments

There are not that many options (or use cases) for implementing storage that is NOT backed by local storage, yet NOT protected against data loss due to disk/host failure.
So this concession to CSPs will probably create little value for CSPs, yet it would reduce the value for users who may want to rely on SCS standards as "sensible/common defaults". I do not know of any major block storage cloud offering that does not provide basic redundancy or at least snapshotting.

martinmo (Member Author):

> This clarification actually confused me a bit, as I was under the impression that protection "against data loss caused by a physical disk or host failure" was intended to be covered by "MUST NOT be backed by local or ephemeral storage".

I removed the "physical disk or host failure" sentence because we will not be able to test this as part of the conformance checks anyway.

My reasoning behind this is: even if you use the Cinder CSI or, let's say, the Rook CSI provisioner, the backing storage could in theory be misconfigured with regard to single host or disk failure tolerance, but there is, AFAIK, no reliable way to check this from inside the K8s cluster (and with minimal privileges).

Of course, our expectation would be that this failure tolerance is configured correctly by the CSP. As a compromise I can add a sentence that documents this expectation.

In my understanding, the important bit of this standard is that a provisioned volume isn't bound to the node's lifecycle, i.e., it doesn't just use something like the rancher.io/local-path or the lvm.csi.metal-stack.io provisioner.
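
For illustration: node-bound volumes are usually recognizable on the PersistentVolume object itself, because local provisioners pin the volume to a single node via node affinity. Here is a minimal sketch using the official Kubernetes Python client; the helper name and the heuristic are mine, not part of the standard or of the planned test:

```python
from kubernetes import client, config

def pv_is_node_bound(pv: client.V1PersistentVolume) -> bool:
    """Heuristic: a PV that carries node affinity or uses a local/hostPath
    volume source is tied to a single node's lifecycle."""
    spec = pv.spec
    return (
        spec.node_affinity is not None
        or spec.local is not None
        or spec.host_path is not None
    )

config.load_kube_config()
core_v1 = client.CoreV1Api()
for pv in core_v1.list_persistent_volume().items:
    if pv_is_node_bound(pv):
        print(f"node-bound volume: {pv.metadata.name}")
```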

For the test, I currently plan to check against an allowlist of provisioners (e.g., cinder.csi.openstack.org, *.rbd.csi.ceph.com, driver.longhorn.io, may need extension in the future), because I think it is the most pragmatic approach. (Going one step further, the test could even try to mount the same PVC on two different nodes.)
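
A minimal sketch of that allowlist idea, again with the Kubernetes Python client (the provisioner patterns are the ones named above; everything else is illustrative and not the actual conformance test):

```python
import fnmatch

from kubernetes import client, config

# Provisioners considered non-local/node-independent (taken from the comment
# above; the real test's allowlist may differ and will need extension).
ALLOWED_PROVISIONERS = [
    "cinder.csi.openstack.org",
    "*.rbd.csi.ceph.com",
    "driver.longhorn.io",
]

IS_DEFAULT = "storageclass.kubernetes.io/is-default-class"

def check_default_storage_class() -> bool:
    config.load_kube_config()
    for sc in client.StorageV1Api().list_storage_class().items:
        annotations = sc.metadata.annotations or {}
        if annotations.get(IS_DEFAULT) != "true":
            continue  # not the default StorageClass
        if any(fnmatch.fnmatch(sc.provisioner, p) for p in ALLOWED_PROVISIONERS):
            print(f"OK: default class {sc.metadata.name} uses {sc.provisioner}")
            return True
        print(f"FAIL: default class {sc.metadata.name} uses {sc.provisioner}")
        return False
    print("FAIL: no default StorageClass found")
    return False

if __name__ == "__main__":
    raise SystemExit(0 if check_default_storage_class() else 1)
```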

> So this concession to CSPs will probably create little value for CSPs, yet it would reduce the value for users who may want to rely on SCS standards as "sensible/common defaults". I do not know of any major block storage cloud offering that does not provide basic redundancy or at least snapshotting.

I do not agree completely: there is value for the user if we prevent the use of a "simple" provisioner (like local-path), e.g., the volume is independent of the node and can be used on another node.

Contributor:

Regarding testability: alright, I did not know that this is such an important factor. 👍


> I do not agree completely: there is value for the user if we prevent the use of a "simple" provisioner (like local-path), e.g., the volume is independent of the node and can be used on another node.

Definitely! Yet, it's kind of a niche use case.

In my experience, one would want to have either...

  • some failure-safe volume for any kind of application, maybe a trivial non-HA database installation for non-HA use cases. While availability may be of little concern, data durability often will be, making a storage backend with some hands-off redundancy very much preferable. This is the default on most clouds, and it also works with most use cases where the application implements data high availability/durability itself.
    OR
  • some not necessarily failure-safe volume (may be local/node-bound/non-redundant, possibly fast) to run applications that take care of data durability and availability at the application level.

While offering a default storage class that is not redundant/failure-safe/replicated in any way, only non-local, would make e.g. "a lot of third-party software, such as Helm charts, [which] assume that a default storage class is configured" technically installable at first, it would probably also break these charts' assumptions about data durability, undermining this standard's motivation of an "out-of-the-box working experience" (at least some time later, if/when a disk eventually fails).

That may be the outcome (though I do not know of CSPs that are really constrained here), but no matter what, IMHO it should also be spelled out more explicitly in the "Decision" section.

@martinmo martinmo added Container Issues or pull requests relevant for Team 2: Container Infra and Tooling SCS-VP10 Related to tender lot SCS-VP10 labels Aug 2, 2024
@mbuechse mbuechse added the good first issue Good for newcomers label Aug 26, 2024
Signed-off-by: Katharina Trentau <[email protected]>
fraugabel (Contributor)

Added the omitted requirements as recommendations under Decisions.

Katharina Trentau and others added 5 commits August 28, 2024 08:40
Signed-off-by: Katharina Trentau <[email protected]>
Signed-off-by: Katharina Trentau <[email protected]>
Signed-off-by: Katharina Trentau <[email protected]>
…gnCloudStack/standards into 652-kaas-default-storageclass-v2
Signed-off-by: Katharina Trentau <[email protected]>
fraugabel (Contributor) left a comment

The necessary requirements for failure-safe storage media have been added as recommendations, as there may be use cases that allow them to be ignored. In general, these recommendations can be adhered to, and checked, by selecting the right provisioner.

@fraugabel fraugabel merged commit f2dd384 into main Aug 29, 2024
6 checks passed
@fraugabel fraugabel deleted the 652-kaas-default-storageclass-v2 branch August 29, 2024 09:30
Merging this pull request closes: [Standard] A simple, usable default StorageClass should be available in Kubernetes clusters (#652)