Failed to recover after node took the drive (volume) offline #615
Comments
Hey @SidAtQuanos, sorry for the very late response.
Hi @apricote , thanks for looking into this. Honestly, one of the first things we did when this error happened with the "VolumeAutoScaler" active was to disable the volume autoscaling, and we now try to avoid frequent autoscales like the ones we had when those two errors came up. So unfortunately I do not have live logs available. Would this information be sufficient?
Is this the VolumeAutoScaler you were using? I can try to reproduce the issue locally with it, but it sounds like this is very flaky and might not happen for me :( https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler
Yes, I can confirm this is the volume autoscaler. This is the configuration: verbose: "false". The minio deployment had 6 pods/PVCs, and the start size of each volume was 10 Gi. The target size at which the autoscale stopped was around 241484536217 bytes; another one was at 314069483520 bytes. I copied over 450 Gi of data from another cluster using mc. I tried to stop and restart the copy script in time to give the autoscaler time to react (whenever all PVCs reached more than 80% usage according to Grafana); I was not always successful, and on the first attempt a disk ran full before I noticed. After all files had been copied to minio, all but one PVC had its final size. One PVC was at around 81% or 83% usage and was stuck in a loop of eternal resizes due to the disk being offline. It went through quite a few iterations, and only 1 out of 6 PVCs was affected, in its last resize iteration. Maybe this helps to understand the setup/pressure.
Maybe this approach could work: rather than trying hard to reproduce the exact same scenario (or something very similar), take the disk offline manually once it is at more than 80% usage. Installing the autoscaler only after taking the disk offline might be sufficient to still have it trigger the resize and therefore the csi-driver. A rough sketch of taking the disk offline follows.
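A minimal sketch of the manual reproduction step, assuming the volume backing the PVC shows up as the SCSI device /dev/sdb on the node (the device name is an assumption, not from the report). Writing "offline" to the device's sysfs state file is the same mechanism the kernel uses when it takes a drive offline; afterwards writes from minio and resize2fs should start failing.

```go
// Sketch only: force a SCSI-backed volume offline via sysfs so that a
// later filesystem resize fails, mimicking the reported situation.
// The device name "sdb" is an assumption; adjust it to the device that
// actually backs the affected PVC. Must be run as root on the node.
package main

import (
	"fmt"
	"os"
)

func main() {
	statePath := "/sys/block/sdb/device/state" // assumed device name

	if err := os.WriteFile(statePath, []byte("offline\n"), 0644); err != nil {
		fmt.Fprintf(os.Stderr, "failed to take device offline: %v\n", err)
		os.Exit(1)
	}
	fmt.Println("device marked offline; subsequent writes and resize2fs should fail")
}
```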
TL;DR
Hi there,
we had a Kubernetes volume autoscaling mechanism (devops-nirvana) active in our cluster. For a while the autoscaling worked just fine, but at some point the operating system of the Kubernetes node took the drive offline. The CSI driver failed to recognize the offline state and continued to increase the PVC.
Expected behavior
The CSI driver should be able to react to offline drives by bringing the drive back online, so that the subsequent resize2fs actually succeeds. Alternatively, the driver should not allow the PVC to be increased while the corresponding drive is offline; a minimal sketch of such a check follows.
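A minimal sketch of the kind of guard described above, not the driver's actual code: before running resize2fs, read the backing device's sysfs state and refuse to resize if the kernel has marked it offline. The device name and the placement of the check are assumptions.

```go
// Hypothetical guard (not the csi-driver's real implementation): refuse to
// run a filesystem resize when the kernel reports the backing SCSI device
// as anything other than "running".
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// deviceOnline reports whether /sys/block/<device>/device/state says "running".
func deviceOnline(device string) (bool, error) {
	statePath := filepath.Join("/sys/block", device, "device", "state")
	raw, err := os.ReadFile(statePath)
	if err != nil {
		return false, fmt.Errorf("reading %s: %w", statePath, err)
	}
	return strings.TrimSpace(string(raw)) == "running", nil
}

func main() {
	// "sdb" is an assumed device name for the affected volume.
	ok, err := deviceOnline("sdb")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if !ok {
		fmt.Fprintln(os.Stderr, "refusing to resize: backing device is offline")
		os.Exit(1)
	}
	fmt.Println("device is online, safe to proceed with resize2fs")
}
```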
Observed behavior
PVC got increased.
PV got increased.
resize2fs was triggered but failed, and the next iteration started (the autoscaler checks the existing filesystem capacity, which never changed, so the PVC got increased again and again; see the sketch after this list).
The CSI driver seems to be completely unaware of these problems (see second log snippet).
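For illustration only, a toy loop showing why the resizes never terminate under the assumed autoscaler logic: usage is computed against the filesystem capacity, and since resize2fs keeps failing that capacity never grows, so every iteration requests another expansion. The numbers and the 20% growth step are made up.

```go
// Toy model of the observed resize loop (assumed autoscaler behaviour, not
// its real code): the filesystem never grows because resize2fs fails on the
// offline drive, so the usage ratio stays above the threshold forever.
package main

import "fmt"

func main() {
	const threshold = 0.80 // scale when usage exceeds 80%, as in the report

	usedBytes := int64(8_800_000_000)   // data stored on the volume
	fsCapacity := int64(10_000_000_000) // capacity resize2fs last achieved
	requestedPVC := fsCapacity          // size currently requested in the PVC spec

	for i := 1; i <= 5; i++ {
		usage := float64(usedBytes) / float64(fsCapacity)
		if usage <= threshold {
			break
		}
		// The PVC and PV grow, but the offline drive makes resize2fs fail,
		// so fsCapacity stays constant and the loop never exits.
		requestedPVC = requestedPVC * 120 / 100
		fmt.Printf("iteration %d: usage %.0f%%, PVC grown to %d bytes, filesystem still %d bytes\n",
			i, usage*100, requestedPVC, fsCapacity)
	}
}
```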
In the minio pod log we saw these messages:
Error: node(: taking drive /export offline: unable to write+read for 2m0.001s (*errors.errorString)
3: internal/logger/logger.go:248:logger.LogAlwaysIf()
2: cmd/xl-storage-disk-id-check.go:1066:cmd.(*xlStorageDiskIDCheck).monitorDiskWritable.func1.1()
1: cmd/xl-storage-disk-id-check.go:1085:cmd.(*xlStorageDiskIDCheck).monitorDiskWritable.func1.2()
Minimal working example
No response
Log output
Looked up in journalctl
Additional information
Kubernetes: 1.29.1
csi driver: 2.6.0