
Brupop is stuck at RebootedIntoUpdate state #650

Open
Gaurav2586 opened this issue Jul 19, 2024 · 6 comments

Gaurav2586 commented Jul 19, 2024

NAME                                         STATE                      VERSION   TARGET STATE         TARGET VERSION   CRASH COUNT
brs-ip-10-21-x-x.ec2.internal   Idle                       1.20.2    Idle                 <no value>       0
brs-ip-10-21-x-x.ec2.internal   StagedAndPerformedUpdate   1.20.2    RebootedIntoUpdate   1.20.3           0

Brupop has been stuck at RebootedIntoUpdate for a long time, and nothing can be seen in the logs related to this status and error.

Logs -

spec: BottlerocketShadowSpec { state: Idle, state_transition_timestamp: None, version: None }, status: Some(BottlerocketShadowStatus { current_version: "1.20.2", target_version: "1.20.3", current_state: Idle, crash_count: 0, state_transition_failure_timestamp: None }) }, state: Idle, shadow_error_info: ShadowErrorInfo { crash_count: 0, state_transition_failure_timestamp: None }

Note: Out of 3 nodes, 1 node was updated successfully; this was the second node.
We also have PDBs configured. Will it not work with PDBs?

Is this because of this configuration?

Namespace:        dev
Max unavailable:  1
Selector:         app.kubernetes.io/instance=xyz-dev,app.kubernetes.io/name=xyz
Status:
    Allowed disruptions:  0        <<<<<<
    Current:              0
    Desired:              2
    Total:                3
Events:                   <none>

cbgbt commented Jul 19, 2024

I'll take a look. Do you mind sharing which version of Brupop this is using?


cbgbt commented Jul 19, 2024

This configuration:

NAME        STATE                      VERSION   TARGET STATE         TARGET VERSION   CRASH COUNT
$HOSTNAME   StagedAndPerformedUpdate   1.20.2    RebootedIntoUpdate   1.20.3           0

Means that your host "staged" the update. It's installed to the alternate disk partition and Bottlerocket is ready to flip to it upon reboot. The host is attempting to move into the RebootedIntoUpdate state.

In order to enter the rebooted state, the host:

  • cordons the node, disallowing new tasks from being scheduled to it
  • drains workloads running on the node, while respecting pod disruption budgets
  • reboots to switch to the new update
  • uncordons the node, allowing it to accept pod deployments
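
While that's in progress, the cordon and drain should be visible from the cluster side with standard kubectl commands, for example (the node name below is a placeholder for the stuck node):

kubectl get node ip-10-21-x-x.ec2.internal        # STATUS shows SchedulingDisabled while cordoned
kubectl get pods --all-namespaces --field-selector spec.nodeName=ip-10-21-x-x.ec2.internal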

Is this because of this configuration?

...
    Allowed disruptions:  0        <<<<<<
...

Yes, I think so. Brupop respects your PDB, so at the moment it's probably attempting to evict a protected pod, but Kubernetes is not allowing any disruptions. The reason why would become clearer if you shared your PDB's spec, along with more information about which pods are running where and their current status.

If you want more logs from Brupop's side, the drain is performed by one of Brupop's apiserver pods, so those pods should have any relevant logs for that operation.
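
For example, something along these lines should pull those logs (this assumes Brupop's default brupop-bottlerocket-aws namespace and apiserver deployment name; adjust to match your install):

kubectl get pods -n brupop-bottlerocket-aws
kubectl logs -n brupop-bottlerocket-aws deployment/brupop-apiserver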


Gaurav2586 commented Jul 19, 2024

PDB specs for two of my services; these services are running on the same node, which is stuck at the RebootedIntoUpdate state.

Name:             xyz
Namespace:        dev
Max unavailable:  1
Selector:         app.kubernetes.io/instance=xyz-dev,app.kubernetes.io/name=xyz
Status:
    Allowed disruptions:  0
    Current:              0       <<< this service is crashing at this time
    Desired:              4
    Total:                5
Events:                   <none>


Name:             abc
Namespace:        dev
Max unavailable:  1
Selector:         app.kubernetes.io/instance=abc-dev,app.kubernetes.io/name=abc
Status:
    Allowed disruptions:  0
    Current:              1
    Desired:              1
    Total:                2
Events:                   <none>

Is there any solution for this kind of situation? It's very unlikely that all services will always run in the desired state, especially in lower environments where people frequently experiment and test.

It looks like if the service's current running pod count is 0, then Allowed disruptions will also be 0, so Brupop never completes its upgrade task and stays stuck.
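
If I understand the Kubernetes PDB behavior correctly, that matches the math for the xyz budget above:

desired healthy     = total - max unavailable   = 5 - 1 = 4
current healthy     = 0   (no Ready pods)
allowed disruptions = max(0, current - desired) = max(0, 0 - 4) = 0

So no eviction is allowed until all 5 pods are Ready again.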


cbgbt commented Jul 19, 2024

Brupop interacts with PDBs by making an eviction request to the Kubernetes API; the API then responds differently depending on the state of the target pod, its PDBs, etc.

Here's the code that handles draining and PDBs.

So basically:

  • Brupop says: "please evict service pod abcdefg"
  • Kubernetes responds with code 429: "This request is not allowed due to a PDB"
  • Brupop, not wanting to clobber a running service, waits to try again later

There's not really any additional information provided during this interaction. Brupop assumes that the PDB's configuration must be satisfied and therefore waits to attempt the eviction again later, when the cluster state may have changed such that the PDB is no longer violated.
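
You can reproduce the same interaction by hand with kubectl drain, which uses the same eviction API (node name is a placeholder); it should fail with something like:

kubectl drain ip-10-21-x-x.ec2.internal --ignore-daemonsets --delete-emptydir-data
# error when evicting pod ... Cannot evict pod as it would violate the pod's disruption budget.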

I suppose my advice here would be that the cluster needs to return to a state in which Brupop's drain would not appear to Kubernetes as though it were disrupting a PDB. Perhaps the unhealthy service should trigger a rollback to a healthy state?
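
For example (the deployment name here is just illustrative), rolling the crashing service back to its last healthy revision would bring the PDB's healthy count back up:

kubectl rollout undo deployment/xyz -n dev
kubectl get pdb xyz -n dev    # Allowed disruptions should rise above 0 once the pods are Ready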


cbgbt commented Jul 19, 2024

Another alternative for lower-stakes dev environments could be to specifically remove the PDBs in those environments.
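
For example (names taken from the output above), something like this would let the drain proceed, and the PDBs can be re-created afterwards:

kubectl delete pdb xyz abc -n dev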


Gaurav2586 commented Jul 23, 2024

I understand that Brupop respects the PDB, but what's the point of respecting the PDB of a service that is running 0 pods because of CrashLoopBackOff and has restarted 100+ times? It's very unlikely that all pods will be in the Running state all the time, and if pods are not running, the allowed disruptions will always be 0, so Brupop gets stuck in the upgrade process. In my case, Brupop is stuck because of pods in CrashLoopBackOff status.

Note: the example below is different from the CrashLoopBackOff case.

Here is where my Brupop is stuck:

abc-wtjj2   1/1     Running
abc-v86ws   0/1     Running

PDB describe -
Name:             abc
Namespace:        dev
Max unavailable:  1
Selector:         app.kubernetes.io/instance=abc-dev,app.kubernetes.io/name=abc
Status:
    Allowed disruptions:  0
    Current:              1
    Desired:              1
    Total:                2
Events:                   <none>
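
If I'm reading the PDB status correctly, the 0/1 pod is Running but not Ready, so it doesn't count as healthy, which is why allowed disruptions is still 0:

desired healthy     = total - max unavailable = 2 - 1 = 1
current healthy     = 1   (only the 1/1 Ready pod counts)
allowed disruptions = 1 - 1 = 0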
