
Brupop is stuck at RebootedIntoUpdate state #650

Open
Gaurav2586 opened this issue Jul 19, 2024 · 6 comments

Gaurav2586 commented Jul 19, 2024

NAME                                         STATE                      VERSION   TARGET STATE         TARGET VERSION   CRASH COUNT
brs-ip-10-21-x-x.ec2.internal   Idle                       1.20.2    Idle                 <no value>       0
brs-ip-10-21-x-x.ec2.internal   StagedAndPerformedUpdate   1.20.2    RebootedIntoUpdate   1.20.3           0

Brupop has been stuck at RebootedIntoUpdate for a long time, and nothing can be seen in the logs related to this status and error.

Logs -

spec: BottlerocketShadowSpec { state: Idle, state_transition_timestamp: None, version: None }, status: Some(BottlerocketShadowStatus { current_version: "1.20.2", target_version: "1.20.3", current_state: Idle, crash_count: 0, state_transition_failure_timestamp: None }) }, state: Idle, shadow_error_info: ShadowErrorInfo { crash_count: 0, state_transition_failure_timestamp: None }

Note: Out of 3 nodes, 1 node was updated successfully; this was the second node.
We also have PDBs configured. Will it not work with PDBs?

Is this because of this configuration?

Namespace:        dev
Max unavailable:  1
Selector:         app.kubernetes.io/instance=xyz-dev,app.kubernetes.io/name=xyz
Status:
    Allowed disruptions:  0        <<<<<<
    Current:              0
    Desired:              2
    Total:                3
Events:                   <none>

cbgbt commented Jul 19, 2024

I'll take a look. Do you mind sharing which version of Brupop this is using?


cbgbt commented Jul 19, 2024

This configuration:

NAME        STATE                      VERSION   TARGET STATE         TARGET VERSION   CRASH COUNT
$HOSTNAME   StagedAndPerformedUpdate   1.20.2    RebootedIntoUpdate   1.20.3           0

Means that your host "staged" the update. It's installed to the alternate disk partition and Bottlerocket is ready to flip to it upon reboot. The host is attempting to move into the RebootedIntoUpdate state.

In order to enter the rebooted state, the host:

  • cordons the node, disallowing new tasks from being scheduled to it
  • drains workloads running on the node, while respecting pod disruption budgets
  • reboots to switch to the new update
  • uncordons the node, allowing it to accept pod deployments
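
While that's in progress, the cordon and drain should be visible from the cluster side with standard kubectl commands, for example (the node name below is a placeholder for the stuck node):

kubectl get node ip-10-21-x-x.ec2.internal        # STATUS shows SchedulingDisabled while cordoned
kubectl get pods --all-namespaces --field-selector spec.nodeName=ip-10-21-x-x.ec2.internal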

Is this because of this configuration?

...
    Allowed disruptions:  0        <<<<<<
...

Yes, I think so. Brupop respects your PDB, so at the moment it's probably attempting to evict a protected pod, but Kubernetes is not allowing any disruptions. The reason why would become clearer if you shared your PDB's spec, along with more information about which pods are running where and their current status.

If you want more logs from Brupop's side, the drain is performed by one of Brupop's apiserver pods, so those pods should have any relevant logs for that operation.
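
For example, something along these lines should pull those logs (this assumes Brupop's default brupop-bottlerocket-aws namespace and apiserver deployment name; adjust to match your install):

kubectl get pods -n brupop-bottlerocket-aws
kubectl logs -n brupop-bottlerocket-aws deployment/brupop-apiserver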


Gaurav2586 commented Jul 19, 2024

PDB specs for two of my services; these services are running on the same node, which is stuck at the RebootedIntoUpdate state.

Name:             xyz
Namespace:        dev
Max unavailable:  1
Selector:         app.kubernetes.io/instance=xyz-dev,app.kubernetes.io/name=xyz
Status:
    Allowed disruptions:  0
    Current:              0       <<< this service is crashing at this time
    Desired:              4
    Total:                5
Events:                   <none>


Name:             abc
Namespace:        dev
Max unavailable:  1
Selector:         app.kubernetes.io/instance=abc-dev,app.kubernetes.io/name=abc
Status:
    Allowed disruptions:  0
    Current:              1
    Desired:              1
    Total:                2
Events:                   <none>

Is there any solution for this kind of situation? It's very unlikely that all services will always run in the desired state, especially in lower environments where people frequently experiment and test.

It looks like if the service's current running pod count is 0, then Allowed disruptions will also be 0, so Brupop never completes its upgrade task and stays stuck.
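
If I understand the Kubernetes PDB behavior correctly, that matches the math for the xyz budget above:

desired healthy     = total - max unavailable   = 5 - 1 = 4
current healthy     = 0   (no Ready pods)
allowed disruptions = max(0, current - desired) = max(0, 0 - 4) = 0

So no eviction is allowed until all 5 pods are Ready again.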


cbgbt commented Jul 19, 2024

Brupop interacts with PDBs by making an eviction request to the Kubernetes API; the API then responds differently depending on the state of the target pod, its PDBs, etc.

Here's the code that handles draining and PDBs.

So basically:

  • Brupop says: "please evict service pod abcdefg"
  • Kubernetes responds with code 429: "This request is not allowed due to a PDB"
  • Brupop, not wanting to clobber a running service, waits to try again later

There's not really any additional information provided during this interaction. Brupop assumes that the PDB's configuration must be satisfied and therefore waits to attempt the eviction again later, when the cluster state may have changed such that the PDB is no longer violated.
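
You can reproduce the same interaction by hand with kubectl drain, which uses the same eviction API (node name is a placeholder); it should fail with something like:

kubectl drain ip-10-21-x-x.ec2.internal --ignore-daemonsets --delete-emptydir-data
# error when evicting pod ... Cannot evict pod as it would violate the pod's disruption budget.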

I suppose my advice here would be that the cluster needs to return to a state in which Brupop's drain would not appear to Kubernetes as though it were disrupting a PDB. Perhaps the unhealthy service should trigger a rollback to a healthy state?
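
For example (the deployment name here is just illustrative), rolling the crashing service back to its last healthy revision would bring the PDB's healthy count back up:

kubectl rollout undo deployment/xyz -n dev
kubectl get pdb xyz -n dev    # Allowed disruptions should rise above 0 once the pods are Ready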


cbgbt commented Jul 19, 2024

Another alternative for lower-stakes dev environments could be to specifically remove the PDBs in those environments.
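
For example (names taken from the output above), something like this would let the drain proceed, and the PDBs can be re-created afterwards:

kubectl delete pdb xyz abc -n dev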


Gaurav2586 commented Jul 23, 2024

I understand that Brupop respects the PDB, but what's the point of respecting the PDB of a service that is running 0 pods because of CrashLoopBackOff and has restarted 100+ times? It's very unlikely that all pods will be in the Running state all the time, and if pods are not running, the allowed disruptions will always be 0, so Brupop gets stuck in the upgrade process. In my case, Brupop is stuck because of pods in CrashLoopBackOff status.

Note: the example below is different from the CrashLoopBackOff case.

Here is where my Brupop is stuck:

abc-wtjj2   1/1     Running
abc-v86ws   0/1     Running

PDB describe -
Name:             abc
Namespace:        dev
Max unavailable:  1
Selector:         app.kubernetes.io/instance=abc-dev,app.kubernetes.io/name=abc
Status:
    Allowed disruptions:  0
    Current:              1
    Desired:              1
    Total:                2
Events:                   <none>
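
If I'm reading the PDB status correctly, the 0/1 pod is Running but not Ready, so it doesn't count as healthy, which is why allowed disruptions is still 0:

desired healthy     = total - max unavailable = 2 - 1 = 1
current healthy     = 1   (only the 1/1 Ready pod counts)
allowed disruptions = 1 - 1 = 0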
