Throttle handling of shoot events in the frontend #1637

Merged 82 commits on Dec 7, 2023

@holgerkoser (Member) commented Nov 8, 2023

What this PR does / why we need it:

Throttling Shoot Events in the Frontend

When managing large lists of shoot clusters, every modification to any shoot cluster was previously dispatched individually to the frontend over a socket.io connection, and each of these events carried the full shoot manifest. As a result, the list in the store was constantly re-sorted and completely replaced. With this PR, only the UID of the modified shoot cluster is transmitted. The modified UIDs are accumulated in the store and synchronized in throttled batches, with a delay between 0.2 and 30 seconds. The synchronization itself is also performed through a socket.io message. On the backend, the shoots for the requested UIDs are read from the watch cache, and a cluster manifest is relayed to the frontend only if the socket has joined a room associated with the relevant namespace. In the frontend, all modifications are then applied within a single shootStore.$patch function, so the list is rendered once instead of multiple times.
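The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code of this PR: the helper names (`createUidBuffer`, `selectShootsForSocket`, `applyShoots`) and the room name format `shoots;<namespace>` are assumptions made for the example. The only library fact relied on is that `socket.rooms` is a `Set` in socket.io v3 and later.

```javascript
// Hypothetical sketch: all names and the room format are assumptions.

// Frontend: collect UIDs of changed shoots; duplicates collapse in the Set.
function createUidBuffer () {
  const pending = new Set()
  return {
    add (uid) {
      pending.add(uid)
    },
    // Return all pending UIDs and reset the buffer for the next interval.
    drain () {
      const uids = [...pending]
      pending.clear()
      return uids
    },
  }
}

// Backend: resolve UIDs from a cache (a Map stands in for the watch cache)
// and keep only shoots whose namespace room the socket has joined.
function selectShootsForSocket (socket, uids, cache) {
  const shoots = []
  for (const uid of uids) {
    const shoot = cache.get(uid)
    if (shoot && socket.rooms.has(`shoots;${shoot.metadata.namespace}`)) {
      shoots.push(shoot)
    }
  }
  return shoots
}

// Frontend: apply all received shoots in one mutation. In a Pinia store this
// loop would run inside one `$patch(state => { ... })` call.
function applyShoots (state, shoots) {
  for (const shoot of shoots) {
    state.shoots[shoot.metadata.uid] = shoot
  }
}
```

Keeping the pending UIDs in a `Set` means that a shoot modified many times within one throttle interval is still requested only once.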

An experimental configuration option has been introduced to fine-tune the throttle delay per cluster. This option, found under Values.global.dashboard.frontendConfig.experimental.throttleDelayPerCluster, allows administrators to set the base delay in milliseconds per cluster. The synchronization throttle is adjusted dynamically based on the number of active clusters, which helps performance and resource utilization in environments with a varying number of clusters.
Note: As this is an experimental feature, its behavior and effectiveness are subject to evaluation and potential future adjustments.
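As an illustration of how a per-cluster delay could translate into a throttle delay, here is a minimal sketch. The linear scaling and the clamping to the 0.2 to 30 second range are assumptions for the example; the actual formula used by the dashboard may differ. The function name `computeThrottleDelay` is hypothetical.

```javascript
// Hypothetical sketch: linear scaling and clamping are assumptions.
const MIN_DELAY_MS = 200 // lower bound of 0.2 seconds
const MAX_DELAY_MS = 30000 // upper bound of 30 seconds

// delayPerClusterMs corresponds to the throttleDelayPerCluster option.
function computeThrottleDelay (clusterCount, delayPerClusterMs) {
  const raw = clusterCount * delayPerClusterMs
  return Math.min(MAX_DELAY_MS, Math.max(MIN_DELAY_MS, raw))
}
```

Under these assumptions, 1000 clusters with a per-cluster delay of 10 ms would result in a throttle delay of 10 seconds, while very small or very large lists are held within the bounds.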

Removal of the key property in the g-shoot-list-row component

With the key property removed, instances of the g-shoot-list-row component, along with all their children, are no longer destroyed and recreated each time the shoot list is paged or re-sorted. This change has significantly reduced the excessive memory turnover that previously overwhelmed the garbage collector.

Experimental Feature: Control of watch cache for list shoots requests

This feature allows fine-tuning whether list shoots requests are served from the watch cache.
Note: As this is an experimental feature, it is subject to change and may not be stable.
Possible values for Values.global.dashboard.experimentalUseWatchCacheForListShoots are:

  • never: All list shoots requests are directly forwarded to the kube-apiserver without using the watch cache.
  • no: The watch cache is not utilized by default. However, clients can opt-in to use the watch cache on a per-request basis.
  • yes: The watch cache is used by default for serving list shoots requests. Clients can opt-out if they require data directly from the kube-apiserver.
  • always: All list shoots requests are served from the watch cache, reducing the load on the kube-apiserver.

The default value is currently never.

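The semantics of the four values can be summarized in a small decision function. This is only a sketch: the function name `shouldUseWatchCache` is hypothetical, and how a client expresses its opt-in or opt-out is not described here, so it is modeled as a boolean that is `undefined` when the client states no preference.

```javascript
// Hypothetical sketch: the clientPreference parameter is an assumption.
// mode: 'never' | 'no' | 'yes' | 'always'
// clientPreference: true (opt-in), false (opt-out), undefined (no preference)
function shouldUseWatchCache (mode, clientPreference) {
  switch (mode) {
    case 'never':
      return false // always forward to the kube-apiserver
    case 'no':
      return clientPreference === true // off by default, client may opt in
    case 'yes':
      return clientPreference !== false // on by default, client may opt out
    case 'always':
      return true // always serve from the watch cache
    default:
      throw new TypeError(`Unknown mode: ${mode}`)
  }
}
```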
Reduce network traffic

The synchronization of updated shoots is no longer performed when the browser tab is in the background. This significantly reduces network traffic when users have multiple tabs open or leave them open overnight.
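A minimal sketch of such a visibility check, based on the standard Page Visibility API (`document.visibilityState` and the `visibilitychange` event). The helper name `createVisibilityAwareSync` is hypothetical, and catching up as soon as the tab becomes visible again is an assumption of this example, not something this PR states.

```javascript
// Hypothetical sketch: wraps a synchronize callback so that it only runs
// while the document is visible.
function createVisibilityAwareSync (doc, synchronize) {
  let pending = false

  const run = () => {
    if (doc.visibilityState !== 'visible') {
      pending = true // remember that a synchronization was skipped
      return false
    }
    pending = false
    synchronize()
    return true
  }

  // Assumption: catch up once when the tab becomes visible again.
  doc.addEventListener('visibilitychange', () => {
    if (pending && doc.visibilityState === 'visible') {
      run()
    }
  })

  return run
}
```

In the browser this would be called as `createVisibilityAwareSync(document, () => { /* emit synchronize message */ })`; passing the document as a parameter keeps the helper testable outside the browser.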

CPU and Memory comparison

When all projects were displayed in the cluster list, the CPU usage consistently exceeded 98% (Process ID 89141). Following the modification, the CPU usage dropped to less than 0.5% (Process ID 88943). Memory usage has also decreased; previously, memory was allocated faster than the garbage collector could free it.

Screenshot 2023-11-28 at 16 48 36

Performance comparison

Before the Change

The component instance update was being triggered almost constantly, because events that altered the shoot list were being processed continuously.
Screenshot 2023-11-21 at 13 22 44

An instance update took 2.41 seconds. The vast majority of this time was spent sorting and filtering the shoot list.
Screenshot 2023-11-21 at 13 23 23

Filtering was being performed even though no filter was specified. This took about 2.06 seconds.
Screenshot 2023-11-21 at 13 46 36

After the Change

The update of the shoot list is now dynamically throttled and occurs much less frequently.
Screenshot 2023-11-21 at 13 14 52

The instance update now takes 150 ms.
Screenshot 2023-11-21 at 13 15 38

The order of sorting and filtering has been reversed, and the filtering has been moved to its own getter to automatically cache the filtered items. Filtering is only performed when a filter is specified. Otherwise, the items are returned directly. The sorting, especially the retrieval of sorting values, has been optimized, making it significantly faster.
Sorting by e.g. ISSUE SINCE now takes approximately 10 ms and can be further optimized if necessary.
Screenshot 2023-11-21 at 13 17 27
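The two optimizations can be illustrated with a small sketch. It assumes that filtering now runs before sorting and that "retrieval of sorting values" means computing each item's sort value once rather than inside every comparison. The function names `filterItems` and `sortItems` are hypothetical and not taken from the store implementation.

```javascript
// Hypothetical sketch: names and exact strategy are assumptions.

// Return the items unchanged when no filter is specified, so that no
// per-item work is done in the common unfiltered case.
function filterItems (items, predicate) {
  return typeof predicate === 'function' ? items.filter(predicate) : items
}

// Compute the sort value once per item, sort by the precomputed values,
// then unwrap. The input array is not mutated.
function sortItems (items, getSortValue, descending = false) {
  const decorated = items.map(item => ({ item, value: getSortValue(item) }))
  decorated.sort((a, b) => {
    if (a.value < b.value) return descending ? 1 : -1
    if (a.value > b.value) return descending ? -1 : 1
    return 0
  })
  return decorated.map(entry => entry.item)
}
```

With an expensive `getSortValue`, this reduces the number of value lookups from roughly one pair per comparison to exactly one per item.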

Network activity comparison

The initial loading of all clusters has also been improved by serving the data from the gardener dashboard watch cache.

Before the Change

Previously, loading approximately 3200 clusters from the kube-apiserver through the dashboard backend into the browser took over 14 seconds. The amount of data transferred was 7.8MB.
Screenshot 2023-11-28 at 16 41 46

After the Change

Now, loading approximately 3200 clusters from the watch cache in the dashboard backend into the browser takes about 700 ms. The amount of data transferred has been reduced to 3.0MB.
Screenshot 2023-11-28 at 16 39 54

Which issue(s) this PR fixes:
Fixes #1636 Fixes #1644 Fixes #1653 Fixes #1654

Special notes for your reviewer:

Release note:

Performance and memory usage of the shoot list have been improved for landscapes with a large number of clusters. In the past, under heavy load, there were repeated instances where the dashboard became unresponsive due to very high memory consumption. The improvement has been achieved by implementing the following changes:
* Throttling of shoot events in the frontend. 
  Now, only the `uid` of the modified object is sent to the client, coupled with periodic synchronization of associated shoots.
* Removal of the key property in the `g-shoot-list-row` component 
* Improved performance of sorting and filtering implementation
* Faster response times for list shoots requests (experimental: must be enabled by an operator)
* Reduced network traffic for invisible browser tabs 
Experimental Features:
* Enhanced Watch Cache Control for List Shoots Requests. 
  We've introduced a new feature to fine-tune caching behavior for list shoots requests. A new configuration option, `Values.global.dashboard.experimentalUseWatchCacheForListShoots`, has been added to the `gardener-dashboard` Helm chart. This allows for more precise control over caching with four settings: `never`, `no`, `yes`, and `always`. By default, this is set to `never`. As an experimental feature, we welcome feedback and suggest caution in production environments.
* Fine-tune the throttle delay per cluster.
  This option, found under `Values.global.dashboard.frontendConfig.experimental.throttleDelayPerCluster`, allows administrators to set the base delay in milliseconds per cluster. The synchronization throttle is adjusted dynamically based on the number of active clusters, which helps performance and resource utilization in environments with a varying number of clusters.

@gardener-robot gardener-robot added the needs/review Needs review label Nov 8, 2023
@holgerkoser holgerkoser marked this pull request as draft November 8, 2023 16:29
@gardener-robot gardener-robot added size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else labels Nov 8, 2023
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Nov 8, 2023
@holgerkoser holgerkoser changed the title Throttle handling of shoot events in the frontend [DRAFT] Throttle handling of shoot events in the frontend Nov 8, 2023
@gardener-robot-ci-3 gardener-robot-ci-3 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Nov 8, 2023
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Nov 9, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Nov 9, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Dec 7, 2023
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Dec 7, 2023
…/fix-1636

* 'bug/fix-1636' of github.com:gardener/dashboard:
  switch GCR -> Artifact-Registry (#1645)
  Replaced absolute links with relative (#1643)
  Fix: Show machine image description and hints (#1635)
  Improve table column selection menu (#1642)
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Dec 7, 2023
@petersutter (Member) left a comment

/lgtm

very nice improvement!! 🚀

@gardener-robot gardener-robot added reviewed/lgtm Has approval for merging and removed needs/review Needs review needs/second-opinion Needs second review by someone else labels Dec 7, 2023
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 7, 2023
@gardener-robot gardener-robot added needs/second-opinion Needs second review by someone else and removed reviewed/lgtm Has approval for merging labels Dec 7, 2023
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Dec 7, 2023
@holgerkoser holgerkoser merged commit d51b4b0 into master Dec 7, 2023
9 checks passed
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Dec 7, 2023
@holgerkoser holgerkoser deleted the bug/fix-1636 branch December 7, 2023 16:51
Labels
area/ipcei IPCEI (Important Project of Common European Interest) needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/second-opinion Needs second review by someone else size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
7 participants