Add additional metrics for worker saturation analysis and scaling #11

Status: Open. Wants to merge 1 commit into master.
Conversation

@jeisen jeisen commented May 22, 2020

This adds three new metrics that can be used to implement scaling rules:

  • sidekiq_concurrency -- Total concurrency reported by sidekiq processes (i.e. max available)
  • sidekiq_available_workers -- The number of threads able to handle jobs, calculated from each sidekiq process by subtracting its busy workers from its concurrency. In the case of a "quiet" process, which is not accepting any jobs, this metric reports 0 available.
  • sidekiq_saturation -- The share of workers currently in use, expressed as a fraction of concurrency and calculated using the available_workers formula above. Using that formula means quiet processes count as fully saturated: they cannot accept any more jobs, so from a scaling perspective they should not be counted as available.
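The calculation described above can be sketched in Ruby. This is an illustration, not the PR's actual collector code: the hash keys mirror what Sidekiq::ProcessSet entries report, but here a plain boolean "quiet" flag is assumed for clarity (real Sidekiq reports it as a string).

```ruby
# Sketch of the three proposed metrics, computed from per-process stats
# shaped like Sidekiq::ProcessSet entries (key names assumed for illustration).
def saturation_metrics(processes)
  concurrency = processes.sum { |p| p["concurrency"] }
  available = processes.sum do |p|
    # A "quiet" process accepts no new jobs, so it contributes 0 availability.
    p["quiet"] ? 0 : p["concurrency"] - p["busy"]
  end
  saturation = concurrency.zero? ? 0.0 : (concurrency - available).fdiv(concurrency)
  { concurrency: concurrency, available_workers: available, saturation: saturation }
end
```

With two 10-thread processes, one busy on 4 jobs and one quieted, this yields a concurrency of 20, 6 available workers, and a saturation of 0.7: the quieted process counts as fully saturated even though only 2 of its threads are busy.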

We then scale in Kubernetes using the following metric:

(max(sidekiq_jobs_waiting_count) / max(sidekiq_concurrency) + max(sidekiq_saturation)) * 100

This lets us scale pods based on the current saturation plus the additional saturation required for jobs waiting in the queue.
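Plugging hypothetical numbers into the expression above shows how the two terms combine (the values below are made up for illustration):

```ruby
# Hypothetical inputs for the scaling expression:
jobs_waiting = 10    # max(sidekiq_jobs_waiting_count)
concurrency  = 20.0  # max(sidekiq_concurrency)
saturation   = 0.7   # max(sidekiq_saturation), as a fraction of capacity

target = (jobs_waiting / concurrency + saturation) * 100
# 10/20 + 0.7 = 1.2, so the autoscaler sees roughly 120% of current capacity needed
```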

@Envek Envek (Member) left a comment


Hi! Thank you for contributing to Yabeda!

My main concern here is that every worker process reports aggregated data about all workers (and there may be hundreds of them).

Isn't it better to report data about only this worker and let the monitoring system aggregate it? (By the way, what are you using?)

@@ -34,6 +34,10 @@ module Sidekiq
gauge :active_processes, tags: [], comment: "The number of active Sidekiq worker processes."
gauge :queue_latency, tags: %i[queue], comment: "The queue latency, the difference in seconds since the oldest job in the queue was enqueued"

gauge :concurrency, tags: [], comment: "The total number of jobs that can be run at a time across all processes."
gauge :available_workers, tags: [], comment: "The number of workers available for new jobs across all processes."
@Envek (Member):
From the metric name I can't understand whether it is about processes or threads. I need to read the collector's code below to get it. Let's clarify:

Suggested change:
- gauge :available_workers, tags: [], comment: "The number of workers available for new jobs across all processes."
+ gauge :available_threads, tags: [], comment: "The number of threads available for job processing across all processes."

@jeisen (Author):

This name was chosen to be consistent with how Sidekiq reports active workers; it looks like Sidekiq uses the term "worker" to mean "thread".


jeisen commented May 24, 2020

I originally tried to do it per-process and let Prometheus aggregate it, but Sidekiq isn’t reporting it that way so I had to do it like this.

The scaling is done with the Prometheus Exporter process making these stats available as custom metrics for the Kubernetes Horizontal Pod Autoscaler. To avoid it overloading Prometheus with queries, it’s set up as a service-level metric instead of per-pod.

@hbontempo-br

Hi!

I saw the feature, and the total number of threads is especially interesting. (We already have the busy sidekiq_active_workers_count, so the available and saturation metrics are not strictly required, since we can derive them from other metrics.)

Quoting @Envek: "My main concern here is that every worker process reports aggregated data about all workers (and there may be hundreds of them)."

From what I can see, there was no collect_cluster_metrics configuration yet when this PR was opened.
Since we have it now, I believe this is not as big a concern as it was before, right?

@Envek are you open to considering this feature again? Maybe a bit smaller (just a sidekiq_workers_count) and under the collect_cluster_metrics configuration?
I'm open to helping with the code if you think the addition is valid and want help.
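A minimal sketch of what that smaller addition could look like, written in the style of the gauges in the diff above. The gauge name and the config accessor are assumptions for discussion, not an agreed API:

```ruby
# Hypothetical: register a single cluster-wide gauge only when cluster
# metrics collection is enabled (exact config accessor is an assumption).
Yabeda.configure do
  group :sidekiq do
    if ::Yabeda::Sidekiq.config.collect_cluster_metrics
      gauge :workers_count, tags: [], comment: "The total number of job-processing threads across all Sidekiq processes."
    end
  end
end
```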

3 participants