Is your feature request related to a problem? Please describe.
I maintain the Odigos project, which deploys collectors in Kubernetes environments to set up a telemetry pipeline for collecting, processing, and exporting data to various destinations.
Odigos uses a two-layer collector design:
DaemonSets (Node-Level Collectors): Handle telemetry locally on each node.
Cluster Collectors: Auto-scaled Deployments for centralized processing.
We use node-level collectors so that applications export data locally, offloading concerns such as batching, retries, buffering, and cluster-wide networking from users' applications. However, the pipeline can come under pressure in specific conditions:
Downstream Backpressure: If a downstream component refuses data, queues grow, leading to increased memory and CPU usage.
Data Bursts: Sudden traffic spikes may overwhelm node collectors before cluster collectors scale.
Bugs or Configuration Issues: Misconfigurations or unusual data patterns (e.g., very large spans) can cause the collector to handle the load inefficiently.
Our objective is to buffer and retry within the collectors during transient failures or bursts, preventing backpressure from reaching users' applications. However, once memory pressure builds up, we want to avoid returning retryable errors to applications, since the resulting client-side buffering and retries could inadvertently increase their resource usage.
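For context, a short sketch of the error semantics in play: in the collector's consumer model, a plain error returned from a consumer is treated as retryable, while one wrapped with consumererror.NewPermanent is non-retryable, so receivers report a non-retryable status to the client. The memory limiter currently returns a plain (retryable) error, which is what prompts clients to buffer and retry. The snippet below only illustrates this distinction and is not part of the proposal:

```go
package main

import (
	"errors"
	"fmt"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

func main() {
	// A plain error is treated as retryable by receivers.
	retryable := errors.New("data refused due to high memory usage")
	// Wrapping it marks it permanent, i.e. clients should not retry.
	permanent := consumererror.NewPermanent(retryable)

	fmt.Println(consumererror.IsPermanent(retryable)) // false: client may retry
	fmt.Println(consumererror.IsPermanent(permanent)) // true: client should drop
}
```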
Describe the solution you'd like
To address this, I propose enhancing the Memory Limiter Processor with a new configuration option (a sketch follows this list):
New Option: Introduce a boolean flag to control whether the processor should drop data instead of returning retryable errors during memory pressure.
Default Behavior: Maintain the current behavior (returning retryable errors).
Opt-In Behavior: When enabled, the processor would drop data under memory pressure rather than propagating errors back to applications.
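As a rough illustration of the shape this could take, the option might be a new field on the processor's Config struct. The name drop_on_memory_pressure and the abbreviated field set are hypothetical, not a settled API:

```go
package memorylimiterprocessor

import "time"

type Config struct {
	// Existing options (abbreviated).
	CheckInterval  time.Duration `mapstructure:"check_interval"`
	MemoryLimitMiB uint32        `mapstructure:"limit_mib"`

	// DropOnMemoryPressure, when true, makes the processor silently drop
	// data under memory pressure instead of returning a retryable error.
	// Defaults to false, preserving the current behavior.
	DropOnMemoryPressure bool `mapstructure:"drop_on_memory_pressure"`
}
```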
This change involves adding the new configuration option and updating the processor's per-signal consume logic (traces, metrics, and logs), enabling it to either refuse data or drop it based on the setting; a sketch of that change follows.
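Here is a minimal sketch of what the per-signal branch could look like, assuming hypothetical internal names (memoryChecker, MustRefuse) and using processorhelper's skip-processing sentinel as one plausible way to drop a batch without surfacing an error; the real implementation may differ:

```go
package memorylimiterprocessor

import (
	"context"
	"errors"

	"go.opentelemetry.io/collector/pdata/ptrace"
	"go.opentelemetry.io/collector/processor/processorhelper"
)

// errDataRefused mirrors the retryable error returned today.
var errDataRefused = errors.New("data refused due to high memory usage")

// memoryChecker is a stand-in for the internal component that tracks
// memory usage; only the check used here is modeled.
type memoryChecker interface {
	MustRefuse() bool
}

type memoryLimiterProcessor struct {
	memLimiter           memoryChecker
	dropOnMemoryPressure bool // the proposed option
}

// processTraces shows the proposed branch; logging/metrics for dropped
// data are omitted for brevity. Equivalent branches would be added for
// the metrics and logs signals.
func (p *memoryLimiterProcessor) processTraces(_ context.Context, td ptrace.Traces) (ptrace.Traces, error) {
	if p.memLimiter.MustRefuse() {
		if p.dropOnMemoryPressure {
			// Drop the batch without surfacing an error to the caller;
			// processorhelper treats ErrSkipProcessingData as "consume
			// succeeded, but pass nothing downstream".
			return td, processorhelper.ErrSkipProcessingData
		}
		// Current behavior: a retryable error that propagates back
		// through the receiver to the exporting application.
		return td, errDataRefused
	}
	return td, nil
}
```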
Describe alternatives you've considered
I am open to other approaches to configure the collector to drop data under pressure instead of returning retryable errors. Feedback on the best way to achieve this behavior is welcome.
Additional context
If there is support for this feature, I am willing to contribute by creating a PR.