Is your feature request related to a problem? Please describe.
The collector's components should be observable. Intuitively, there should be some instrumentation which can be applied to all components in a universal and uniform way.
Goals of this proposal:
Articulate a mechanism which enables us to automatically capture signals from all pipeline components.
Define attributes that are (A) specific enough to describe individual component instances and (B) consistent enough for correlation across signals.
Define specific metrics for each kind of pipeline component.
Define specific spans for processors and connectors.
Define specific logs for all kinds of pipeline component.
Describe the solution you'd like
Mechanism
The mechanism of telemetry capture should be external to components. Specifically, we should inject instrumentation at each point where a component passes data to another component, and at each point where a component consumes data from another component. In terms of the component graph, this means that every edge in the graph will have two layers of instrumentation: one for the producing component and one for the consuming component. Importantly, each layer generates telemetry which is ascribed to a single component instance, so by having two layers per edge, we can describe both sides of each handoff independently. In the case of processors and connectors, the appropriate layers can act in concert (e.g. record the start and end of a span).
Attributes are discussed in detail in #11179, though only in the context of metrics. To restate the consensus from that issue, all signals should use the following attributes:
Receivers
receiver: The component ID
otel.signal: logs, metrics, traces, OR ALL
Processors
processor: The component ID
otel.signal: logs, metrics, traces, OR ALL
pipeline: The pipeline ID, OR ALL
Exporters
exporter: The component ID
otel.signal: logs, metrics, traces, OR ALL
Connectors
connector: The component ID
otel.signal: logs->logs, logs->metrics, logs->traces, metrics->logs, metrics->metrics, etc, OR ALL
Notes: The use of ALL is based on the assumption that components are instanced either in the default way, or, as a single instance per configuration (e.g. otlp receiver).
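The attribute scheme above could be assembled by a small helper along these lines. This is a sketch only: the function name `Attributes`, its parameters, and the choice to default empty values to `ALL` are assumptions made for illustration, not something settled in #11179.

```go
package main

import "fmt"

// All is the placeholder value used when a component instance is shared
// across signals or pipelines.
const All = "ALL"

// Attributes builds the attribute set for one component instance.
// kind is one of "receiver", "processor", "exporter", or "connector" and
// doubles as the attribute key that carries the component ID.
// Only processors carry a "pipeline" attribute.
func Attributes(kind, id, signal, pipeline string) (map[string]string, error) {
	switch kind {
	case "receiver", "processor", "exporter", "connector":
	default:
		return nil, fmt.Errorf("unknown component kind %q", kind)
	}
	if signal == "" {
		signal = All // assumption: an unspecified signal means a shared instance
	}
	attrs := map[string]string{kind: id, "otel.signal": signal}
	if kind == "processor" {
		if pipeline == "" {
			pipeline = All
		}
		attrs["pipeline"] = pipeline
	}
	return attrs, nil
}

func main() {
	proc, _ := Attributes("processor", "batch", "logs", "logs/primary")
	conn, _ := Attributes("connector", "count", "traces->metrics", "")
	recv, _ := Attributes("receiver", "otlp", "", "")
	fmt.Println(proc)
	fmt.Println(conn)
	fmt.Println(recv)
}
```

Because the same attribute set would be attached to metrics, spans, and logs, a single helper of this shape is one way to keep the signals correlatable.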
Metrics
There are two straightforward measurements that can be made on any pdata:
A count of "items" (spans, data points, or log records). These are low cost but broadly useful, so they should be enabled by default.
A measure of size, based on ProtoMarshaler.Sizer(). These are high cost to compute, so by default they should be disabled (and not calculated).
The location of these measurements can be described in terms of whether the data is "incoming" or "outgoing".
Incoming measurements are attributed to the component which is consuming the data.
Outgoing measurements are attributed to the component which is producing the data.
An outcome attribute with possible values success and failure should be automatically recorded, corresponding to whether or not the function call returned an error.
Finally, the kind of component being described should be captured as an attribute, since other aspects of the metrics are generally the same (except instance-specific attributes).
With all of the above in mind, the following metrics should be captured:
otelcol_component_incoming_items:
  enabled: true
  description: Number of items passed to the component.
  unit: "{items}"
  sum:
    value_type: int
    monotonic: true
otelcol_component_outgoing_items:
  enabled: true
  description: Number of items emitted from the component.
  unit: "{items}"
  sum:
    value_type: int
    monotonic: true
otelcol_component_incoming_size:
  enabled: false
  description: Size of items passed to the component.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
otelcol_component_outgoing_size:
  enabled: false
  description: Size of items emitted from the component.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
Spans
When adding instrumentation layers to processors and connectors, we should be able to record the start and end of a span.
Logs
Metrics and spans provide most of the observability we need, but there are some gaps which logs can fill. For example, we can record spans for processors and connectors, but logs are useful for capturing precise timing as it relates to data produced by receivers and consumed by exporters. Additionally, although metrics describe the overall item counts, it is helpful in some cases to record more granular events. For example, if an outgoing batch of 10,000 spans results in an error but 100 batches of 100 spans succeed, this may be a matter of batch size that can be detected by analyzing logs, while the corresponding metric reports only that a 50% success rate is observed.
For security and performance reasons, I do not believe it would be appropriate in any case to log the contents of telemetry.
We should also consider that it's very easy for logs to become too noisy. Even if errors are occurring frequently in the data pipeline, many users may only be interested in them if they are not handled automatically.
With the above considerations, I propose only that we add a DEBUG log for each individual outcome. This should be sufficient for detailed troubleshooting but does not impact users otherwise.
In the future, it may be helpful to define triggers for reporting repeated failures at a higher severity level, e.g. N failures in a row, or a moving average success %. For now, the criteria and necessary configurability are unclear to me, so I mention this only as an example of future possibilities.
Additional context
This proposal pulls from a number of smaller scoped issues and PRs:
Thanks for opening this issue @djaglowski, would you mind opening an RFC that includes the telemetry as described in this issue? This would be great to have documented for future us.