
OTEL_SEMCONV_STABILITY_OPT_IN latency buckets too big #3011

Open
bergur88 opened this issue Nov 15, 2024 · 0 comments
Labels
bug Something isn't working

Comments


Describe your environment

Services are built with the Docker image python:3.10.15-slim and run on Kubernetes, using:
opentelemetry-api==1.27.0
opentelemetry-sdk==1.27.0
opentelemetry-propagator-b3==1.27.0
opentelemetry-exporter-otlp-proto-grpc==1.27.0
opentelemetry-instrumentation-fastapi==0.48b0
opentelemetry-instrumentation-aiohttp-client==0.48b0
opentelemetry-instrumentation-asyncpg==0.48b0
opentelemetry-instrumentation-psycopg==0.48b0
opentelemetry-instrumentation-psycopg2==0.48b0
opentelemetry-instrumentation-requests==0.48b0
opentelemetry-instrumentation-logging==0.48b0
opentelemetry-instrumentation-system-metrics==0.48b0
opentelemetry-instrumentation-grpc==0.48b0

What happened?

I'm using the OTEL_SEMCONV_STABILITY_OPT_IN feature (currently running http/dup) and am seeing some odd results for HTTP latencies. The new metric appears to use the same bucket boundaries as the old one, but shouldn't the buckets be smaller now that the unit has changed from milliseconds to seconds? With the lowest bucket at 5 seconds, the histogram isn't particularly useful: percentiles calculated from my metrics show a p99 of 5 seconds for most of my services/paths, which is not accurate.

The Node.js and .NET SDKs override the default buckets with saner values.

[Two attached screenshots: Grafana panels of the old and new duration histograms.]

The images show the same metric over the same time window for the same label set, rendered as histograms; the older one is more granular and useful:
sum(rate(http_server_duration_milliseconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)
sum(rate(http_server_request_duration_seconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)

Steps to Reproduce

set OTEL_SEMCONV_STABILITY_OPT_IN="http/dup"

The result can then be visualized in Grafana with queries like:
sum(rate(http_server_duration_milliseconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)
sum(rate(http_server_request_duration_seconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)

Expected Result

I expected to see the same percentiles for my services/paths using the semantic metrics.

Actual Result

The new metrics are skewed towards 5 seconds because of the bucket sizes.

Additional context

No response

Would you like to implement a fix?

None

@bergur88 bergur88 added the bug Something isn't working label Nov 15, 2024