
OTEL_SEMCONV_STABILITY_OPT_IN latency buckets too big #3011

Open
bergur88 opened this issue Nov 15, 2024 · 0 comments
Labels
bug Something isn't working

Comments


Describe your environment

Services are built with the Docker image python:3.10.15-slim and run on Kubernetes, using:
opentelemetry-api==1.27.0
opentelemetry-sdk==1.27.0
opentelemetry-propagator-b3==1.27.0
opentelemetry-exporter-otlp-proto-grpc==1.27.0
opentelemetry-instrumentation-fastapi==0.48b0
opentelemetry-instrumentation-aiohttp-client==0.48b0
opentelemetry-instrumentation-asyncpg==0.48b0
opentelemetry-instrumentation-psycopg==0.48b0
opentelemetry-instrumentation-psycopg2==0.48b0
opentelemetry-instrumentation-requests==0.48b0
opentelemetry-instrumentation-logging==0.48b0
opentelemetry-instrumentation-system-metrics==0.48b0
opentelemetry-instrumentation-grpc==0.48b0

What happened?

I'm using the OTEL_SEMCONV_STABILITY_OPT_IN feature (currently running http/dup) and am seeing some odd results for HTTP latencies. The new metric appears to use the same bucket boundaries as the old one, but shouldn't the buckets be smaller now that the unit has changed from milliseconds to seconds? With the lowest bucket at 5 seconds, the histogram isn't particularly useful: percentiles calculated from my metrics show a p99 of 5 seconds for most of my services/paths, which is not accurate.

The Node.js and .NET SDKs override the default buckets with saner values.

[Two attached screenshots: Grafana panels of the old and new duration histograms.]

The images show the same metric over the same time window for the same label set, rendered as histograms; the older one is more granular and useful:
sum(rate(http_server_duration_milliseconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)
sum(rate(http_server_request_duration_seconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)

Steps to Reproduce

set OTEL_SEMCONV_STABILITY_OPT_IN="http/dup"

The result can then be visualized in Grafana with queries like:
sum(rate(http_server_duration_milliseconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)
sum(rate(http_server_request_duration_seconds_bucket{app="x", environment="dev"}[$__rate_interval])) by (le)

Expected Result

I expected to see the same percentiles for my services/paths using the semantic metrics.

Actual Result

The new metrics are skewed towards 5 seconds because of the bucket sizes.

Additional context

No response

Would you like to implement a fix?

None

@bergur88 bergur88 added the bug Something isn't working label Nov 15, 2024