Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Critical path tracing shows the wrong critical path for async script #2371

Open
magicmark opened this issue Jul 7, 2024 · 13 comments
Labels

Comments

@magicmark
Copy link

magicmark commented Jul 7, 2024

👋 I've been on a bit of journey trying ot understand how critical path tracing works. I'm still not 100% so apologies if this is wrong, but would appreciate y'alls input :)

Full context here open-telemetry/opentelemetry-specification#4079

Anyway, I was able to come up with the following minimal repro:

image

This visualization appears to be wrong ^

(JSON)

Per the following code (which produces this trace), /get foo3 is not blocked by /get foo2:

async def make_fake_network_request(resource_name, length):
    with tracer.start_as_current_span(f'get /{resource_name}'):
        await asyncio.sleep(length)

async def foo1():
    await make_fake_network_request("foo1", 0.1)
    await asyncio.sleep(0.1)
    await foo3()

async def foo2():
    await make_fake_network_request("foo2", 0.2)

async def foo3():
    await make_fake_network_request("foo3", 0.3)

@tracer.start_as_current_span("main")
async def main():
    await asyncio.gather(foo1(), foo2())

asyncio.run(main())
full code (including imports)
import asyncio

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    SimpleSpanProcessor,
    ConsoleSpanExporter,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("async_example")

async def make_fake_network_request(resource_name, length):
    with tracer.start_as_current_span(f'get /{resource_name}'):
        await asyncio.sleep(length)

async def foo1():
    await make_fake_network_request("foo1", 0.1)
    await asyncio.sleep(0.1)
    await foo3()

async def foo2():
    await make_fake_network_request("foo2", 0.2)

async def foo3():
    await make_fake_network_request("foo3", 0.3)

@tracer.start_as_current_span("main")
async def main():
    await asyncio.gather(foo1(), foo2())

asyncio.run(main())

I understand that the algorithm walks backwards, but i'm guessing it can only do so heuristically? Or is there something else i'm missing here? (I don't know if we provide the "previous sibling span id"? Is this an issue with the spec itself (hence the other issue)?)

Thanks!

Jaeger backend version

1.58.0

SDK

python = "^3.12"
opentelemetry-api = "^1.25.0"
opentelemetry-exporter-otlp = "^1.25.0"
opentelemetry-sdk = "^1.25.0"

Pipeline

  • Running the prebuilt jaeger-1.58.0-darwin-amd64/jaeger-all-in-one locally on my macbook.
  • Python script pointed at http://localhost:4317

Operating system

MacOS

Deployment model

Running jaeger-all-in-one locally

@magicmark magicmark added the bug label Jul 7, 2024
@magicmark magicmark changed the title [Bug]: Critical path tracing shows the wrong critical path. [Bug]: Critical path tracing shows the wrong critical path for async script Jul 7, 2024
@yurishkuro
Copy link
Member

@dosubot

@yurishkuro
Copy link
Member

All three foo spans are siblings in your trace (which does not exactly match the code structure where, one could argue, foo3 is a child of foo1 because foo1 doesn't succeed without foo3). There is no "right" way to do critical path when you just have equal siblings, Jaeger simply shows how the execution happened. The most basic definition of critical path is:

a segment is on the CP if making the execution of that segment shorter would make the whole request shorter

I believe Jaeger's depiction meets this definition.

What were you expecting to see?

@yurishkuro
Copy link
Member

Also, Jaeger is not claiming that foo3 is blocked by foo2, that's just what your trace is telling us has happened.

@yurishkuro
Copy link
Member

@jkowall Dosu seems not to pick up UI issues, this would've been a good one to LLM.

@magicmark
Copy link
Author

magicmark commented Jul 7, 2024

@yurishkuro

a segment is on the CP if making the execution of that segment shorter would make the whole request shorter

I believe Jaeger's depiction meets this definition.

In this case, making foo2() shorter will not make the whole request shorter. I think this violates the definition?

What were you expecting to see?

The true critical path is foo1() -> foo2(). That would be the critical path I as a human would highlight in the UI. (There would be a bit of unannotated unaccounted for time.)

Jaeger is not claiming that foo3 is blocked by foo2, that's just what your trace is telling us has happened.

Is this what the trace says happened? I looked at the JSON and I don't see that there's any indication that says foo3 is blocked on foo2. The only thing we know is that they're technically all sibling spans of main() in terms of nesting of annotations. (i.e. refType=CHILD_OF. No FOLLOWS_FROM.) But I'm still diving into this for the first time - could you elaborate on where in the JSON that the trace tells jaeger otherwise?

@magicmark
Copy link
Author

Just to elaborate on this:

All three foo spans are siblings in your trace (which does not exactly match the code structure where, one could argue, foo3 is a child of foo1 because foo1 doesn't succeed without foo3).

fwiw yes I agree that in an idealized world, the foo spans might have their own explicit annotations, making this an easier problem to solve. However this is a minimal repro from an issue i'm seeing at scale, where the only thing we have guaranteed annotations for is outbound network requests, in a sea of deeply nested async code - making it difficult to know what was kicked of where by who. This minimal repro is very representative of a real world problem.

@yurishkuro
Copy link
Member

In this case, making foo2() shorter will not make the whole request shorter. I think this violates the definition?

You are basing that on your knowledge of the code, but that's not reflected in the trace. Jaeger can only make decision based on trace information, and in that trace the program was first "doing work" in foo2, then some work in main, then foo3.

could you elaborate on where in the JSON that the trace tells jaeger otherwise?

It's in the timing of the span. First there was work in foo2, then in foo3. That's all that Jaeger knows.

As I mentioned, if you want the trace to more accurately reflect the dependencies, you should have foo3 be a child of foo1. It would then affect how the CP is displayed. When spans are just siblings, Jaeger cannot make any assumptions - the could be running in parallel, OR they could have hidden inter-dependencies (e.g. foo1 and foo3 could be true siblings and yet have a logical dependency foo3 follows-from foo1).

@magicmark
Copy link
Author

magicmark commented Jul 7, 2024

You are basing that on your knowledge of the code, but that's not reflected in the trace.

I think that's precisely the point i'm getting at here!

Jaeger fundamentally doesn't have enough information provided in this trace to accurately confidently say "foo2 is in the critical path and i will highlight it as such".

It's in the timing of the span. First there was work in foo2, then in foo3. That's all that Jaeger knows.

I think this minimal repro shows where this assumption that this is enough to calculate CP breaks down when you have concurrent spans.

When spans are just siblings, Jaeger cannot make any assumptions

Per the above, it looks like Jaeger is making assumptions?

To clarify, I think we're in agreement that:

  • this would be "solved" by more explicit annotations and nesting of spans
  • jaeger can only show what is in the trace - this is more of an issue of what's provided in the trace json

My point is that there doesn't seem to be enough information currently provided in the trace JSON for jaeger to say that "foo2 is in the critical path" - because as we humans can see, it isn't!

Maybe in such cases, Jaeger could color the line differently to highlight this is a heuristic/best guess? I think that would help boost confidence in using this feature (otherwise it's difficult to know when it's accurate or not)

@yurishkuro
Copy link
Member

I'll ask again - what do you expect to see? CP is a heuristic, and yes Jaeger is making assumptions but based on the data present in the span (which eg. shows foo2 was running solo and therefor WAS on the critical path, same as foo3). If either of those weren't in the program the parent span would've been shorter.

@magicmark
Copy link
Author

magicmark commented Jul 7, 2024

CP is a heuristic

Ah I think "CP is a heuristic" is probably the key bit of information i was missing. TIL. Purely looking at the UI this wasn't clear to me that this should be assumed to be a heurstic.

The only thing I'm able to see on this is the popover:

Screenshot 2024-07-07 at 4 41 13 PM

TBH I don't think it's currently super obvious to users that this is a heuristic.

I'll ask again - what do you expect to see?

Again:

Maybe in such cases, Jaeger could color the line differently to highlight this is a heuristic/best guess? I think that would help boost confidence in using this feature (otherwise it's difficult to know when it's accurate or not)

Additionally:

Is this documented somewhere? I took a look on the docs (https://www.jaegertracing.io/docs/1.18/frontend-ui/) and couldn't see anything related to CP (mentioning that it's a heuristic or otherwise) - maybe I couldn't find the page though?

I would either accept the status quo and:

  • add a pointer in this page ^ documenting this feature, clarifying that it's a heuristic
  • color the CP differently where we can't confidently say what is CP or not (in some cases maybe we can?)

OR maybe this solvable with a spec change + smarter SDK? open-telemetry/opentelemetry-specification#4079 related

foo2 was running solo and therefor WAS on the critical path

I could shorten or remove foo2 altogether and it wouldn't make the request any shorter. Based on my understanding and your definition above ("if making the execution of that segment shorter would make the whole request shorter"), foo2 is not on the critical path.

@yurishkuro
Copy link
Member

add a pointer in this page ^ documenting this feature, clarifying that it's a heuristic

we could add a note in the popup

@magicmark
Copy link
Author

magicmark commented Jul 7, 2024

Sounds like a good idea. I would also encourage exploring the idea of coloring the CP differently in cases where there is ambiguity (such as here). In a world where there are only sequential non async spans (and maybe this is provided as an input via the SDK) then the current approach seems sound. In an a microservice environment you may have a mix of both.

@yurishkuro
Copy link
Member

I wouldn't do different coloring. OTEL data model is fundamentally lacking certain causality information, so the algorithm is always a heuristic, since we never know the true causal graph of events.

Feel free to open a PR to add a warning to the tooltip.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants