only kernel stack frames reported inside orbstack docker container on macOS #170

Open
tmm1 opened this issue Sep 29, 2024 · 24 comments
@tmm1

tmm1 commented Sep 29, 2024

i'm using orbstack on an arm64 mac to profile a linux app.

perf works as expected and shows me native symbols from user-land activity

but in devfiler i'm not seeing much come through. it seems connected, there are no errors, but there is very little data. i see some kernel symbols.

how can i debug this further?

@rockdaboot
Contributor

but there is very little data

Is the samples timeline and/or the flamegraph empty or are you missing symbols?

If you see frames from your app without further information (symbols missing):

  • make sure your app is built with debug symbols
  • drag & drop your app into the devfiler window to extract symbols

If that doesn't help, you can enable "dev" mode by double-clicking the icon to the left of the "devfiler" menu entry. You then see some more menu items. Check whether you get gRPC messages and whether the DB stats show entries for TraceEvents, Stacktraces etc., and let us know what you see.

@tmm1
Author

tmm1 commented Sep 30, 2024

the flamegraph is empty:

[screenshot: empty flamegraph]

i'm not seeing much data in the dev menus:

[screenshots: dev menu stats]

@tmm1
Author

tmm1 commented Sep 30, 2024

tried with a UTM.app VM and same thing. then i tried the old devfiler 0.6.0 and it started showing a lot more data! it still has those build-id errors, so i will downgrade to 7d2285e for now

EDIT: the old builds work in the VM but still not inside orbstack. will try inside docker in the VM next

@tmm1
Author

tmm1 commented Sep 30, 2024

in UTM it works on the host:

Linux ubuntu 6.2.0-39-generic #40-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 14 23:07:44 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

but inside docker i get:

ERRO[0000] Failed to load eBPF tracer: failed to load eBPF code: your kernel version 6.2.0 is affected by a Linux kernel bug that can lead to system freezes, terminating host agent now to avoid triggering this bug

inside orbstack the kernel is newer:

Linux orbstack 6.10.11-orbstack-00280-g1304bd068592 #21 SMP Sat Sep 21 10:45:28 UTC 2024 aarch64 GNU/Linux

@tmm1
Author

tmm1 commented Sep 30, 2024

The issue seems specific to running within a container. Only kernel activity is shown. Reproduced via docker exec on Ubuntu VM.

@tmm1 changed the title from "not getting user-level data from macOS orbstack container" to "not getting user-level data when running inside docker containers" on Sep 30, 2024
@rockdaboot
Contributor

The issue seems specific to running within a container. Only kernel activity is shown. Reproduced via docker exec on Ubuntu VM.

Just to clarify, the profiler is a system-wide profiler, so it requires root privileges. Can you confirm that you run the docker container with --privileged?

@rockdaboot
Contributor

Ideally, try running the docker container with something like

docker run --privileged --pid=host -v /etc/machine-id:/etc/machine-id:ro \
-v /var/run/docker.sock:/var/run/docker.sock -v /sys/kernel/debug:/sys/kernel/debug:ro ...
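
For reference, a complete invocation might look like the sketch below; the image name, the devfiler endpoint, and the -collection-agent/-disable-tls flags are assumptions to verify against your build.

# Sketch: hypothetical image name and devfiler endpoint; verify flags against your build.
docker run --rm --privileged --pid=host --net=host \
  -v /etc/machine-id:/etc/machine-id:ro \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /sys/kernel/debug:/sys/kernel/debug:ro \
  my-ebpf-profiler-image \
  /ebpf-profiler -collection-agent=127.0.0.1:11000 -disable-tls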

@tmm1
Author

tmm1 commented Oct 1, 2024

Yes I was using --net=host --privileged=true but I will try those other options

@tmm1
Author

tmm1 commented Oct 1, 2024

Thanks, with those options it works in docker on Ubuntu. I see all the information from the host too which isn't ideal, but better than nothing.

In orbstack there's still only kernel-data, but I assume that's something specific to that environment.

@rockdaboot
Contributor

rockdaboot commented Oct 1, 2024

Thanks, with those options it works in docker on Ubuntu.

Thanks for testing.

I see all the information from the host too which isn't ideal, but better than nothing.

That's exactly what the profiler has been designed for: getting all information from the host while doing continuous profiling. Filtering is assumed to be done on the backend or by the user interface.

But if you think that limiting the view/collection of the profiler is a realistic use case, please open a separate issue with your ideas for discussion.

In orbstack there's still only kernel-data, but I assume that's something specific to that environment.

Maybe someone working on macOS can chime in here.
Could you please reword the GH issue title and possibly provide profiler logs when starting with -v and/or -bpf-log-level 2?
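
For example, something along these lines should capture the requested output (a sketch; the binary path and the tee redirection are assumptions):

# Run with verbose logging and eBPF log level 2, saving the output to a file to attach here.
sudo ./ebpf-profiler -v -bpf-log-level 2 2>&1 | tee ebpf-profiler.log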

@tmm1 changed the title from "not getting user-level data when running inside docker containers" to "only kernel stack frames reported inside orbstack docker container on macOS" on Oct 1, 2024
@leonard520

I'm hitting a similar issue with a container running on K8s. There are only kernel stack frames. The base image of the container is Ubuntu. The agent works in a VM environment; in the VM I am able to see other frames, like Java or Python.

I have granted privileged access to the container:

securityContext:
  allowPrivilegeEscalation: true
  capabilities:
    add:
    - CAP_SYS_ADMIN
  privileged: true

One thing to note is that when the container is up and ebpf-profiler is run for the first time, it fails with the error below:

ERRO[0000] Failed to probe tracepoint: failed to get id for tracepoint: failed to read tracepoint ID for sys_enter_mmap: open /sys/kernel/debug/tracing/events/syscalls/sys_enter_mmap/id: no such file or directory

I fixed it by mounting debugfs and tracefs:

sudo mount -t debugfs none /sys/kernel/debug
sudo mount -t tracefs none /sys/kernel/debug/tracing
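
A quick way to confirm the mounts took effect before re-running the profiler (a sketch; the tracepoint path is the one from the error above):

# Should print a numeric tracepoint ID once tracefs is mounted.
cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_mmap/id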

I've attached the log, captured with the -v option:
ebpf-profiler.log

@rockdaboot
Contributor

@leonard520 Regarding the "Failed to probe tracepoint": Can you please update to the latest ebpf-profiler? The tracepoint check has since been dropped.

@leonard520

@tmm1 I just tried the latest main branch. It still has the error "Failed to probe tracepoint". Does it come from here

@rockdaboot
Contributor

@tmm1 I just tried the latest main branch. It still has the error "Failed to probe tracepoint". Does it come from here

Sorry, my fault. The code change I was referring to is still in progress: #175

@leonard520

leonard520 commented Oct 29, 2024

@rockdaboot Do you have any clue why there are only kernel stack frames in my container environment? Feel free to let me know if you want me to try something.

@rockdaboot
Contributor

@leonard520 I assume that for some reason the unwinder runs into an error. Ideally, we could reproduce this somehow on amd64 (can you?). @fabled, maybe you can have a look at the above ebpf-profiler.log - I don't find anything in there that helps.

@rockdaboot
Contributor

@leonard520 I assume that you use devfiler for visualization. Can you run the profiler with -send-error-frames? You should then see the error frames in red directly under the root frame in the flamegraph. They tell you why the unwinding failed; hopefully that is a hint.

@leonard520

@rockdaboot Today I reproduced the issue again over a longer period. I noticed a lot of log lines like the one below:
DEBU[0006] Failed to get a cgroupv2 ID as container ID for PID 1006390: open /proc/1006390/cgroup: no such file or directory

I ran ps both in the container and on the host worker node. The PID appears to belong to the host worker node rather than the container itself. As a result, the path /proc/1006390/cgroup only exists on the node. I am wondering if this is a problem.
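
One way to check this suspicion is to compare PID namespaces; if the two inode numbers below differ, the container has its own PID namespace and its PIDs will not match the host's (a sketch, not from the original report):

# Inside the profiler container:
readlink /proc/self/ns/pid
# On the host worker node:
readlink /proc/self/ns/pid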

@rockdaboot
Contributor

I am wondering if this is a problem.

No, this is not a problem. I made a test where I changed the code to look into cgroupxxx here, to trigger exactly this error every time. I still see all kinds of frames.

@leonard520

@rockdaboot Thanks for the verification. I did another test in the container and recorded the findings below. I am wondering how the profiler lists all the PIDs to trace.

  1. Check the PID of the Java process in the container:
ps aux | grep java
root         **289**  9.5  0.1 2598068 101724 pts/2  Sl+  03:13   0:03 java -Dserver.port=8888 -jar demo.jar
root         313  0.0  0.0   6508  2244 pts/3    S+   03:13   0:00 grep --color=auto java
  2. Check the PID of the Java process on the node:
root     **1888084**  8.9  0.2 3596600 134912 ?      Sl+  03:13   0:05 java -Dserver.port=8888 -jar demo.jar
  3. Check the PID information in the log. I cannot find PID 289, only 1888084; for 1888084 I found the messages below. It looks to me like the PID isn't being handled.
DEBU[0056] => PID: 1888084
DEBU[0056] = PID: 1888084
DEBU[0056] - PID: 1888084
DEBU[0056] Skip process exit handling for unknown PID 1888084
DEBU[0057] => PID: 1888084
DEBU[0057] = PID: 1888084
DEBU[0057] - PID: 1888084
DEBU[0057] Skip process exit handling for unknown PID 1888084
DEBU[0058] Failed to get a cgroupv2 ID as container ID for PID 1888084: open /proc/1888084/cgroup: no such file or directory
DEBU[0058] => PID: 1888084
DEBU[0058] = PID: 1888084
DEBU[0058] - PID: 1888084
DEBU[0058] Skip process exit handling for unknown PID 1888084

Attaching the full log for reference:
ebpf-bad.log

@Gandem
Contributor

Gandem commented Oct 31, 2024

@leonard520 If I understand correctly you're running the profiler in Kubernetes, right?

In that case, you will need to ensure the following is set:

      hostPID: true # Setting hostPID to true on the Pod so that the PID namespace is that of the host
      containers:
        ...
        securityContext:
          runAsUser: 0
          privileged: true # Running in privileged mode
          procMount: Unmasked # Setting procMount to Unmasked
          capabilities:
            add:
            - SYS_ADMIN # Adding SYS_ADMIN capability

Specifically, setting hostPID: true and procMount: Unmasked should ensure that the PIDs align between the container and the host.
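
A quick sanity check after applying the spec (a sketch; the pod name is hypothetical and assumes cat is available in the image): with hostPID: true, PID 1 inside the container should be the host's init process (typically systemd), not the container entrypoint.

kubectl exec ebpf-profiler-pod -- cat /proc/1/comm
# Expected output on a typical node: systemd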

@leonard520

@Gandem I think your answer clarified my confusion. Thank you very much. After trying to add your spec to my pod, I encountered this error.

INFO[0070] eBPF tracer loaded
ERRO[0080] Failed to handle mapping for PID 21720, file /pause: failed to extract interval data: failed to extract stack deltas from /pause: failure to parse golang stack deltas: failed to load .gopclntab section: EOF
INFO[0090] Attached tracer program
INFO[0090] Attached sched monitor
ERRO[0092] Request failed: rpc error: code = InvalidArgument desc = mapping is missing attributes
ERRO[0096] Request failed: rpc error: code = InvalidArgument desc = mapping is missing attributes

I am wondering if it is related to the pause container being written in Go; I will take a further look.

On the other hand, I’m also considering whether this approach has security risks. Sharing the same PID namespace with the host reduces isolation and increases the potential for container escape.

@felixge
Member

felixge commented Nov 14, 2024

In orbstack there's still only kernel-data, but I assume that's something specific to that environment.

I hit the same issue with orbstack. I suspect the issue is caused by the fact that:

OrbStack runs full-blown Linux machines that work almost exactly like traditional virtual machines

The word "almost" seems to refer to the fact that the VMs are actually containers of their own. At least, that's what the following check from inside the Orb VM suggests:

$ sudo cat /proc/1/environ
container=lxc

In practice this means that the PIDs seen by the VM are not the same PIDs as seen by the kernel, which breaks the eBPF profiler. I tried working around this by running the profiler via docker run --pid=host ..., but I wasn't able to make it work. I suspect the usage of host PIDs is currently not supported by OrbStack.

Anyway, I ended up firing up a real Linux VM in the cloud. If somebody still figures out a way to make OrbStack work, that'd be great, but for now it should probably be considered an unsupported environment.

@christos68k
Member

christos68k commented Nov 15, 2024

@Gandem I think your answer clarified my confusion. Thank you very much. After trying to add your spec to my pod, I encountered this error.
ERRO[0092] Request failed: rpc error: code = InvalidArgument desc = mapping is missing attributes
ERRO[0096] Request failed: rpc error: code = InvalidArgument desc = mapping is missing attributes


I am wondering if it is related to the `pause` container being written in Go; I will take a further look.

This seems like OTLP profiling signal breakage (we're making lots of breaking changes). If you use the latest devfiler and compile an agent using 47e8410, it should work.
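
A rough sketch of building the agent at that commit, assuming the repository is open-telemetry/opentelemetry-ebpf-profiler and that its default make target builds the profiler binary:

git clone https://github.com/open-telemetry/opentelemetry-ebpf-profiler.git
cd opentelemetry-ebpf-profiler
git checkout 47e8410
make    # assumed default target; adjust if the build instructions differ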
