
Reduce eBPF Probe Execution Time #1611

Open
dave-tucker opened this issue Jul 11, 2024 · 1 comment · Fixed by #1628
Labels
kind/feature New feature or request

@dave-tucker
Collaborator

dave-tucker commented Jul 11, 2024

What would you like to be added?

Summary

#1535 adds some microbenchmarks for the eBPF probes.
The sched_switch probe is the most critical, and it currently measures almost 3 microseconds on my system; we'd like to get this down to the order of a few hundred nanoseconds.
This is critical since that code is executed every time a task is scheduled on or off the CPU.

I've profiled the probe as follows:

  • Running kepler
  • Using `perf record` to gather samples
  • Analyzing with `sudo perf annotate -l bpf_prog_eb5663401a302a92_kepler_sched_switch_trace` (you can get the program tag from `bpftool prog`)

In summary, the hot parts of the probe appear to be as follows:

| Code | Percent |
| --- | --- |
| Preparing context | 46.32% |
| `bpf_map_lookup_elem(&processes, &prev_tgid)` | 4.35% |
| `bpf_perf_event_read_value` (cache miss) | 2.90% |
| `bpf_get_current_pid_tgid()` | 2.79% |
| `bpf_map_update_elem(&cpu_instructions, cpu_id, &val, BPF_ANY)` | 2.76% |
| `bpf_map_lookup_elem(&processes, &curr_tgid)` | 2.62% |
| `bpf_map_delete_elem(&pid_time_map, &prev_pid)` | 2.38% |
| `bpf_map_update_elem(&cache_miss, cpu_id, &val, BPF_ANY)` | 1.98% |
| `bpf_ktime_get_ns()` | 1.82% |
| `bpf_perf_event_read_value` (cpu_instructions) | 1.63% |

The time spent preparing the context is a fixed overhead, mostly due to Spectre/Meltdown mitigation code in the kernel that we are not in control of.
Looking at what our own probe code contributes, map operations are the key contributor to the overall probe execution time.

Proposal

Firstly, we will collect only the following information on a sched_switch event:

  1. Timestamp - bpf_ktime_get_ns()
  2. Which CPU this event is for
  3. prev_task->tgid
  4. next_task->tgid
  5. Value of the cpu_cycles hardware counter

Secondly, that information will immediately be sent back to userland using a `BPF_MAP_TYPE_RINGBUF`.

Userland will constantly read from that ring buffer and will be responsible for performing the delta calculations that were previously done in the kernel.

This should bring our eBPF probe execution time down substantially.

Why is this needed?

Previously sampling was used to reduce probe execution time. Per the discussion in #1607 with recent changes to the eBPF probes for correctness, this is no longer yielding any benefit. We do however need to reduce the probe execution time in order to have less impact on the system.

@sthaha
Collaborator

sthaha commented Jul 12, 2024

I like the idea of collecting only the required (raw) information through eBPF and offloading all calculations to userland. Definitely worth a try.
