
Reduce eBPF Probe Execution Time #1611

Open
dave-tucker opened this issue Jul 11, 2024 · 1 comment · Fixed by #1628
Labels
kind/feature New feature or request

@dave-tucker
Collaborator

dave-tucker commented Jul 11, 2024

What would you like to be added?

Summary

#1535 adds some microbenchmarks for the eBPF probes.
The sched_switch probe is the most critical, and it currently measures almost 3 microseconds on my system; we'd like to get this down to the order of a few hundred nanoseconds.
This is critical since that code is executed every time a task is scheduled on or off the CPU.

I've profiled the probe as follows:

  • Running kepler
  • Using `perf record` to gather samples
  • Analyzing with `sudo perf annotate -l bpf_prog_eb5663401a302a92_kepler_sched_switch_trace` (you can get the program tag from `bpftool prog`)

In summary, the hot parts of the probe appear to be as follows:

| Code | Percent |
| --- | --- |
| Preparing context | 46.32% |
| `bpf_map_lookup_elem(&processes, &prev_tgid)` | 4.35% |
| `bpf_perf_event_read_value` (cache miss) | 2.90% |
| `bpf_get_current_pid_tgid()` | 2.79% |
| `bpf_map_update_elem(&cpu_instructions, cpu_id, &val, BPF_ANY)` | 2.76% |
| `bpf_map_lookup_elem(&processes, &curr_tgid)` | 2.62% |
| `bpf_map_delete_elem(&pid_time_map, &prev_pid)` | 2.38% |
| `bpf_map_update_elem(&cache_miss, cpu_id, &val, BPF_ANY)` | 1.98% |
| `bpf_ktime_get_ns()` | 1.82% |
| `bpf_perf_event_read_value` (cpu_instructions) | 1.63% |

The time spent preparing the context is a fixed overhead, mostly due to Spectre/Meltdown mitigation code in the kernel that we are not in control of.
Looking at what our own probe code contributes, map operations are the key contributor to the overall probe execution time.

Proposal

Firstly, we will collect only the following information on a sched_switch event:

  1. Timestamp - bpf_ktime_get_ns()
  2. Which CPU this event is for
  3. prev_task->tgid
  4. next_task->tgid
  5. Value of the cpu_cycles hardware counter

Secondly, that information will immediately be sent back to userland using a `BPF_MAP_TYPE_RINGBUF`.

Userland will constantly read from that ring buffer and will be responsible for performing the delta calculations that were previously done in the kernel.

This should bring our eBPF probe execution time down substantially.

Why is this needed?

Previously sampling was used to reduce probe execution time. Per the discussion in #1607 with recent changes to the eBPF probes for correctness, this is no longer yielding any benefit. We do however need to reduce the probe execution time in order to have less impact on the system.

@sthaha
Collaborator

sthaha commented Jul 12, 2024

I like the idea of collecting only the required (raw) information through eBPF and offloading all calculations to userland. Definitely worth a try.
