#1535 adds some microbenchmarks for the eBPF probes.
The sched_switch probe is the most critical, and it currently measures at almost 3 microseconds on my system - we'd like to get it down to the order of a few hundred nanoseconds.
This is critical since that code is executed every time a task is scheduled on or off the CPU.
I've profiled the probe as follows:
1. Running kepler
2. Using `perf record` to gather samples
3. Analyzing using `sudo perf annotate -l bpf_prog_eb5663401a302a92_kepler_sched_switch_trace` - you can get the tag from `bpftool prog`
In summary the hot parts of the probe appear to be as follows:
- The time spent preparing the context is a fixed overhead, mostly due to Spectre/Meltdown mitigation code in the kernel that we do not control.
- Looking at what our own probe code contributes, map operations are the key contributor to the overall probe execution time.
Proposal
Firstly we're only going to collect the following information on a sched_switch event:
- Timestamp - `bpf_ktime_get_ns()`
- The CPU this event occurred on
- `prev_task->tgid`
- `next_task->tgid`
- The value of the `cpu_cycles` hardware counter
Secondly, that information will immediately be sent back to userland using a `BPF_MAP_TYPE_RINGBUF`.
Userland will continuously read from that ring buffer and will be responsible for performing the delta calculations that were previously done in the kernel.
This should bring our eBPF probe execution time down toward the target of a few hundred nanoseconds.
Why is this needed?
Previously, sampling was used to reduce probe execution time. Per the discussion in #1607, with the recent correctness changes to the eBPF probes this no longer yields any benefit. We do, however, still need to reduce probe execution time in order to have less impact on the system.