
[WIP] Implement arena allocation for VersionVec #12

Open
wants to merge 8 commits into master

Conversation

ibabushkin

I'm submitting this PR alongside my GSoC proposal (see
tokio-rs/gsoc#2), providing two core changes:

  • Building the fringe scheduler on nightly is possible again.
  • VersionVec objects can now be allocated in an arena cleared after each iteration of
    the model checker. The implementation is based on the stub already present.

However, some caveats remain. It is debatable whether integrating the arena into the
`Execution` struct is the right way forward, since passing a mutable reference to the
arena around to allow allocation in various modules is certainly not ergonomic.
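To make the ergonomics issue concrete, here is a rough sketch of the shape this takes; the names (`Arena`, `alloc`, `new_version_vec`) are illustrative stand-ins, not the exact code in this branch.

```rust
use std::ops::Range;

// Placeholder arena: hands out ranges of slots from backing storage that is
// reset after every iteration of the model checker.
struct Arena {
    storage: Vec<usize>,
}

impl Arena {
    fn alloc(&mut self, len: usize) -> Range<usize> {
        let start = self.storage.len();
        self.storage.resize(start + len, 0);
        start..self.storage.len()
    }

    fn clear(&mut self) {
        self.storage.clear();
    }
}

// The arena lives inside the global execution state...
struct Execution {
    arena: Arena,
    // ...rest of the per-execution state...
}

// ...so every function that wants to allocate (here, a stand-in for creating a
// `VersionVec`) needs a `&mut Arena` threaded through to it, which is the
// ergonomics problem described above.
fn new_version_vec(arena: &mut Arena, threads: usize) -> Range<usize> {
    arena.alloc(threads)
}
```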

So far, performance on the workloads I've tested has been worse than expected, and plenty
of other questions need to be discussed:

  • Should other data structures be allocated in the arena?
  • Should the allocation interface be changed?
  • Does a drop-in solution for pointer-bump arena allocation provide a performance benefit
    or some other advantage? (See the sketch after this list.)
  • Why did this implementation of arena allocation for some objects not provide the
    expected speedup?
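
On the third question: a drop-in pointer-bump arena (what crates such as typed-arena or bumpalo provide) boils down to something like the sketch below. Allocation is just advancing an offset, and clearing is O(1). This is an illustration only, not the allocator used in this PR.

```rust
// Minimal pointer-bump arena. A real implementation uses raw pointers or
// interior mutability so multiple allocations can be live at once; this sketch
// ties each allocation to `&mut self` for simplicity.
struct BumpArena {
    buf: Vec<u8>,
    next: usize, // offset of the first free byte
}

impl BumpArena {
    fn with_capacity(cap: usize) -> BumpArena {
        BumpArena { buf: vec![0; cap], next: 0 }
    }

    // `align` must be a nonzero power of two.
    fn alloc(&mut self, size: usize, align: usize) -> Option<&mut [u8]> {
        let start = (self.next + align - 1) & !(align - 1); // round up
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None; // arena exhausted
        }
        self.next = end;
        Some(&mut self.buf[start..end])
    }

    // Clearing never walks the allocations: just reset the bump offset.
    fn clear(&mut self) {
        self.next = 0;
    }
}
```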

Additionally, this patch set let me discover a resource leak in one of the tokio-sync
tests: the arena refuses to clear while `Slice`s allocated on behalf of the tested
program are still live, so if the tested program leaks objects mocked by loom, the test
crashes when loom tries to clear the arena after the leaky iteration.
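
The arena-side failure mode looks roughly like this (a sketch with assumed names; the accounting in this branch differs in detail):

```rust
use std::cell::Cell;
use std::rc::Rc;

// The arena counts outstanding `Slice`s and refuses to reset while any remain,
// so a `Slice` leaked by the tested program becomes a panic at the end of the
// iteration instead of a silent memory leak.
struct Arena {
    live: Rc<Cell<usize>>,
}

struct Slice {
    live: Rc<Cell<usize>>,
    // ...offset and length into the arena's storage...
}

impl Arena {
    fn alloc(&self) -> Slice {
        self.live.set(self.live.get() + 1);
        Slice { live: self.live.clone() }
    }

    fn clear(&self) {
        // Fires when loom resets the arena after an iteration that leaked.
        assert_eq!(self.live.get(), 0, "arena cleared while Slices are live");
    }
}

impl Drop for Slice {
    fn drop(&mut self) {
        self.live.set(self.live.get() - 1);
    }
}
```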

Here's the relevant function from the
test:

// setup omitted

#[path = "../src/mpsc/list.rs"]
mod list;

#[test]
fn smoke() {
    loom::fuzz(|| {
        let (tx, mut rx) = list::channel();

        // lots of code omitted here...

        // This is necessary to avoid the leak, as no `Drop` impl for `list::Rx` is
        // provided, so the allocated blocks are never returned, causing a leak of objects
        // allocated by loom. On the other hand, providing a `Drop` impl breaks other
        // code.
        unsafe { rx.free_blocks(); }
    });
}

I'm unsure whether this is worth opening a tokio issue for, as the bug is largely
irrelevant in practice and only amounts to leaking (tiny) amounts of memory in a test.

Unsure how this worked before, but the macro used to declare the
scheduler's state never existed in scoped-tls afaict. Fix involves doing
disgusting things to RefCell (so rather hacky).
To allow for cloning objects allocated in the arena, some rearrangements
had to be made, namely moving allocation functions to a place where they
can be used with a shared reference to the arena.

This sounds dangerous, as no synchronization is done. However, it should
be safe: with every scheduler, no true concurrency is present in the
loom runtime, and the only scheduler using multiple threads (`std`)
places the global execution state behind a mutex.

Still, this is somewhat of an opinionated approach to integrating the
arena allocator into loom.
As a first step, allocating clock vectors in the arena seems reasonable
and lets us gauge how much complexity such a change introduces.

It turns out that, at the moment, handling the arena requires passing it
around to any code that should be able to allocate in it; a globally
accessible arena could prove more elegant. Also, not all `VersionVec`
instances are deallocated at the end of each iteration, so `VersionVec`s
can still be heap-allocated as before to cover those cases.
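
One way to express that dual strategy is to let the version storage be either arena-backed or heap-backed. This is a sketch with assumed names, not necessarily how the branch implements it:

```rust
// Stand-in for a slice handle pointing into the arena's storage.
struct ArenaSlice {
    start: usize,
    len: usize,
}

// Version storage can live in the arena (cleared wholesale after every
// iteration) or on the heap (for instances that outlive an iteration).
enum Versions {
    Arena(ArenaSlice),
    Heap(Vec<usize>),
}

struct VersionVec {
    versions: Versions,
}

impl VersionVec {
    // Readers need to know where the data lives; arena-backed reads would go
    // through the arena's storage.
    fn len(&self) -> usize {
        match &self.versions {
            Versions::Arena(s) => s.len,
            Versions::Heap(v) => v.len(),
        }
    }
}
```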
@carllerche
Member

Thanks for submitting the PR.

Could you post before / after benchmark results?

Also, what version of tokio-sync are you using? I know there was a leak bug fixed recently. Being able to detect leaks is an added bonus.

@ibabushkin
Author

ibabushkin commented Apr 8, 2019

Argh, sorry -- I forgot to include all the relevant information here, since it's already in the proposal. Luckily, the relevant bits are available here.

To summarize: the benchmark is the fuzz_semaphore test from tokio-sync at commit 824b7b675990baab87fa3df347fa4a5491f6809b of the tokio repo; results are below. Both versions use the fringe scheduler, and here is a branch that fixes the build for that.

LOOM_MAX_DURATION=10 hyperfine \
  --export-markdown bench.md ./fuzz_semaphore_arena ./fuzz_semaphore_system
| Command | Mean [s] | Min … Max [s] |
|:---|---:|---:|
| `fuzz_semaphore_arena` | 47.372 ± 0.517 | 46.685 … 48.222 |
| `fuzz_semaphore_system` | 42.656 ± 2.242 | 40.781 … 48.764 |

I realize I also forgot to test the serialization feature and accidentally committed some nightly-only stuff, though; I'll fix that shortly.

@carllerche
Member

It looks like the measurement is the total time of execution, but you are also capping it with LOOM_MAX_DURATION (though it is set to 10 seconds...).

The fact that an arena is slower is unexpected. I would dig in and see what the time is spent on w/ a profile.

@ibabushkin
Author

Yeah, that is odd. However, since multiple different tests are run in the binary, the rough ballpark of the runtimes makes sense. I don't yet have an explanation for the slowdown, but I'm working on that.

For now, we only deserialize to heap-allocated `VersionVec`s.
Also, we now have an iterator for arena slices.
@ibabushkin
Author

Aaand I forgot Windows has no mmap!

@ibabushkin
Author

Excuse the force-push, but Windows builds should work now!

* Less inefficient cloning of `Slice`s
* Much more efficient dropping of `Slice`s
@ibabushkin
Author

Alright, most of the performance regression is now gone after optimizing the Drop and Clone impls on Slices. There is some overhead left, and at the moment I am unsure whether we can get rid of it: the remaining slowdown is spread throughout various places and is thus hard to eliminate. However, it may be possible to offset it by moving more objects into the arena, so that the overhead from operations that are slower on arena-allocated objects is made up for elsewhere (there are already local speedups to be seen).
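
For the record, the Clone optimization is of roughly this shape (a sketch with assumed names; the actual diff differs): clone a `Slice` by reserving a new range in the arena and copying the whole payload in one step, rather than element by element.

```rust
// Handle into the arena: just an offset and a length.
#[derive(Clone, Copy)]
struct Slice {
    start: usize,
    len: usize,
}

struct Arena {
    storage: Vec<usize>,
}

impl Arena {
    // Cloning reserves fresh space at the end of the arena and bulk-copies the
    // old contents into it; nothing is freed until the arena is cleared.
    fn clone_slice(&mut self, s: Slice) -> Slice {
        let start = self.storage.len();
        self.storage.extend_from_within(s.start..s.start + s.len);
        Slice { start, len: s.len }
    }
}
```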

@ibabushkin
Author

Just a quick update: I'm currently experimenting with ways to provide arena allocation for the other uses of Vec in the code that perform resizing operations. I'll post results/code once I have a clear picture of the performance properties of the possible implementations.
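
The basic trade-off with a growable, arena-backed vector is sketched below (assumed names, not this PR's code): growing relocates the contents to a fresh, larger range at the end of the arena and simply wastes the old range until the arena is cleared after the iteration.

```rust
struct Arena {
    storage: Vec<usize>,
}

// A growable vector whose elements live in the arena: only an offset, length,
// and capacity are stored in the handle itself.
struct ArenaVec {
    start: usize,
    len: usize,
    cap: usize,
}

impl ArenaVec {
    fn new() -> ArenaVec {
        ArenaVec { start: 0, len: 0, cap: 0 }
    }
}

impl Arena {
    fn push(&mut self, v: &mut ArenaVec, value: usize) {
        if v.len == v.cap {
            // Grow by relocating to the end of the arena; the old range is
            // wasted, but the arena is reset after every iteration anyway.
            let new_cap = (v.cap * 2).max(4);
            let new_start = self.storage.len();
            self.storage.extend_from_within(v.start..v.start + v.len);
            self.storage.resize(new_start + new_cap, 0);
            v.start = new_start;
            v.cap = new_cap;
        }
        self.storage[v.start + v.len] = value;
        v.len += 1;
    }
}
```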

@carllerche
Member

Quick update from me. I’ve been on vacation. I’ll be digging in next week. Expect to hear more then.

@ibabushkin
Author

I'm exploring the possibility of storing the arena in a scoped thread-local variable to buy us some flexibility with respect to where exactly the arena can be used. I'll benchmark a simple arena-backed vector implementation based on that.
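
A sketch of that approach, assuming the scoped-tls crate as a dependency (the arena type and helper names here are placeholders, not this branch's code):

```rust
use std::cell::RefCell;
use std::ops::Range;

// Placeholder arena; the real one would be the bump allocator used by loom.
struct Arena {
    storage: Vec<usize>,
}

scoped_tls::scoped_thread_local!(static ARENA: RefCell<Arena>);

// The arena is only reachable while the iteration's closure runs, so it can be
// cleared unconditionally once the iteration ends.
fn run_iteration<F: FnOnce()>(f: F) {
    let arena = RefCell::new(Arena { storage: Vec::new() });
    ARENA.set(&arena, f);
    arena.borrow_mut().storage.clear();
}

// Any code running inside `run_iteration` can allocate without a `&mut Arena`
// being threaded through to it.
fn alloc(len: usize) -> Range<usize> {
    ARENA.with(|arena| {
        let mut arena = arena.borrow_mut();
        let start = arena.storage.len();
        arena.storage.resize(start + len, 0);
        start..arena.storage.len()
    })
}
```

Since a scoped thread-local is only visible on the thread that set it, this shape is what makes the arena awkward to reach from a scheduler that runs mocked threads on separate OS threads.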

@ibabushkin
Author

Another update: pushing more and more data structures into the arena improves performance. As of now, I am still debugging the implementation, though. Also, I haven't found a nicer way to scale arena usage to more data structures than storing the arena "control block" in a thread-local variable. This has the obvious drawback that the std scheduler is no longer usable.

@carllerche
Member

@ibabushkin Putting it in a thread-local is fine. If we don't have std, we can always use alternate strategies for accessing the arena 👍.
