Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[experiment] gumbo using a malloc arena #2790

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

flavorjones
Copy link
Member

@flavorjones flavorjones commented Feb 7, 2023

What problem is this PR intended to solve?

I've been staring at performance profiling for HTML5 document parsing and can't get past the whopping 26% of time we're spending in the many calls to malloc and free.

Thinking about how we use libgumbo in Nokogiri:

  • we build a gumbo document structure, allocating many small strings and structures along the way
  • we convert that into a libxml2 document structure
  • we free all the memory making up the gumbo data structure, making many calls to free along the way

But: we essentially throw away the gumbo data structure immediately every time. What if we used an ephemeral arena for each gumbo document (or fragment) and just reset the entire arena after use?

This PR is an attempt to see how much time we might save, so it's the simplest thing that could work to let me benchmark:

  • it allocates a big chunk of memory up front (a real implementation would grow the arena)
  • it does not support multi-threaded use (for that we'll need to pass the Arena struct around everywhere which will add significant complexity)

The result: a 20% overall improvement on a large (681kb) real-world doc. Juxtaposed with HTML4 timings for context:

# 656 bytes, 8.5% speedup
Comparison:
malloc html4 parse 656:    18760.2 i/s
arena html5 parse 656:    15995.3 i/s - same-ish: difference falls within error
malloc html5 parse 656:    14731.6 i/s - same-ish: difference falls within error

# 70kb, 30% speedup (?)
Comparison:
malloc html4 parse 70095:      360.7 i/s
arena html5 parse 70095:      290.3 i/s - same-ish: difference falls within error
malloc html5 parse 70095:      223.5 i/s - 1.61x  (± 0.00) slower

# 681kb, 20% speedup
Comparison:
malloc html4 parse 681406:       41.5 i/s
arena html5 parse 681406:       31.7 i/s - same-ish: difference falls within error
malloc html5 parse 681406:       26.2 i/s - same-ish: difference falls within error

# 1.9mb, 15% speedup
Comparison:
malloc html4 parse 1929522:       32.8 i/s
arena html5 parse 1929522:       16.0 i/s - 2.05x  (± 0.00) slower
malloc html5 parse 1929522:       13.9 i/s - 2.36x  (± 0.00) slower

Time spent in memory-related calls drops from 26.3% to 18% in the 681kb document benchmark.

@flavorjones
Copy link
Member Author

@stevecheckoway Do you have any strong feelings about this approach? It feels like it may be worth the additional complexity to try to finish this work up, but don't want to do it if you're horrified. 😄

@stevecheckoway
Copy link
Contributor

I didn't read the commits in detail, but the approach of using an arena allocator makes sense. I'm still confused by the pthread_attr_setschedparam calls though.

Are you sure that's caused by malloc and not the profiling tool getting confused? A cursory glance at glibc's malloc doesn't reveal any direct calls to that function and I can't imagine what thread scheduling would have to do with memory allocation, even in the presence of threads. Not that this matters in the end. If you're getting a real performance improvement, moving to an arena seems reasonable.

@flavorjones
Copy link
Member Author

@stevecheckoway You may be right about the profiler being confused about the particular function, but the profiler is consistent in saying that children of memory-allocation routines take up a lot of time. Here's the stack profile:

philosophy html perf

@flavorjones
Copy link
Member Author

@stevecheckoway I ran the HTML5 parser under cachegrind and did get similar percentages rolled up under malloc-and-friends (but notably the pthread function is not mentioned, that must be a perftools bug).

Here's the top exclusive time functions

I've marked the malloc-related functions with **.

--------------------------------------------------------------------------------
Ir                    file:function
--------------------------------------------------------------------------------
390,423,514 ( 7.83%)  gumbo-parser/utf8.c:read_char [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
366,486,405 ( 7.35%)  ** glibc-2.31/malloc/malloc.c:_int_malloc [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
270,510,622 ( 5.43%)  ** glibc-2.31/malloc/malloc.c:_int_free [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
214,266,390 ( 4.30%)  gumbo-parser/tokenizer.c:gumbo_lex [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
174,424,705 ( 3.50%)  gumbo-parser/string_buffer.c:gumbo_string_buffer_append_codepoint [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
146,082,668 ( 2.93%)  ** glibc-2.31/malloc/malloc.c:malloc [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
141,637,769 ( 2.84%)  gumbo-parser/parser.c:handle_in_body [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
137,864,597 ( 2.77%)  gumbo-parser/parser.c:gumbo_parse_with_options [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
125,220,415 ( 2.51%)  ruby-3.1.2/parse.c:ruby_yyparse [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2]
115,254,958 ( 2.31%)  gumbo-parser/utf8.c:utf8iterator_next [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
 90,970,620 ( 1.83%)  gumbo-parser/tokenizer.c:handle_attr_value_double_quoted_state [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
 81,276,039 ( 1.63%)  gumbo-parser/parser.c:reconstruct_active_formatting_elements [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
 80,979,775 ( 1.62%)  ** glibc-2.31/malloc/malloc.c:malloc_consolidate [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
 79,724,502 ( 1.60%)  gumbo-parser/utf8.h:read_char
 70,040,521 ( 1.41%)  ** glibc-2.31/malloc/malloc.c:free [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
 64,521,678 ( 1.29%)  gumbo-parser/tokenizer.c:finish_token.isra.0 [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
 63,285,175 ( 1.27%)  ruby-3.1.2/parse.y:ruby_yyparse
 53,476,387 ( 1.07%)  ruby-3.1.2/gc.c:objspace_xmalloc0 [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2]
 53,351,571 ( 1.07%)  gumbo-parser/parser.c:get_current_node.isra.0 [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
 48,078,332 ( 0.96%)  ruby-3.1.2/gc.c:ruby_sized_xfree [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2]
 47,606,972 ( 0.96%)  ** glibc-2.31/malloc/malloc.c:malloc_usable_size [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
 44,631,418 ( 0.90%)  gumbo-parser/string_buffer.c:maybe_resize_string_buffer [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
 43,778,475 ( 0.88%)  gumbo-parser/parser.c:insert_text_token.isra.0 [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
 42,218,715 ( 0.85%)  ruby-3.1.2/st.c:find_table_entry_ind [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2]
 42,189,298 ( 0.85%)  ** glibc-2.31/malloc/malloc.c:unlink_chunk.isra.0 [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
 41,857,797 ( 0.84%)  ruby-3.1.2/st.c:rb_st_lookup [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2]
 41,302,563 ( 0.83%)  ruby-3.1.2/gc.c:rgengc_check_relation.part.0 [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2]
 40,823,473 ( 0.82%)  ruby-3.1.2/gc.c:gc_mark_ptr [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2]
 40,756,430 ( 0.82%)  gumbo-parser/tokenizer.c:append_char_to_tag_buffer [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
 37,432,874 ( 0.75%)  ** glibc-2.31/malloc/malloc.c:realloc [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
 36,069,384 ( 0.72%)  gumbo-parser/tokenizer.c:emit_char [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
 32,630,299 ( 0.65%)  gumbo-parser/tokenizer.c:handle_data_state [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
 28,598,943 ( 0.57%)  ** glibc-2.31/malloc/malloc.c:_int_realloc [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
 28,511,268 ( 0.57%)  tmp/x86_64-linux/nokogiri/3.2.1/tmp/x86_64-pc-linux/ports/libxml2/2.10.3/libxml2-2.10.3/xmlstring.c:xmlStrdup [/home/flavorjones/code/oss/nokogiri/lib/nokogiri/nokogiri.so]
 27,633,613 ( 0.55%)  ruby-3.1.2/compile.c:iseq_setup [/home/flavorjones/.rbenv/versions/3.1.2/lib/libruby.so.3.1.2]
 25,140,201 ( 0.50%)  ** glibc-2.31/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc-2.31.so]

@stevecheckoway
Copy link
Contributor

Great! I'm glad there's not some bizarre pthread call happening somewhere.

As I mentioned, using an arena allocator makes a lot of sense here.

Another possibility would be to try to directly produce an xmlDoc from gumbo's parser. On the one hand, that'd be a bit of a shame for gumbo since it'd be tied down to libxml. On the other, cutting out all of those allocations (plus the whole additional tree walk) is probably a big win.

Daweman

This comment was marked as off-topic.

@flavorjones flavorjones added this to the v1.18.0 milestone Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants