Would like to free Exp_pool earlier. #1413
base: master
Conversation
It can get very large during MST parsing, and just takes up RAM that extract-linkages could use. So I'd like to free it, but I cannot, because parsing with nulls requires valid data there.
@ampli any comments or ideas w.r.t. this pull req?
In generation mode we don't use parsing with nulls. A more drastic change would be to disallow 2-step parsing altogether (but in such a case I would leave it as a "test"). In that case we know there will not be a parse with nulls after an attempt to parse with no nulls.
Yeah, I can add a flag. Disallowing two-step parsing is not that "drastic": the tokenization and dictionary lookup happen pretty quickly, so we could just repeat them for each null count. I don't know how to think about the trade-off between the "elegance" of not having to do tokenization several times vs. the "efficiency" of discarding large chunks of RAM earlier. I'm actually surprised that the code was smart enough to know that it did not need to do the dict-lookup again. Link-grammar code is complicated.
By "drastic" I meant to disallow it altogether, not to actually support it by doing tokenization/expression pruning again. Two-step parsing is the traditional way to do the parsing, so not allowing it would invalidate most user-written parsers. On the other hand, re-tokenizing and doing expression-pruning again would currently waste several percent of CPU (and more if non-contiguous-morphology tokenizing is implemented, and more yet if a better expression-pruning is implemented, which naturally takes more CPU but is expected to save more CPU in the power pruning). So I don't recommend at all forcing tokenization for additional sentence parses. But I just now got an idea: we can add a parse option to indicate we are not going to do an additional parse of this sentence, and then the memory pools can be safely released. (Optionally we can add support so that if the user issues a parse with nulls anyway, tokenization and expression pruning will be done again, as you wrote above. But I don't think this is needed if the user has explicitly set this parse option, since an additional parse is not expected.)
How much RAM does it take? I asked this to understand why it might be important to release this memory earlier. (I guess that not releasing the memory, and just reusing it, would be the faster option.)
Note that there is another way to save a lot of RAM: to trace the number of non-zero entries, configure with
The size of Exp_pool is discussed in #1402 (comment) and I reproduce the graph below. For English, the number of Elts stays under 100K, more or less. For MST, it gets up to about 3M. For now, I can live with that; it's not urgent. Also, there are other, non-link-grammar dependencies that alter the size of the MST dictionary. It might get smaller in the future, or it might get bigger. I'm working with a beta-test dataset right now.
It is about the same size, maybe slightly larger than
I guess this might be an OK idea. The only tables that get outrageously large are the mlc and
In principle, it is possible to add a test argument to disable them (they exist to speed things up). I have an active branch in which I added such an argument.
I graphed it. As you can see in either graph, it can grow up to 30 million entries. So it remains under a gigabyte, which is acceptable. For me, anything under 10 gigabytes per thread is acceptable. After that the troubles compound.
I wish. All my runs are fiendishly complex.
If the MST runs become smaller or faster without these pools, then having a test argument to disable them would be worthwhile. Off-topic: for me, having some of the
It seems easy to implement. I can do that.
	if (IS_GENERATION(sent->dict))
		create_wildcard_word_disjunct_list(sent, opts);

	build_sentence_disjuncts(sent, opts->disjunct_cost, opts);

#ifdef NEED_THESE_FOR_NULL_AND_PANIC_PARSING
We can add a parse option to indicate that a second pass is not expected. In that case, we can not only free the Exp_pool but also immediately delete all the pools that are now pool-reused. If a second pass happens anyway (disregarding the value of the said parse option) and there are no expressions, we can just return an error.
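As a rough sketch of how that could look at the spot shown in the diff above, assuming a hypothetical single_pass parse option; the pool_delete() call and the sent->Exp_pool field are used here only for illustration and may not match the actual internals:

	/* Sketch only: the single_pass option is hypothetical, and the exact
	 * pool names and pool_delete() usage are assumptions for illustration. */
	if (opts->single_pass)
	{
		/* The caller has promised that no null-link or panic pass will
		 * follow, so the expression data can be released as soon as the
		 * disjuncts have been built, instead of being kept around for a
		 * possible second pass. */
		pool_delete(sent->Exp_pool);
		sent->Exp_pool = NULL;
	}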
There can only be a second pass if either of these is on:

    null      Allow null links          1 (On)
    panic     Use of "panic mode"       1 (On)

If both are off, then there's no second pass, right?
Yes. But this is a link-parser convention only. The library doesn't know about it. For example, if the library gets a sentence_parse() request with both min_null_count==0 and max_null_count==0, it doesn't know whether a subsequent call with, e.g., min_null_count==1 and max_null_count==sent_length will follow. So maybe we can add a free_memory or even a batch_mode parse option.
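For concreteness, a minimal sketch of that two-step convention using the existing public API (parse_options_set_min_null_count, parse_options_set_max_null_count, sentence_parse, sentence_length); from the library's point of view these are simply two unrelated sentence_parse() requests:

	/* Application-side two-step parsing: the library cannot know, at the
	 * end of the first call, whether the second call will ever happen. */
	parse_options_set_min_null_count(opts, 0);
	parse_options_set_max_null_count(opts, 0);
	int num_linkages = sentence_parse(sent, opts);   /* first pass: no nulls */

	if (num_linkages == 0)
	{
		/* Second pass: allow null links, up to the sentence length. */
		parse_options_set_min_null_count(opts, 1);
		parse_options_set_max_null_count(opts, sentence_length(sent));
		num_linkages = sentence_parse(sent, opts);
	}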
Ah, yes. I forgot what this was about. Perhaps a new struct Parse_Options_s member, bool single_pass?
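Something along those lines could work; purely as an illustration (the field placement, default, and accessor names below are hypothetical, not existing API):

	/* Hypothetical addition to struct Parse_Options_s; the accessors follow
	 * the naming style of the existing parse_options_set_... and
	 * parse_options_get_... functions. */
	struct Parse_Options_s
	{
		/* ... existing fields ... */
		bool single_pass; /* caller promises no further parse of this sentence */
	};

	void parse_options_set_single_pass(Parse_Options opts, bool value)
	{
		opts->single_pass = value;
	}

	bool parse_options_get_single_pass(Parse_Options opts)
	{
		return opts->single_pass;
	}

link-parser could then set this when both its null and panic variables are off, and the library could release Exp_pool (and the pool-reused tables) right after disjunct building.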
I have not yet commented on some of the things you mentioned regarding the possible earlier release of these pools. But I didn't understand the possible reason for freeing them earlier.
I want to keep this particular pull req open, but otherwise put it into "cold storage". Details of how I generate data are changing, how I use that data is changing, and so details of RAM and CPU usage will change as well. Frankly, my head is all clogged up with 1001 other issues right now, and while it might be nice to think the above problem just "went away", I can't really revisit it right now.