Document minimum_probability=0.0 in LDA #2625
base: develop
Conversation
@mpenkov take a look please
Does this actually achieve the goal of the PR? I think the code in ldamodel.py enforces a minimum on the probability: https://github.com/RaRe-Technologies/gensim/blob/2131e3a3b0231c7eb0b0b447539d2ab7c8802633/gensim/models/ldamodel.py#L1313 So even if you specify zero, there's no guarantee that you will get all topics.
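For reference, the enforcing logic being discussed looks roughly like this (a paraphrase of `get_document_topics` in `ldamodel.py`, not the exact source; see the link above):

```python
# Paraphrased from gensim/models/ldamodel.py (get_document_topics):
# the user-supplied value is clamped, so an explicit 0.0 becomes 1e-8,
# and topics at or below that epsilon are then filtered out of the
# returned sparse list. Passing 0.0 therefore does not return all topics.
minimum_probability = max(minimum_probability, 1e-8)
```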
@piskvorky @menshikh-iv I recall discussing this in another ticket/PR... what do you guys think? Personally, I think the enforcing line is misleading and can surprise the user: if the user passes zero (a non-default), they mean zero, not 1e-8.
IIRC the issue was with segfaults: there's a contract in Gensim that sparse vectors (both BoW lists and scipy.sparse) don't contain explicit zeros. Breaking the contract confused scipy.sparse, and it segfaulted at some point: the number of declared elements didn't match the number of actual sparse elements. This is probably fixable: we could move enforcing the contract from the point of creating sparse vectors to the point of consuming them, possibly at some (minor) cost to performance. The solution we've been offering people who "want zeros" is to explicitly convert the sparse vector to a dense one, e.g. with matutils.sparse2full.
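A minimal sketch of that workaround, assuming `lda` is an already-trained `LdaModel` and `bow` is a bag-of-words document (both names are illustrative, not from the thread):

```python
from gensim import matutils

# Densify the inferred (sparse) topic vector so that every topic gets an
# explicit probability entry, including the near-zero ones gensim filters out.
sparse_topics = lda.get_document_topics(bow, minimum_probability=0.0)
dense_topics = matutils.sparse2full(sparse_topics, lda.num_topics)  # length == num_topics
```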
No, that's not a problem, because we're not discussing the training stage; we're talking only about inference.
Not really, because of
You cannot rely on explicit zeros when programming with floats. 1e-8 is meant as an "epsilon": a value small enough that users don't care. A topic with probability 0.00000001 is as good as probability 0; there is no practical difference (and if your pipeline relies on such a difference being significant, something is wrong with the pipeline).
Yeah, that's what this ticket is about. But keeping zeros in sparse representations isn't the solution, at least not naively like this.
It is a problem, for the reasons given. The "training stage" has nothing to do with it.
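To illustrate the floats point above with a toy example (not from the original thread):

```python
# A value that is zero analytically is often a tiny nonzero residual in
# practice, which is why an epsilon threshold is used instead of an exact
# comparison against 0.0.
a = 0.1 + 0.2 - 0.3
print(a == 0.0)  # False: a is 5.551115123125783e-17
print(a < 1e-8)  # True: an epsilon threshold treats it as zero
```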
Ask @mpenkov for a concrete practical example (we discussed that issue a long time ago, based on some piece of code); when your model is trained on a small corpus, this can be important.
To me, this sounds like a good solution (in case I set
I don't get your point. What exactly is the problem with scipy, in your opinion, when we infer a vector? Maybe a code example?
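Not from the original thread, but here is a hypothetical illustration of the contract mismatch described above: scipy.sparse counts *stored* elements, and an explicit zero makes that count disagree with the actual number of nonzeros.

```python
import numpy as np
from scipy import sparse

# Built from a dense array: scipy stores only the true nonzeros.
m = sparse.csr_matrix(np.array([[0.7, 0.0, 0.3]]))
print(m.nnz)  # 2

# Built from raw (data, indices, indptr): an explicit zero gets stored.
m_zero = sparse.csr_matrix(([0.7, 0.0, 0.3], [0, 1, 2], [0, 3]), shape=(1, 3))
print(m_zero.nnz)              # 3 -- stored elements, including the explicit zero
print(m_zero.count_nonzero())  # 2 -- actual nonzero values

# Any consumer that assumes nnz == "number of nonzeros" is now wrong,
# which is the kind of mismatch the no-explicit-zeros contract prevents.
```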
@menshikh-iv @piskvorky any ideas on how I can close the PR?
Documentation for using the minimum_probability parameter to obtain the full topic distribution. Fixes #2489.
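A minimal end-to-end sketch of the behaviour this PR documents (the toy corpus and variable names are invented for illustration):

```python
from gensim import matutils
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Tiny toy corpus, just enough to train a model.
texts = [["human", "interface", "computer"], ["graph", "trees", "users"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus, id2word=dictionary, num_topics=5, random_state=0)

# Even with minimum_probability=0.0, topics below the internal epsilon
# may be omitted from the sparse result (see the discussion above)...
sparse_topics = lda.get_document_topics(corpus[0], minimum_probability=0.0)

# ...so densify to get an explicit probability for every topic.
dense_topics = matutils.sparse2full(sparse_topics, lda.num_topics)
print(len(dense_topics))  # 5, one entry per topic
```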