[WIP] Implement position-dependent weighting to fastText #2905
base: develop
Conversation
The code currently segfaults immediately after building the vocab. @gojomo, I would appreciate a quick glance over the changes to see if there is something obvious I have missed before I start debugging the Cython code.
I don't feel familiar enough with what should be happening to proofread the kinds of indexing operations that are likely at fault here, so I'd suggest getting right to the deeper debugging. A couple of style comments:
Sorry my review isn't more substantive.
Thank you for the review.
Just for documentation: the original paper does not investigate this, but the likely outcome is that the positional vectors will grow in magnitude close to the center of the window, so that the impact of more distant words is attenuated. An interesting experiment would be to use just a single scalar weight per position instead of a full vector.
Since I am not too experienced with …
I was thinking that if this parameter ever appears in FB's code base, …
Good point. Depending on the font, …
It's plenty substantive, thanks. I have experience with C and C++ programming, but I am new to Cython.
Force-pushed from 27ba384 to dc56ae2.
The issue was that I assumed that …

Currently, I am training a model to see if I can reproduce the experimental results from the original paper, as discussed in issue #2840. If not, it's back to playing Lego with Cython. Otherwise, I will be adding my experimental Jupyter Notebook to the PR. So far, it seems that the new code is up to 8× slower than the existing CBOW code. Further TODOs include:
Dense updates of positional weights are a poor fit for Hogwild. A single worker may be required for an optimal rate of convergence. Not sure how large of an issue this will be in practice. If it is, perhaps updating only one weight vector in each minibatch could be helpful.
It also seems that unless we reduce the gradients to the positional vectors, both positional and word vectors tend to NaN. I am currently exploring this issue using positional weight scalars as opposed to positional weight vectors, because the latter are currently too computationally expensive to iterate on quickly. When I update positional weights by their full gradients (44e1d83), we end up with both positional weights and word vectors full of NaNs. When I update positional weights by a mean gradient across all dimensions (3233115), believable positional weight vectors that amplify the effect of words close to the center word are trained.
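For intuition, a minimal NumPy sketch of the two update rules being contrasted (full per-dimension gradient vs. mean gradient across dimensions), assuming the hidden layer is the mean of element-wise products of word and positional vectors; this is an illustration, not the PR's Cython code, and all names and sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, window = 4, 2                                    # toy sizes
ctx_vecs = rng.uniform(-0.5 / dim, 0.5 / dim, (2 * window, dim))  # context word vectors
pos_vecs = np.ones((2 * window, dim))                 # positional vectors initialized to ones
out_vec = rng.normal(0.0, 0.1, dim)                   # one output (negative-sampling) vector
alpha, label = 0.025, 1.0

# forward pass: hidden layer = mean of element-wise products (position-dependent weighting)
hidden = (ctx_vecs * pos_vecs).mean(axis=0)
pred = 1.0 / (1.0 + np.exp(-hidden @ out_vec))
g = (label - pred) * out_vec                          # gradient w.r.t. the hidden layer

grad_ctx = g * pos_vecs / len(ctx_vecs)               # gradient w.r.t. each context word vector
grad_pos_full = g * ctx_vecs / len(ctx_vecs)          # full per-dimension gradient (cf. 44e1d83)
grad_pos_mean = grad_pos_full.mean(axis=1, keepdims=True)  # mean across dimensions (cf. 3233115)

ctx_vecs += alpha * grad_ctx
pos_vecs += alpha * grad_pos_full                     # or: alpha * grad_pos_mean
```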
Analogy scores aren't everything - I've seen cases where a word2vec model was measurably better on the real classification task for which it was needed, even though it was worse for analogies. But, I do see the original PDW paper claimed an improvement on analogies. However, it looks like the only specific PDW-vs-non-PDW results they report (Table 2 in Section 4) are:
If you're testing just a single scalar per position, rather than a vector (that might allow dimensions of the vectors to specialize in 'near' or 'far' influences), it strikes me you may not be testing the same effect at all, and I wouldn't necessarily expect the same results. (And, an analogy difference of just ~1.31% might just be run-to-run jitter.) Looking at the weights your experiment learned, it seems only the 'next-door' words have very-different weights (>2x). Perhaps if there is a benefit, simply adopting some simple weighting that's more near-word-heavy than the linear dropping via the reduced_windows trick could capture most of it.
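As a rough illustration of the contrast described above, a short sketch comparing the linear drop-off you get when the effective window is drawn uniformly from {1, …, window} (the reduced_windows trick) with one example of a fixed, near-word-heavy weighting; the harmonic weights are an arbitrary choice for illustration, not something from the paper or this PR:

```python
import numpy as np

window = 5
distances = np.arange(1, window + 1)

# expected inclusion probability of a context word at distance d when the
# effective window is drawn uniformly from {1, ..., window}: linear drop-off
linear = (window - distances + 1) / window

# one possible fixed, near-word-heavy weighting (purely illustrative)
harmonic = 1.0 / distances

for d, lin, har in zip(distances, linear, harmonic):
    print(f"distance {d}: linear {lin:.2f}, harmonic {har:.2f}")
```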
@gojomo To reinforce the argument for changing the initialization: in CBOW, the context word embeddings are averaged and then passed to the hidden layer. Whatever the initial distribution (normal or uniform), the average tends to the normal distribution by the central limit theorem. Therefore, the distribution is unimportant as long as the variance of the hidden-layer input stays the same. We don't want to initialize, e.g., the positional embeddings to ones, since that makes the gradient updates of the word embeddings blow up, as discussed in #2905 (comment). Instead, we want to initialize both positional and word embeddings with the same distribution (either uniform, or normal, or ...) whose product has the same variance (1 / fanout) as the initial word embeddings in vanilla fastText.
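A quick numerical check of this factorization argument (illustrative only; the target variance is a stand-in for whatever the vanilla initialization uses, and the normal distribution is just one of the options mentioned):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 300
target_var = 1.0 / dim             # stand-in for the desired variance of the product

# For independent zero-mean entries, Var(word * positional) = Var(word) * Var(positional),
# so drawing both factors with std = target_var ** 0.25 gives the product the target variance.
std = target_var ** 0.25
word = rng.normal(0.0, std, size=1_000_000)
positional = rng.normal(0.0, std, size=1_000_000)

print(np.var(word * positional))   # empirical variance of the product
print(target_var)                  # should be close to the value above
```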
If the distribution is unimportant, then faithfully matching the actual FB fastText initialization, rather than doing something different, would seem the way to go. Somewhat relatedly: in the current refactoring of the initialization for other purposes (#2955 & #2975), I've noticed for the first time that Google's word2vec initializes vectors differently than FB's fastText.
Thank you for the measurements. I wonder, compared to training time, would the difference in initialization time be significant?
To be clear, the switch to normal distribution is not "to be different", but because no reference implementation is available and the uniform distribution is difficult to factorize as discussed in #2905 (comment). We need to factorize, because positional weighting multiplies word and positional vectors before the hidden layer, and it is their product that needs to have the proper distribution.
Interesting! I can't say I know why the choice was made, but it is likely to decrease the initial effective learning rate in word2vec compared to fastText.
Wouldn't initializing the new positional weights however they need to be initialized be sufficient? Then, word-vector/ngram-vector initialization can remain conventional. Separately, even in the absence of a PDW reference implementation, can you ask the PDW paper authors what they actually did, and whether they changed the word-vector initialization in any way?
That's actually what the current code does: word vectors are initialized with uniform distribution and positional vectors are initialized to ones. This is nice and intuitive, but it leads to exploding gradients of word vectors. Unless we arbitrarily decrease the gradients, the word vectors tend to NaN. Changing the initialization, at least when position-dependent weighting is enabled, would allow us to avoid the arbitrary gradient clipping.
I asked on October 5, but I have received no response yet.
We held a seminar session regarding the position-dependent weighting with our research group yesterday.
Force-pushed from 76a6da8 to 109caad.
Force-pushed from 109caad to 94a57ff.
@gojomo I bring good news: I got rid of the gradient clipping hack without changing the initialization, i.e. we are keeping the vanilla fastText initialization for word and n-gram vectors and we initialize the positional vectors to ones. The exploding gradients were related to improper normalization of gradient updates to positional weights, which I have fixed in 94a57ff. Since the gradient clipping hack decreased the learning rate for positional weights, we see another 5% increase in the English word analogy task accuracy*:
I am currently re-running the experiments.
That looks awesome. Do I understand correctly that the "improper normalization" is present in the current Gensim too?
The improper normalization is only related to this PR and the positional vectors, i.e. there should be no issue with either the current Gensim or FB's fastText.
Force-pushed from fa9dfcf to 94a57ff.
I re-ran the experiments. We now reach a higher maximum accuracy* on the English word analogy task and never get below the accuracy* of vanilla fastText. I am currently re-running the remaining experiments.
@piskvorky @gojomo In two upcoming articles from our research group, we have compared different initializations of positional weighting on the English word analogy task. The results show that although the simplest initialization option discussed in #2905 (comment) (keeping the default initialization for word vectors and initializing positional vectors to ones) achieves strong performance, it is unstable for larger vector dimensionalities, whereas initializing both word vectors and positional vectors with the uniform distribution is both fast and stable. We have also compared the usefulness of positionally-weighted and basic word vectors for text classification and language modeling and found that both applications benefit from the positional word vectors:

These results are still using only the positional word vectors trained on the English Wikipedia without phrasing, i.e. they are not the SOTA. Nevertheless, they (a) lay theoretical foundations for the initialization and test them empirically, and (b) illustrate the general usefulness of positional weighting across two different extrinsic NLP tasks. It's not conclusive evidence, but it's evidence.
Force-pushed from 064829c to 39cb21a.
Force-pushed from 39cb21a to 8f9110a.
@Witiko could you check https://groups.google.com/g/gensim/c/5mAeWrQN7lg/m/7Ul-uMJWBAAJ please? (FYI, in case you're not getting the mailing list notifications). Thanks!
The work from this pull request is currently available in the witiko/gensim@pine fork of Gensim 3.8.3 and through the high-level MIR-MU/PInE library. The architecture and the conducted experiments are described in our (admittedly clickbaity) preprint “When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting” on arXiv.
The work from this draft has been accepted as a journal paper, which was co-authored by @piskvorky and is to appear in J.UCS 28:2. As I wrote above, we have produced the high-level PInE library, which was used for the experiments and which uses a fork of Gensim 3.8.3 for model training. If there is interest, we can rebase the fork on top of the current develop branch.
This PR adds support for position-dependent weighting for fastText CBOW with negative sampling, as discussed in issue #2840 and facebookresearch/fastText#445. The `FastText` constructor receives a new boolean parameter `position_dependent_weights`, which enables the position-dependent weighting when set to 1.

Tasks to complete:

- Split `fasttext_fast_sentence_cbow_neg` into two functions.
- Speed up by replacing loops with BLAS primitives.
- Perform parameter optimization on parameters `vector_size` (100, 200, 300, …, 700, 800), `window` (1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 40, 50), and `sample` (10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶).
- … `vector_size`.
- Remove `reduced_windows` and rerun parameter optimization on parameter `window`.
- … the `positional_dependent_weights` parameter.
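A minimal usage sketch of the new flag, assuming the API on this PR's branch (the `position_dependent_weights` keyword exists only in this PR; the remaining parameter names follow current Gensim, and the corpus is a made-up toy example):

```python
from gensim.models import FastText

# toy corpus; any iterable of tokenized sentences works
sentences = [["human", "interface", "computer"], ["survey", "user", "computer", "system"]]

# position_dependent_weights=1 enables the position-dependent weighting
# for CBOW (sg=0) with negative sampling, as described in this PR
model = FastText(
    sentences=sentences,
    vector_size=100,
    window=5,
    sg=0,
    negative=5,
    min_count=1,
    position_dependent_weights=1,
)
print(model.wv["computer"][:5])
```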