
[WIP] Implement position-dependent weighting to fastText #2905

Draft · Witiko wants to merge 7 commits into base: develop from position-dependent-weighting

Conversation

Witiko
Contributor

@Witiko Witiko commented Jul 28, 2020

This PR adds support for position-dependent weighting to fastText CBOW with negative sampling, as discussed in issue #2840 and facebookresearch/fastText#445. The FastText constructor receives a new boolean parameter, position_dependent_weights, which enables position-dependent weighting when set to 1.
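For orientation, a minimal usage sketch under the assumptions of this PR (the position_dependent_weights parameter exists only on this branch; the corpus file name is hypothetical):

```python
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

corpus = LineSentence("enwiki-preprocessed.txt")  # hypothetical pre-tokenized corpus, one sentence per line

# CBOW (sg=0) with negative sampling (negative > 0), as required for position-dependent weighting
model = FastText(
    sentences=corpus,
    sg=0,
    negative=5,
    window=15,                      # the window size used for PDW in the 2017 "Advances" paper
    position_dependent_weights=1,   # the new parameter introduced by this PR
)
print(model.wv.most_similar("language"))
```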

Tasks to complete:

  • Successfully implement position-dependent weighting as described in the 2017 “Advances” paper by Mikolov et al.
  • Investigate if positional weight vectors can be replaced with simple positional scalar weights, resulting in further speedup.
  • Investigate if using the window size of 15 instead of 5 is the cause of the improved accuracy of position-dependent weighting.
  • Speed up by splitting fasttext_fast_sentence_cbow_neg into two functions.
  • Speed up by replacing loops with BLAS primitives.
  • Visualize the learnt positional weight vectors.
  • Visualize the learnt positional weight vectors for languages with free word order (Czech).
  • Visualize word vectors with the smallest/largest Var[d_p ⊙ u_{t+p}] to see which words are least/most position-dependent.
  • Using a validation subset of the English word analogy task, implement and evaluate various values of positional weight vector dimensionality (0, 5, 10, …, 295, 300), and evaluate various values of vector_size (100, 200, 300, …, 700, 800), window (1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 40, 50), and sample (10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶).
  • Remove the current hack in the initialization with the method of Pinelis (2018), and rerun parameter optimization on parameter vector_size.
  • Show that initialization with standard normal distribution slows down convergence compared to Pinelis (2018).
  • Remove reduced_windows and rerun parameter optimization on parameter window.
  • Evaluate using the average of positional weight vectors for producing the final word embeddings.
  • Implement and evaluate various initializations of frozen positional weight vectors.
  • Unit-test the position_dependent_weights parameter.
  • For the final evaluation, train on the English portion of the deduped Common Crawl.
  • For the final evaluation, test on different tasks (multilingual word analogy, word similarity, text classification, language modeling).

@Witiko Witiko marked this pull request as draft July 28, 2020 19:13
@Witiko
Contributor Author

Witiko commented Jul 28, 2020

The code currently segfaults immediately after building the vocab. @gojomo, I would appreciate a quick glance over the changes to see if there is something obvious I have missed before I start debugging the Cython code.

@gojomo
Collaborator

gojomo commented Jul 29, 2020

I don't feel familiar enough with what should be happening to proofread the kinds of indexing-operations that are likely at fault here, so I'd suggest getting right to the deeper debugging with gdb or cygdb, etc. (And I've often hoped some simple errors could be caught, debugging cython code, by changing the boundscheck to on until debugged... but that's never found a real issue for me. Still, maybe worth a quick try.)
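For reference, a minimal sketch of the bounds-checking toggle mentioned above, using Cython's pure-Python-mode decorators (the function below is a hypothetical stand-in, not the actual kernel):

```python
import cython

# File-wide alternative: put the directive comment "# cython: boundscheck=True"
# at the top of the .pyx file while debugging, then switch it back off.

@cython.boundscheck(True)   # re-enable index checking for this routine only
@cython.wraparound(False)
def debug_kernel(work, indexes):
    # When compiled, an out-of-range buffer index raises IndexError here
    # instead of silently corrupting memory or segfaulting.
    total = 0.0
    for i in range(len(indexes)):
        total += work[indexes[i]]
    return total
```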

A couple style comments:

  • opaque abbreviations like pdw are generally disfavored, even though some standard-practices from elsewhere (eg sg & hs) are tolerated. If FB's own FastText used pdw as a parameter, maybe it could be used here for consistency, but otherwise something more descriptive would be preferred, especially at the Python/public-API level. Maybe either the full expansion, position_dependent_weighting, or something more-imperative like learn_position_weights?

  • l as a loop index is disfavored for confusion with i/etc. (I even think I've seen the project auto-testing flake8 style rules reject it, though it seems to have passed in your cython code.)

Sorry my review isn't more substantive.

@Witiko
Contributor Author

Witiko commented Jul 29, 2020

Thank you for the review.

I don't feel familiar enough with what should be happening to proofread the kinds of indexing-operations that are likely at fault here, so I'd suggest getting right to the deeper debugging with gdb or cygdb, etc.

Just for documentation, c.syn0_positions is a 2 · c.window × c.size float array, i.e. it is indexed the same way as c.syn0_vocab and c.syn0_ngrams and it contains 2 · c.window positional vectors. At training time, a context vector is no longer an average of context word vectors, but an average of context word vectors multiplied (element-wise) with the corresponding positional vectors.
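A NumPy sketch of the forward pass just described (illustrative names, not the actual Cython variables): the CBOW context vector becomes the average of the element-wise products of the context word vectors with their positional vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
size, window = 100, 5

# one positional vector per window offset, shape (2 * window, size)
syn0_positions = np.ones((2 * window, size), dtype=np.float32)
# the rows of syn0_vocab/syn0_ngrams that fall inside the current window
context_words = rng.random((2 * window, size), dtype=np.float32)

# vanilla CBOW: plain average of the context word vectors
hidden_plain = context_words.mean(axis=0)

# position-dependent weighting: weight each context word vector by its positional vector first
hidden_pdw = (syn0_positions * context_words).mean(axis=0)

# with positional vectors initialized to ones, the two coincide
assert np.allclose(hidden_plain, hidden_pdw)
```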

The original paper does not investigate this, but the likely outcome is that the positional vectors grow in magnitude close to the center of the window, so that the impact of more distant words is attenuated. An interesting experiment would be to use just a 2 · c.window × 1 array to see whether having separate weights for each of the c.size features is meaningful. If not, then we may get further speed gains (and a publication) out of reducing the positional weight vectors to positional weight scalars.

(And I've often hoped some simple errors could be caught, debugging cython code, by changing the boundscheck to on until debugged... but that's never found a real issue for me. Still, maybe worth a quick try.)

Since I am not too experienced with cygdb and the diff from develop is currently minimal, I am thinking it may be faster to debug via with gil: print(…) first. Indexing is likely at fault, but c.syn0_positions may also point to the wrong thing, I'll need to investigate.

A couple style comments:

  • opaque abbreviations like pdw are generally disfavored, even though some standard-practices from elsewhere (eg sg & hs) are tolerated. If FB's own FastText used pdw as a parameter, maybe it could be used here for consistency, but otherwise something more descriptive would be preferred, especially at the Python/public-API level. Maybe either the full expansion, position_dependent_weighting, or something more-imperative like learn_position_weights?

I was thinking that if this parameter ever appears in FB's code base, pdw is what it would have been named.
Other than that, I agree that a human-readable parameter name is preferred.

  • l as a loop index is disfavored for confusion with i/etc. (I even think I've seen the project auto-testing flake8 style rules reject it, though it seems to have passed in your cython code.)

Good point. Depending on the font, l can also be mistaken for 1 in array indexing, so I suppose it's best to stay away.

Sorry my review isn't more substantive.

It's plenty substantive, thanks. I have experience with C and C++ programming, but I am new to Cython.

@Witiko Witiko force-pushed the position-dependent-weighting branch 21 times, most recently from 27ba384 to dc56ae2 on July 29, 2020 16:58
@Witiko
Contributor Author

Witiko commented Jul 29, 2020

The issue was that I assumed that c.vocab_lockf[c.indexes[m] % c.vocab_lockf_len] and c.ngrams_lockf[c.subwords_idx[m][o] % c.ngrams_lockf_len] were vectors, while they were just scalars. I updated the code, renamed the parameter pdw to position_dependent_weights, and the variable l to o, as suggested by @gojomo.

Currently, I am training a model to see if I can reproduce the experimental results from the original paper as discussed in issue #2840. If not, it's back to playing lego with Cython. Otherwise, I will be adding my experimental Jupyter Notebook to the PR. So far it seems that the new code is up to 8× slower than develop, which is not too surprising given the lack of BLAS and twice as many vector operations.

Further TODOs include:

  • Speed up by splitting fasttext_fast_sentence_cbow_neg into two functions (optional) and replacing loops with BLAS primitives (see the sketch after this list).
  • Investigate if positional weight vectors can be replaced with simple positional scalar weights, resulting in further speedup.
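As a rough illustration of the BLAS item above (gensim's Cython kernels call these routines through scipy.linalg.blas function pointers; this Python-level sketch only shows the loop-to-BLAS equivalence, not the actual integration):

```python
import numpy as np
from scipy.linalg.blas import saxpy, sdot

rng = np.random.default_rng(0)
size = 100
alpha = np.float32(0.025)
x = rng.random(size, dtype=np.float32)
y = rng.random(size, dtype=np.float32)

# explicit loop: y[i] += alpha * x[i]
y_loop = y.copy()
for i in range(size):
    y_loop[i] += alpha * x[i]

# the same update as a single BLAS call
y_blas = saxpy(x, y.copy(), a=alpha)
assert np.allclose(y_loop, y_blas)

# dot products in the loops can likewise be replaced by a single sdot call
assert np.isclose(sdot(x, y), np.dot(x, y))
```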

@Witiko Witiko changed the title from "Implement position-dependent weighting to fastText" to "[WIP] Implement position-dependent weighting to fastText" on Jul 30, 2020
@Witiko
Contributor Author

Witiko commented Aug 3, 2020

Dense updates of positional weights are a poor fit for Hogwild: a single worker may be required for an optimal rate of convergence. I am not sure how large an issue this will be in practice. If it is, perhaps updating only one positional weight vector in each minibatch could help.

@Witiko
Contributor Author

Witiko commented Aug 3, 2020

It also seems that unless we reduce the gradients to the positional vectors, both the positional and the word vectors tend to NaN. I am currently exploring this issue using positional weight scalars rather than positional weight vectors, because the latter are currently too computationally expensive to iterate on quickly.

When I update the positional weights by their full gradients (44e1d83), we end up with both positional weights and word vectors full of NaNs. When I update the positional weights by the mean gradient across all dimensions (3233115), believable positional weights that amplify the effect of words close to the center word are trained ([1.07, 1.05, 1.10, 1.28, 2.29, 2.38, 1.33, 1.08, 1.00, 0.99]), although the accuracy on the word analogy task is not improved (70.88% as opposed to 72.19% without position-dependent weighting).
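A rough NumPy sketch of the two update rules being compared, with normalization constants omitted and illustrative variable names (this is not the actual Cython code); it assumes the gradient of the loss with respect to a positional vector is the back-propagated hidden-layer error times the corresponding context word vector, element-wise:

```python
import numpy as np

rng = np.random.default_rng(0)
size, window, alpha = 100, 5, 0.025

pos_weights = np.ones((2 * window, size), dtype=np.float32)    # positional weight vectors d_p
context = rng.random((2 * window, size), dtype=np.float32)     # context word vectors u_{t+p}
neu1e = (rng.standard_normal(size) * 0.1).astype(np.float32)   # error back-propagated to the hidden layer

p = 0                                  # one context position
grad = neu1e * context[p]              # per-dimension gradient for d_p (constants omitted)

# (a) full gradient, as in 44e1d83: every dimension of d_p gets its own update
full_update = pos_weights[p] + alpha * grad

# (b) mean gradient across dimensions, as in 3233115: a single scalar update per position,
#     which is what the positional-weight-scalar experiment effectively does
mean_update = pos_weights[p] + alpha * grad.mean()
```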

@gojomo
Collaborator

gojomo commented Aug 3, 2020

Analogy scores aren't everything - I've seen cases where a word2vec model was measurably better on the real classification task for which it was needed, even though it was worse for analogies. But, I do see the original PDW paper claimed an improvement on analogies.

However, it looks like the only specific PDW-vs-non-PDW results they report (Table 2 in Section 4) are:

  • on the giant 630B-word Common Crawl dataset, requiring many days of training. (Are you using a similarly large dataset?)
  • after 6 passes of phrase-combination are applied, so while there is a PDW improvement after phrasing, there is no PDW-vs-non-PDW comparison without phrasing. (Perhaps its biggest benefit is somewhat synergistic with phrase-combinations?)
  • using window=5 in the non-PDW cases, but window=15 in the PDW case. To me that raises the question of whether the window enlargement might be a contributor to (or the complete cause of) any evaluation improvements.

If you're testing just a single scalar per position, rather than a vector (which might allow dimensions of the vectors to specialize in 'near' or 'far' influences), it strikes me that you may not be testing the same effect at all, and I wouldn't necessarily expect the same results. (And an analogy difference of just ~1.31% might just be run-to-run jitter.)

Looking at the weights your experiment learned, it seems only the 'next-door' words have very different weights (>2x). Perhaps, if there is a benefit, simply adopting some simple weighting that is more near-word-heavy than the linear drop-off of the reduced_windows effect will offer the benefit without a full optimization of the weights? (Also: it appears your PR doesn't replace the reduced_windows effect, so these weights are additive to the effect of such window clipping. I can't tell whether that was the intent of the original paper or not.)

@Witiko
Contributor Author

Witiko commented Oct 13, 2020

@gojomo To reinforce the argument for changing the initialization: in CBOW, context word embeddings are averaged and then passed to the hidden layer. Whatever the initial distribution (normal or uniform), the average tends to the normal distribution by the central limit theorem. Therefore, the distribution is unimportant as long as the variance of the hidden layer input stays the same.

We don't want to initialize, for example, the positional embeddings to ones, since that makes the gradient updates of the word embeddings blow up, as discussed in #2905 (comment). Instead, we want to initialize both the positional and the word embeddings with the same distribution (uniform, normal, ...), chosen so that their product has the same variance (1 / fanout) as the initial word embeddings in vanilla fastText.
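A quick numerical sketch of that constraint (the 1/fanout target is taken from the comment above; the concrete zero-mean normal factors are only an illustration): for independent zero-mean factors, Var(word · pos) = Var(word) · Var(pos), so giving each factor variance sqrt(1/fanout) makes the element-wise product match the stated 1/fanout target.

```python
import numpy as np

rng = np.random.default_rng(0)
fanout = 300                       # vector_size, i.e. the fan-out of the hidden layer
target_var = 1.0 / fanout          # the target variance stated in the comment above

# For independent zero-mean factors, Var(word * pos) = Var(word) * Var(pos),
# so give each factor a variance of sqrt(1/fanout).
factor_std = (1.0 / fanout) ** 0.25
word = rng.standard_normal(1_000_000) * factor_std
pos = rng.standard_normal(1_000_000) * factor_std

print(np.var(word * pos), target_var)   # both come out around 1/300 ≈ 0.0033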

@gojomo
Collaborator

gojomo commented Oct 13, 2020

If the distribution is unimportant, then faithfully matching the actual word2vec.c/fasttext practices that Gensim claims to re-implement in Python would seem the dominant consideration. I believe every deviation from the expectations set by the reference implementations needs a strong justification. Even if it's considered nearly riskless to quality-of-results, at some point someone reading the code will wonder, and then ask on the list or a forum, "why is Gensim using a differently-distributed initialization?" I don't find "it makes no difference either way, so we chose to be different" a very satisfying answer. (Also: a quick timeit spot check indicates .normal() is about 4x slower than .random(), which is likely to be noticeable to users with large models.)
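That spot check is easy to reproduce; a sketch along these lines (the array size and the use of NumPy's Generator API are my choices, not necessarily what was timed):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
n = 10_000_000   # roughly the number of floats in a mid-sized embedding matrix

t_uniform = timeit.timeit(lambda: rng.random(n, dtype=np.float32), number=10)
t_normal = timeit.timeit(lambda: rng.standard_normal(n, dtype=np.float32), number=10)
print(f"uniform: {t_uniform:.2f}s  normal: {t_normal:.2f}s  ratio: {t_normal / t_uniform:.1f}x")
```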

Somewhat relatedly: in the current re-factoring of the initialization for other purposes (#2955 & #2975), I've noticed for the first time that Google's word2vec.c and Facebook's FastText actually use different per-dimension initializations: word2vec.c uses uniform random over (-0.5/fanout, 0.5/fanout], while FastText uses uniform random over (-1.0/fanout, 1.0/fanout].
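A sketch of the two ranges as described above (the matrix shape and RNG are arbitrary choices for illustration, not the tools' actual seeding schemes):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, vector_size = 1000, 100
fanout = vector_size

# word2vec.c-style: uniform over (-0.5/fanout, 0.5/fanout]
w2v_init = (rng.random((vocab_size, vector_size)) - 0.5) / fanout

# FastText-style: uniform over (-1.0/fanout, 1.0/fanout]
ft_init = (rng.random((vocab_size, vector_size)) * 2.0 - 1.0) / fanout

print(w2v_init.std(), ft_init.std())   # FastText's initial weights are about twice as wide
```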

@Witiko
Contributor Author

Witiko commented Oct 13, 2020

@gojomo

a quick timeit spot check indicates .normal() about 4x slower than .random() - which is likely to be noticeable to users with large models

Thank you for the measurements. I wonder, compared to training time, would the difference in initialization time be significant?

If the distribution is unimportant, then faithfully matching the actual word2vec.c/fasttext practices that Gensim claims to re-implement in Python would seem the dominant consideration. [...] "why is Gensim using a differently-distributed initialization?" I don't find, "it makes no difference either way, so we chose to be different" a very satisfying answer.

To be clear, the switch to normal distribution is not "to be different", but because no reference implementation is available and the uniform distribution is difficult to factorize as discussed in #2905 (comment). We need to factorize, because positional weighting multiplies word and positional vectors before the hidden layer, and it is their product that needs to have the proper distribution.

word2vec.c uses uniform random over (-0.5/fanout, 0.5/fanout], while FastText uses uniform random over (-1.0/fanout, 1.0/fanout].

Interesting! I can't say I know why the choice was made, but it is likely to decrease the initial effective learning rate in word2vec compared to fastText.

@gojomo
Collaborator

gojomo commented Oct 13, 2020

Wouldn't initializing the new positional weights however they need to be initialized be sufficient? Then, word-vector/ngram-vector initialization can remain conventional. Separately, even in the absence of a PDW reference implementation, can you ask the PDW paper authors what they actually did, and whether they changed the word-vector initialization in any way?

@Witiko
Contributor Author

Witiko commented Oct 13, 2020

@gojomo

Wouldn't initializing the new positional weights however they need to be initialized be sufficient?

That's actually what the current code does: word vectors are initialized with uniform distribution and positional vectors are initialized to ones. This is nice and intuitive, but it leads to exploding gradients of word vectors. Unless we arbitrarily decrease the gradients, the word vectors tend to NaN. Changing the initialization, at least when position-dependent weighting is enabled, would allow us to avoid the arbitrary gradient clipping.

Can you ask the PDW paper authors what they actually did, and whether they changed the word-vector initialization in any way?

I asked on October 5, but I have received no response yet.

@Witiko
Contributor Author

Witiko commented Oct 16, 2020

We held a seminar session on position-dependent weighting with our research group yesterday.
So far, initializing with the square of the normal distribution seems to be the best option; see 1:25:17 to 1:50:00 in the linked video.

@Witiko Witiko force-pushed the position-dependent-weighting branch 2 times, most recently from 76a6da8 to 109caad on October 29, 2020 17:16
@Witiko Witiko force-pushed the position-dependent-weighting branch from 109caad to 94a57ff on November 9, 2020 11:32
@Witiko
Contributor Author

Witiko commented Nov 9, 2020

@gojomo I bring good news: I got rid of the gradient clipping hack without changing the initialization, i.e. we are keeping the vanilla fastText initialization for the word and n-gram vectors and initializing the positional vectors to ones. The exploding gradients were related to improper normalization of the gradient updates to the positional weights, which I have fixed in 94a57ff. Since the gradient clipping hack decreased the learning rate for positional weights, we see another 5% increase in the accuracy* on the English word analogy task:

|           | branch positional-dependent-vectors, pdw=0 | branch positional-dependent-vectors, pdw=1, before 94a57ff | branch positional-dependent-vectors, pdw=1, after 94a57ff |
|-----------|--------------------------------------------|-------------------------------------------------------------|------------------------------------------------------------|
| window=5  | 65.52%*                                    | 70.37%*                                                     | 74.14%*                                                    |
| window=15 | 61.60%*                                    | 71.01%*                                                     | 75.02%*                                                    |

I am currently re-running the vector_size experiment, which was the one most affected by the gradient clipping hack. After that, I will rerun the position_dependent_vector_size experiment, which could catapult us to 80%* accuracy, i.e. within 8% of the current English word analogy SOTA without phrasing and using just the English Wikipedia corpus (4% of Common Crawl). After that, I'll be trying to reproduce the phrasing, which should give us another 5% bump, but this may take some trial and error (if it is reproducible at all) due to the missing parameters in the 2017 “Advances” paper by Mikolov et al. as discussed in the Gensim group. Finally, training on Common Crawl should get us past the English word analogy SOTA. After that, we can be done with the training and get to a proper evaluation on a multitude of end tasks.

@piskvorky
Owner

piskvorky commented Nov 14, 2020

That looks awesome. Do I understand correctly that the "improper normalization" is present in the current Gensim too?
How about FB's fastText?

@Witiko
Contributor Author

Witiko commented Nov 14, 2020

That looks awesome. Do I understand correctly the "improper normalization" is present in the current Gensim too?
How about FB's fastText?

The improper normalization is only related to this PR and the positional vectors, i.e. there should be no issue with either the current Gensim or FB's fastText.

@Witiko Witiko force-pushed the position-dependent-weighting branch 2 times, most recently from fa9dfcf to 94a57ff on November 15, 2020 02:34
@Witiko
Contributor Author

Witiko commented Nov 20, 2020

I re-ran the vector_size experiment, which was perhaps the one most affected by the gradient clipping hack. The hack decreased the learning rate proportionately to vector_size, so the model would not converge for larger values of vector_size. Below, you can see the results before removing the gradient clipping hack (top) and after removing the gradient clipping hack (bottom):

[Figures: English word analogy accuracy as a function of vector_size, before (top) and after (bottom) removing the gradient clipping hack]

We now reach a higher maximum accuracy* on the English word analogy task and never get below the accuracy* of vanilla fastText. I am currently re-running the position_dependent_vector_size experiment, which should catapult us to 80% accuracy*, i.e. within 8% of the current English word analogy SOTA without phrasing and using just the English Wikipedia corpus (4% of Common Crawl).

@Witiko
Contributor Author

Witiko commented Nov 22, 2020

@piskvorky @gojomo In two upcoming articles from our research group, we have compared different initializations of positional weighting on the English word analogy task. The results show that although the simplest initialization option discussed in #2905 (comment) (keeping the default initialization for word vectors and initializing positional vectors to ones) achieves strong performance, it is unstable for larger vector dimensionalities, whereas initializing both word vectors and positional vectors with the uniform distribution is both fast and stable.

[Figure: English word analogy accuracy for the compared initializations of positional weighting]

We have also compared the usefulness of positionally-weighted and basic word vectors for text classification and language modeling, and found that both applications benefit from the positional word vectors:

[Figures: text classification and language modeling results for positionally-weighted vs. basic word vectors]

These results still use only positional word vectors trained on the English Wikipedia without phrasing, i.e. they are not the SOTA. Nevertheless, they (a) lay theoretical foundations for the initialization and test them empirically, and (b) illustrate the general usefulness of positional weighting across two different extrinsic NLP tasks. It's not conclusive evidence, but it's evidence.

@Witiko Witiko force-pushed the position-dependent-weighting branch 2 times, most recently from 064829c to 39cb21a on November 30, 2020 13:28
@Witiko Witiko force-pushed the position-dependent-weighting branch from 39cb21a to 8f9110a on November 30, 2020 16:35
@piskvorky
Owner

piskvorky commented Dec 3, 2020

@Witiko could you check https://groups.google.com/g/gensim/c/5mAeWrQN7lg/m/7Ul-uMJWBAAJ please? (FYI, in case you're not getting the mailing list notifications). Thanks!

@Witiko
Contributor Author

Witiko commented Apr 27, 2021

The work from this pull request is currently available in the witiko/gensim@pine fork of Gensim 3.8.3 and through the high-level MIR-MU/PInE library. The architecture and the conducted experiments are described in our (admittedly clickbaity) preprint “When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting” on arXiv.

@Witiko
Contributor Author

Witiko commented Jan 14, 2022

The work from this draft has been accepted as a journal paper, which was co-authored by @piskvorky and is to appear in J.UCS 28:2. As I wrote above, we have produced the high-level PInE library, which was used for the experiments and which uses a fork of Gensim 3.8.3 for model training. If there is interest, we can rebase the fork on top of the current develop branch.
