[WIP] Implement position-dependent weighting to fastText #2905
base: develop
Conversation
The code currently segfaults immediately after building the vocab. @gojomo, I would appreciate a quick glance over the changes to see if there is something obvious I have missed before I start debugging the Cython code.
I don't feel familiar enough with what should be happening to proofread the kinds of indexing operations that are likely at fault here, so I'd suggest getting right to the deeper debugging. A couple of style comments:
Sorry my review isn't more substantive.
Thank you for the review.
Just for documentation: the original paper does not investigate this, but the likely outcome is that the positional vectors will grow in magnitude close to the center of the window, so that the impact of more distant words is attenuated. An interesting experiment would be to use just a single scalar weight per position instead of a full vector.
Since I am not too experienced with …
I was thinking that if this parameter ever appears in FB's code base, …
Good point. Depending on the font, …
It's plenty substantive, thanks. I have experience with C and C++ programming, but I am new to Cython.
Force-pushed from 27ba384 to dc56ae2.
The issue was that I assumed that …

Currently, I am training a model to see if I can reproduce the experimental results from the original paper, as discussed in issue #2840. If not, it's back to playing Lego with Cython. Otherwise, I will be adding my experimental Jupyter Notebook to the PR. So far, it seems that the new code is up to 8× slower than the existing CBOW code. Further TODOs include:
Dense updates of positional weights are a poor fit for Hogwild. A single worker may be required for an optimal rate of convergence. Not sure how large of an issue this will be in practice. If it is, perhaps updating only one weight vector in each minibatch could be helpful.
It also seems that unless we reduce the gradients to the positional vectors, both positional and word vectors tend to NaN. I am currently exploring this issue using positional weight scalars as opposed to positional weight vectors, because the latter are currently too computationally expensive to iterate on quickly. When I update positional weights by their full gradients (44e1d83), we end up with both positional weights and word vectors full of NaNs. When I update positional weights by a mean gradient across all dimensions (3233115), believable positional weight vectors that amplify the effect of words close to the center word are trained.
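For intuition, a minimal NumPy sketch of the two update rules being contrasted (full per-dimension gradient vs. mean gradient across dimensions), assuming the hidden layer is the mean of element-wise products of word and positional vectors; this is an illustration, not the PR's Cython code, and all names and sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, window = 4, 2                                    # toy sizes
ctx_vecs = rng.uniform(-0.5 / dim, 0.5 / dim, (2 * window, dim))  # context word vectors
pos_vecs = np.ones((2 * window, dim))                 # positional vectors initialized to ones
out_vec = rng.normal(0.0, 0.1, dim)                   # one output (negative-sampling) vector
alpha, label = 0.025, 1.0

# forward pass: hidden layer = mean of element-wise products (position-dependent weighting)
hidden = (ctx_vecs * pos_vecs).mean(axis=0)
pred = 1.0 / (1.0 + np.exp(-hidden @ out_vec))
g = (label - pred) * out_vec                          # gradient w.r.t. the hidden layer

grad_ctx = g * pos_vecs / len(ctx_vecs)               # gradient w.r.t. each context word vector
grad_pos_full = g * ctx_vecs / len(ctx_vecs)          # full per-dimension gradient (cf. 44e1d83)
grad_pos_mean = grad_pos_full.mean(axis=1, keepdims=True)  # mean across dimensions (cf. 3233115)

ctx_vecs += alpha * grad_ctx
pos_vecs += alpha * grad_pos_full                     # or: alpha * grad_pos_mean
```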
Analogy scores aren't everything - I've seen cases where a word2vec model was measurably better on the real classification task for which it was needed, even though it was worse for analogies. But, I do see the original PDW paper claimed an improvement on analogies. However, it looks like the only specific PDW-vs-non-PDW results they report (Table 2 in Section 4) are:
If you're testing just a single scalar per position, rather than a vector (that might allow dimensions of the vectors to specialize in 'near' or 'far' influences), it strikes me you may not be testing the same effect at all, and I wouldn't necessarily expect the same results. (And, an analogy difference of just ~1.31% might just be run-to-run jitter.) Looking at the weights your experiment learned, it seems only the 'next-door' words have very-different weights (>2x). Perhaps if there is a benefit, simply adopting some simple weighting that's more near-word-heavy than the linear dropping via the reduced_windows trick could capture most of it.
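As a rough illustration of the contrast described above, a short sketch comparing the linear drop-off you get when the effective window is drawn uniformly from {1, …, window} (the reduced_windows trick) with one example of a fixed, near-word-heavy weighting; the harmonic weights are an arbitrary choice for illustration, not something from the paper or this PR:

```python
import numpy as np

window = 5
distances = np.arange(1, window + 1)

# expected inclusion probability of a context word at distance d when the
# effective window is drawn uniformly from {1, ..., window}: linear drop-off
linear = (window - distances + 1) / window

# one possible fixed, near-word-heavy weighting (purely illustrative)
harmonic = 1.0 / distances

for d, lin, har in zip(distances, linear, harmonic):
    print(f"distance {d}: linear {lin:.2f}, harmonic {har:.2f}")
```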
@gojomo To reinforce the argument for changing the initialization: in CBOW, the context word embeddings are averaged and then passed to the hidden layer. Whatever the initial distribution (normal or uniform), the average tends to the normal distribution by the central limit theorem. Therefore, the distribution is unimportant as long as the variance of the hidden-layer input stays the same. We don't want to initialize, e.g., the positional embeddings to ones, since that makes the gradient updates of the word embeddings blow up, as discussed in #2905 (comment). Instead, we want to initialize both positional and word embeddings with the same distribution (either uniform, or normal, or ...) whose product has the same variance (1 / fanout) as the initial word embeddings in vanilla fastText.
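A quick numerical check of this factorization argument (illustrative only; the target variance is a stand-in for whatever the vanilla initialization uses, and the normal distribution is just one of the options mentioned):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 300
target_var = 1.0 / dim             # stand-in for the desired variance of the product

# For independent zero-mean entries, Var(word * positional) = Var(word) * Var(positional),
# so drawing both factors with std = target_var ** 0.25 gives the product the target variance.
std = target_var ** 0.25
word = rng.normal(0.0, std, size=1_000_000)
positional = rng.normal(0.0, std, size=1_000_000)

print(np.var(word * positional))   # empirical variance of the product
print(target_var)                  # should be close to the value above
```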
If the distribution is unimportant, then faithfully matching the actual FB fastText initialization, rather than doing something different, would seem the way to go. Somewhat relatedly: in the current refactoring of the initialization for other purposes (#2955 & #2975), I've noticed for the first time that Google's word2vec initializes vectors differently than FB's fastText.
Thank you for the measurements. I wonder, compared to training time, would the difference in initialization time be significant?
To be clear, the switch to normal distribution is not "to be different", but because no reference implementation is available and the uniform distribution is difficult to factorize as discussed in #2905 (comment). We need to factorize, because positional weighting multiplies word and positional vectors before the hidden layer, and it is their product that needs to have the proper distribution.
Interesting! I can't say I know why the choice was made, but it is likely to decrease the initial effective learning rate in word2vec compared to fastText.
Wouldn't initializing the new positional weights however they need to be initialized be sufficient? Then, word-vector/ngram-vector initialization can remain conventional. Separately, even in the absence of a PDW reference implementation, can you ask the PDW paper authors what they actually did, and whether they changed the word-vector initialization in any way?
That's actually what the current code does: word vectors are initialized with uniform distribution and positional vectors are initialized to ones. This is nice and intuitive, but it leads to exploding gradients of word vectors. Unless we arbitrarily decrease the gradients, the word vectors tend to NaN. Changing the initialization, at least when position-dependent weighting is enabled, would allow us to avoid the arbitrary gradient clipping.
I asked on October 5, but I have received no response yet.
We held a seminar session regarding the position-dependent weighting with our research group yesterday.
Force-pushed from 76a6da8 to 109caad.
Force-pushed from 109caad to 94a57ff.
@gojomo I bring good news: I got rid of the gradient clipping hack without changing the initialization, i.e. we are keeping the vanilla fastText initialization for word and n-gram vectors and we initialize the positional vectors to ones. The exploding gradients were related to improper normalization of gradient updates to positional weights, which I have fixed in 94a57ff. Since the gradient clipping hack decreased the learning rate for positional weights, we see another 5% increase in the English word analogy task accuracy*:
I am currently re-running the experiments.
That looks awesome. Do I understand correctly that the "improper normalization" is present in the current Gensim too?
The improper normalization is only related to this PR and the positional vectors, i.e. there should be no issue with either the current Gensim or FB's fastText.
Force-pushed from fa9dfcf to 94a57ff.
I re-ran the experiments. We now reach a higher maximum accuracy* on the English word analogy task and never get below the accuracy* of vanilla fastText. I am currently re-running the remaining experiments.
@piskvorky @gojomo In two upcoming articles from our research group, we have compared different initializations of positional weighting on the English word analogy task. The results show that although the simplest initialization option discussed in #2905 (comment) (keeping the default initialization for word vectors and initializing positional vectors to ones) achieves strong performance, it is unstable for larger vector dimensionalities, whereas initializing both word vectors and positional vectors with the uniform distribution is both fast and stable. We have also compared the usefulness of positionally-weighted and basic word vectors for text classification and language modeling and found that both applications benefit from the positional word vectors:

These results are still using only the positional word vectors trained on the English Wikipedia without phrasing, i.e. they are not the SOTA. Nevertheless, they (a) lay theoretical foundations for the initialization and test them empirically, and (b) illustrate the general usefulness of positional weighting across two different extrinsic NLP tasks. It's not conclusive evidence, but it's evidence.
Force-pushed from 064829c to 39cb21a.
Force-pushed from 39cb21a to 8f9110a.
@Witiko could you check https://groups.google.com/g/gensim/c/5mAeWrQN7lg/m/7Ul-uMJWBAAJ please? (FYI, in case you're not getting the mailing list notifications). Thanks!
The work from this pull request is currently available in the witiko/gensim@pine fork of Gensim 3.8.3 and through the high-level MIR-MU/PInE library. The architecture and the conducted experiments are described in our (admittedly clickbaity) preprint “When FastText Pays Attention: Efficient Estimation of Word Representations using Constrained Positional Weighting” on arXiv.
The work from this draft has been accepted as a journal paper, which was co-authored by @piskvorky and is to appear in J.UCS 28:2. As I wrote above, we have produced the high-level PInE library, which was used for the experiments and which uses a fork of Gensim 3.8.3 for model training. If there is interest, we can rebase the fork on top of the current develop branch.
This PR adds support for position-dependent weighting for fastText CBOW with negative sampling, as discussed in issue #2840 and facebookresearch/fastText#445. The `FastText` constructor receives a new boolean parameter `position_dependent_weights`, which enables the position-dependent weighting when set to 1.

Tasks to complete:

- Split `fasttext_fast_sentence_cbow_neg` into two functions.
- Speed up by replacing loops with BLAS primitives.
- Perform parameter optimization on parameters `vector_size` (100, 200, 300, …, 700, 800), `window` (1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 40, 50), and `sample` (10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶).
- … `vector_size`.
- Remove `reduced_windows` and rerun parameter optimization on parameter `window`.
- … the `positional_dependent_weights` parameter.
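A minimal usage sketch of the new flag, assuming the API on this PR's branch (the `position_dependent_weights` keyword exists only in this PR; the remaining parameter names follow current Gensim, and the corpus is a made-up toy example):

```python
from gensim.models import FastText

# toy corpus; any iterable of tokenized sentences works
sentences = [["human", "interface", "computer"], ["survey", "user", "computer", "system"]]

# position_dependent_weights=1 enables the position-dependent weighting
# for CBOW (sg=0) with negative sampling, as described in this PR
model = FastText(
    sentences=sentences,
    vector_size=100,
    window=5,
    sg=0,
    negative=5,
    min_count=1,
    position_dependent_weights=1,
)
print(model.wv["computer"][:5])
```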