
Window alignment options are added for Word2Vec Skip-gram #3460

Closed
wants to merge 1 commit

Conversation

KarahanS

In the original Skip-gram model, the context window includes words on both the left and right sides of the target word. However, for research purposes, researchers may need word embedding models that consider only the words on one side of the target word, in order to improve accuracy without enlarging the window size. For instance, in the paper "Intrinsic and Extrinsic Evaluation of Word Embedding Models", the researchers compared centered (default), left-aligned, and right-aligned window configurations for Turkish. The need for this feature was also discussed in a StackOverflow post: https://stackoverflow.com/q/63101674/16530078

In this pull request, I introduce a new parameter to the Word2Vec class that makes it easy to use left- or right-aligned windows.
It works like this: if window_alignment is set to 0 (the default), the model takes window words from both the left and the right (2 * window words in total). If window_alignment is -1 or 1, it takes window words from only the left or only the right, respectively. Of course, if shrink_windows = True, that may still affect the number of words actually used in training.
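The described semantics can be sketched in plain Python. This is a hypothetical illustration of the parameter's behavior, not the actual Cython code from the PR; the helper name context_indices is invented for this sketch:

```python
def context_indices(pos, window, n_tokens, window_alignment=0):
    """Return context positions for the target word at `pos`.

    window_alignment: -1 = left-only, 0 = centered (default), 1 = right-only.
    Hypothetical helper illustrating the proposed parameter, not gensim code.
    """
    if window_alignment <= 0:       # include left context
        start = max(0, pos - window)
    else:
        start = pos + 1
    if window_alignment >= 0:       # include right context
        end = min(n_tokens, pos + window + 1)
    else:
        end = pos
    return [i for i in range(start, end) if i != pos]

# Target at position 5 in a 10-token sentence, window=2:
print(context_indices(5, 2, 10, 0))    # centered    -> [3, 4, 6, 7]
print(context_indices(5, 2, 10, -1))   # left-only   -> [3, 4]
print(context_indices(5, 2, 10, 1))    # right-only  -> [6, 7]
```

Note that with alignment -1 or 1 the window contains window words rather than 2 * window, matching the description above.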

@piskvorky
Owner

piskvorky commented Mar 14, 2023

Makes sense, interesting idea, thanks. The implementation seems clean and not intrusive to existing workflows, both performance-wise and maintenance-wise.

If we do implement this for word2vec though, we should do the same for the other models that accept window: doc2vec, fasttext, corpus_file… Are you able to add that too?

Basically there's a bunch of analogous code paths and we should make them all consistent.

@KarahanS
Author

Not immediately, but I think I can work on them. I'll be examining the code paths in detail to come up with the improvements.

@gojomo
Collaborator

gojomo commented Mar 15, 2023

Some previous discussion & a prior half-stab at asymmetric windows in #2172 & #2173.

Have any experiments suggested that one-sided (or otherwise imbalanced) windows offer a benefit, and for which needs?

Without any clear example of where lopsided windows might be beneficial, I'd lean against including this as a standard capability.

OTOH, if there is a real benefit to varying this behavior, having the configurability be just "left-only", "right-only", or "both sides", via a window_alignment parameter with special indicator-values {-1, 0, 1}, could be limiting. Why not allow varied windows on either side? Given how the window-loop works, adding a single optional window-offset parameter of some sort might allow arbitrary contiguous windows near the target word.
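The single-offset idea above could look something like the following sketch. This is a hypothetical interpretation of the suggestion, not an actual gensim API; note its semantics differ slightly from window_alignment in that the window keeps its full 2 * window width and is shifted as a whole:

```python
def offset_window(pos, window, n_tokens, window_offset=0):
    """Contiguous window of up to 2*window words, shifted by window_offset.

    window_offset=0       -> centered window (current behavior)
    window_offset=-window -> entirely left of the target
    window_offset=+window -> entirely right of the target
    Hypothetical sketch of the suggestion above, not gensim code.
    """
    start = pos - window + window_offset
    end = pos + window + window_offset + 1
    return [i for i in range(max(0, start), min(n_tokens, end)) if i != pos]

# Target at position 5 in a 10-token sentence, window=2:
print(offset_window(5, 2, 10, 0))    # centered -> [3, 4, 6, 7]
print(offset_window(5, 2, 10, -2))   # shifted fully left  -> [1, 2, 3, 4]
print(offset_window(5, 2, 10, 2))    # shifted fully right -> [6, 7, 8, 9]
```

Any intermediate offset yields an arbitrary contiguous window near the target, which is what makes this more general than a three-valued alignment flag.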

Or potentially, a mask or weighting-array, a la the fasttext position-dependent weighting of #2905, might allow many other window variations – in shape & relative influence.
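A weighting-array variant might be sketched as below. This is purely illustrative of the idea (one weight per relative offset, with zero weight masking a position out); the helper weighted_context is invented here and is not gensim's or fastText's implementation:

```python
def weighted_context(pos, tokens, weights):
    """Collect (token, weight) context pairs for the target at `pos`.

    `weights` maps relative offsets (e.g. -2, -1, 1, 2) to floats;
    a zero weight drops that position entirely. Hypothetical sketch only.
    """
    pairs = []
    for off, w in weights.items():
        j = pos + off
        if w != 0.0 and 0 <= j < len(tokens):
            pairs.append((tokens[j], w))
    return pairs

sent = ["a", "b", "c", "d", "e"]
# Triangular weights on the left, zeros on the right -> a weighted left-only window:
print(weighted_context(2, sent, {-2: 0.5, -1: 1.0, 1: 0.0, 2: 0.0}))
# -> [('a', 0.5), ('b', 1.0)]
```

Such a scheme subsumes both the alignment flag and the offset parameter, at the cost of a more complex interface.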

@KarahanS KarahanS closed this Apr 13, 2023