
Window alignment options are added for Word2Vec Skip-gram #3460

Closed
wants to merge 1 commit

Conversation

KarahanS

In the original Skip-gram model, the context window includes words on both the left and right sides of the target word. However, for research purposes, researchers may need word embedding models that consider only the words on one side of the target word, in order to improve accuracy without enlarging the window size. For instance, in the paper "Intrinsic and Extrinsic Evaluation of Word Embedding Models", the researchers compared centered (default), left-aligned, and right-aligned window configurations for Turkish. The need for this feature was also discussed in a StackOverflow post: https://stackoverflow.com/q/63101674/16530078

In this pull request, I introduce a new parameter to the Word2Vec class that makes it easy to use left- or right-aligned windows.
It works like this: if window_alignment is set to 0 (the default), the model takes window words from both the left and the right (2 * window words in total). If window_alignment is -1 or 1, it takes window words from only the left or only the right, respectively. Of course, if shrink_windows = True, that may still affect the number of words actually used in training.
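The described semantics can be sketched in plain Python. This is a hypothetical illustration of the parameter's behavior, not the actual Cython code from the PR; the helper name context_indices is invented for this sketch:

```python
def context_indices(pos, window, n_tokens, window_alignment=0):
    """Return context positions for the target word at `pos`.

    window_alignment: -1 = left-only, 0 = centered (default), 1 = right-only.
    Hypothetical helper illustrating the proposed parameter, not gensim code.
    """
    if window_alignment <= 0:       # include left context
        start = max(0, pos - window)
    else:
        start = pos + 1
    if window_alignment >= 0:       # include right context
        end = min(n_tokens, pos + window + 1)
    else:
        end = pos
    return [i for i in range(start, end) if i != pos]

# Target at position 5 in a 10-token sentence, window=2:
print(context_indices(5, 2, 10, 0))    # centered    -> [3, 4, 6, 7]
print(context_indices(5, 2, 10, -1))   # left-only   -> [3, 4]
print(context_indices(5, 2, 10, 1))    # right-only  -> [6, 7]
```

Note that with alignment -1 or 1 the window contains window words rather than 2 * window, matching the description above.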

@piskvorky
Owner

piskvorky commented Mar 14, 2023

Makes sense, interesting idea, thanks. The implementation seems clean and not intrusive to existing workflows, both performance-wise and maintenance-wise.

If we do implement this for word2vec though, we should do the same for the other models that accept window: doc2vec, fasttext, corpus_file… Are you able to add that too?

Basically there's a bunch of analogous code paths and we should make them all consistent.

@KarahanS
Author

Not immediately, but I think I can work on them. I'll be examining the code paths in detail to come up with the improvements.

@gojomo
Collaborator

gojomo commented Mar 15, 2023

Some previous discussion & a prior half-stab at asymmetric windows in #2172 & #2173.

Have any experiments suggested that one-sided (or otherwise imbalanced) windows offer a benefit, and for which needs?

Without any clear example of where lopsided windows might be beneficial, I'd lean against including this as a standard capability.

OTOH, if there is a real benefit to varying this behavior, having the configurability be just "left-only", "right-only", or "both sides", via a window_alignment parameter with special indicator-values {-1, 0, 1}, could be limiting. Why not allow varied windows on either side? Given how the window-loop works, adding a single optional window-offset parameter of some sort might allow arbitrary contiguous windows near the target word.
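The single-offset idea above could look something like the following sketch. This is a hypothetical interpretation of the suggestion, not an actual gensim API; note its semantics differ slightly from window_alignment in that the window keeps its full 2 * window width and is shifted as a whole:

```python
def offset_window(pos, window, n_tokens, window_offset=0):
    """Contiguous window of up to 2*window words, shifted by window_offset.

    window_offset=0       -> centered window (current behavior)
    window_offset=-window -> entirely left of the target
    window_offset=+window -> entirely right of the target
    Hypothetical sketch of the suggestion above, not gensim code.
    """
    start = pos - window + window_offset
    end = pos + window + window_offset + 1
    return [i for i in range(max(0, start), min(n_tokens, end)) if i != pos]

# Target at position 5 in a 10-token sentence, window=2:
print(offset_window(5, 2, 10, 0))    # centered -> [3, 4, 6, 7]
print(offset_window(5, 2, 10, -2))   # shifted fully left  -> [1, 2, 3, 4]
print(offset_window(5, 2, 10, 2))    # shifted fully right -> [6, 7, 8, 9]
```

Any intermediate offset yields an arbitrary contiguous window near the target, which is what makes this more general than a three-valued alignment flag.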

Or potentially, a mask or weighting-array, a la the fasttext position-dependent weighting of #2905, might allow many other window variations – in shape & relative influence.
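A weighting-array variant might be sketched as below. This is purely illustrative of the idea (one weight per relative offset, with zero weight masking a position out); the helper weighted_context is invented here and is not gensim's or fastText's implementation:

```python
def weighted_context(pos, tokens, weights):
    """Collect (token, weight) context pairs for the target at `pos`.

    `weights` maps relative offsets (e.g. -2, -1, 1, 2) to floats;
    a zero weight drops that position entirely. Hypothetical sketch only.
    """
    pairs = []
    for off, w in weights.items():
        j = pos + off
        if w != 0.0 and 0 <= j < len(tokens):
            pairs.append((tokens[j], w))
    return pairs

sent = ["a", "b", "c", "d", "e"]
# Triangular weights on the left, zeros on the right -> a weighted left-only window:
print(weighted_context(2, sent, {-2: 0.5, -1: 1.0, 1: 0.0, 2: 0.0}))
# -> [('a', 0.5), ('b', 1.0)]
```

Such a scheme subsumes both the alignment flag and the offset parameter, at the cost of a more complex interface.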

@KarahanS KarahanS closed this Apr 13, 2023