Window alignment options are added for Word2Vec Skip-gram #3460
In the original Skip-gram model, the context window includes both the left and right sides of the target word. However, for research purposes, researchers may need word embedding models that only consider words on one side of the target word, in order to improve accuracy without enlarging the window size. For instance, in the paper "Intrinsic and Extrinsic Evaluation of Word Embedding Models", the researchers compared centered (default), left-aligned, and right-aligned window configurations for the Turkish language. The need for this feature was also discussed in a StackOverflow post: https://stackoverflow.com/q/63101674/16530078
In this pull request, I attempted to introduce a new parameter to the Word2Vec class that makes it easy to use left-aligned or right-aligned windows.
To illustrate: if `window_alignment` is set to 0, it takes `window` words from both the left and the right (`2 * window` words in total). If `window_alignment` is -1 or 1, it takes `window` words from only the left or only the right, respectively. Of course, if `shrink_windows = True`, it may still affect the number of words used in training.
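
The alignment rule described above can be sketched in plain Python. This is only an illustration of the intended behavior, not the code changed in this pull request: the helper name `context_positions` and its signature are hypothetical, and the effect of `shrink_windows` is left out.

```python
def context_positions(pos, sentence_len, window, window_alignment=0):
    """Return the indices of the context words for the target at `pos`.

    Hypothetical helper for illustration only; not part of gensim.
    window_alignment: 0 = centered, -1 = left only, 1 = right only.
    """
    if window_alignment not in (-1, 0, 1):
        raise ValueError("window_alignment must be -1, 0, or 1")
    # The left side is used for centered (0) and left-aligned (-1) windows.
    left = window if window_alignment <= 0 else 0
    # The right side is used for centered (0) and right-aligned (1) windows.
    right = window if window_alignment >= 0 else 0
    # Clip the window to the sentence boundaries.
    start = max(0, pos - left)
    end = min(sentence_len, pos + right + 1)
    # Exclude the target word itself.
    return [i for i in range(start, end) if i != pos]


sentence = ["the", "quick", "brown", "fox", "jumps", "over", "dog"]
for alignment in (0, -1, 1):
    idx = context_positions(3, len(sentence), window=2, window_alignment=alignment)
    print(alignment, [sentence[i] for i in idx])
# 0 ['quick', 'brown', 'jumps', 'over']
# -1 ['quick', 'brown']
# 1 ['jumps', 'over']
```

With `window=2` and the target word "fox", the centered window yields four context words, while each one-sided window yields two.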