Conversation
Force-pushed from 8fb84d4 to 6168374
gluonnlp/data/candidate_sampler.py (outdated)

@numba_njit
def _candidates_mask(negatives, true_samples, true_samples_mask):
@eric-haibin-lin you mentioned you also need the accidental-hits masking feature. Please take a look at whether this would work for you and let me know if you have any suggestions.
In my case, the true and negative samples are all reshaped to 1-D, and I just do a simple broadcast comparison.
https://github.com/apache/incubator-mxnet/blob/master/example/rnn/large_word_lm/model.py#L115-L118
I'll move the candidate sampler for word embedding learning to the scripts folder, similar to your PR.
Force-pushed from a2bdd5f to f8ef830
Job PR-159/6 is complete.
""" | ||
# Set a few mxnet specific environment variables | ||
import os | ||
os.environ['MXNET_FORCE_ADDTAKEGRAD'] = '1' # Workaround for #11314 |
Do you mind adding the actual link to the issue?
gluonnlp/data/candidate_sampler.py (outdated)

# Remove accidental hits
if true_samples is not None:
    candidates_np = candidates.asnumpy()
@leezu is it more performant to use numpy than nd for sampling here?
The conversion to numpy is necessary to use just-in-time compilation with numba for the _candidates_mask function. Alternatively, the negative samples could be shared among all words in a batch, as is commonly done for language modeling (https://arxiv.org/abs/1602.02410, http://www.aclweb.org/anthology/N16-1145.pdf), which would allow a simple broadcast comparison as in https://github.com/apache/incubator-mxnet/blob/master/example/rnn/large_word_lm/model.py#L115-L118 for computing the mask. On the other hand, this would implicitly change the sampling distribution, as unrelated words (i.e. words that seldom co-occur in a context) would be more likely to be masked, given that they may still occur in the same batch.
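For concreteness, the numba route could look roughly like the following sketch; the signature matches the diff above, but the body is illustrative rather than the PR's actual implementation:

```python
import numpy as np
from numba import njit

@njit
def _candidates_mask(negatives, true_samples, true_samples_mask):
    """Return a (batch, num_negatives) mask that is 0 for accidental hits."""
    batch_size, num_negatives = negatives.shape
    _, num_true = true_samples.shape
    mask = np.ones((batch_size, num_negatives), dtype=np.uint8)
    for i in range(batch_size):
        for j in range(num_negatives):
            for k in range(num_true):
                # Only compare against valid (non-padding) true samples.
                if true_samples_mask[i, k] > 0 and negatives[i, j] == true_samples[i, k]:
                    mask[i, j] = 0
                    break
    return mask
```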
Force-pushed from e9f6198 to 2d2d47c
Job PR-159/9 is complete.
(This failed a few times due to a pandoc OSError when building the docs. It passed after rebasing.)
Job PR-159/10 is complete.
Force-pushed from e121bc9 to 941f6b1
Job PR-159/12 is complete.
Please document all the API changes at the top. We should start providing this information in releases.
Job PR-159/13 is complete.
Force-pushed from 256d68e to 79e1d60
Codecov Report

@@           Coverage Diff            @@
##            master      #159   +/-  ##
=========================================
  Coverage         ?    67.41%
=========================================
  Files            ?        62
  Lines            ?      5109
  Branches         ?         0
=========================================
  Hits             ?      3444
  Misses           ?      1665
  Partials         ?         0

Continue to review full report at Codecov.
Force-pushed from ae83748 to 1b4b6f9
Use GluonNLP load_fasttext_model for word embeddings evaluation script (instead of the custom evaluate_fasttext_bin script).
Merge TokenEmbedding.extend into TokenEmbedding.__setitem__ (previously __setitem__ was only allowed to update known tokens).
Job PR-159/37 is complete.
* Mask accidental hits
* Simplify frequent token subsampling
* Remove tqdm dependency
* Simplifications
* Support read from vec format
* Add back DeduplicatedFasttext
* Average the subword embeddings for FastText
* Fix Fasttext hash function for ngrams containing non-ASCII data (see the sketch after this list). std::string in C++ uses signed char on most implementations. While the behavior is implementation defined and binary Fasttext models trained after compiling Fasttext with different compilers may result in different behavior, let's match the behavior of the officially distributed binary models here.
* Merge train_word2vec and train_fasttext
* Clean up fasttext evaluation binary script - Fix support of loading bin Fasttext models without subwords
* Remove waitall
* Only evaluate at end of training by default
* Set mxnet env variables
* Increase number of subword units considered by default
* Update hyperparameters
* Fix cbow
* Use separate batch-size for evaluation
* Fix lint
* Rerun extended_results.ipynb and commit dependant results/*tvs files to repo
* Refactor TokenEmbedding OOV inference
* Clean up TokenEmbedding API docs
* Use GluonNLP load_fasttext_model for word embeddings evaluation script, instead of custom evaluate_fasttext_bin script
* Add tests
* Remove deprecated to_token_embedding method from train/embedding.py
* Merge TokenEmbedding.extend in TokenEmbedding.__setitem__. Previously __setitem__ was only allowed to update known tokens.
* Use full link to #11314
* Improve test coverage
* Update notebook
* Fix doc
* Cache word ngram hashes
* Move results to dmlc/web-data
* Move candidate_sampler to scripts
* Update --negative doc
* Match old default behavior of TokenEmbedding and add warnings
* Match weight context in UnigramCandidateSampler
* Add Pad test case with empty ndarray input
* Address review comments
* Fix doc and superfluous inheritance
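The ngram-hash fix mentioned in the list above comes down to sign-extending each UTF-8 byte, matching the C++ signed-char behavior of the officially distributed fastText binaries. A rough Python sketch of that FNV-1a variant (illustrative, not the PR's code):

```python
def fasttext_ngram_hash(ngram):
    """FNV-1a over the UTF-8 bytes, treating each byte as a signed char."""
    h = 2166136261
    for byte in ngram.encode('utf-8'):
        signed = byte - 256 if byte >= 128 else byte    # reinterpret byte as signed char
        h = (h ^ (signed & 0xFFFFFFFF)) & 0xFFFFFFFF    # XOR with the sign-extended byte
        h = (h * 16777619) & 0xFFFFFFFF                 # FNV prime, 32-bit wrap-around
    return h
```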
Description
This PR contains a few improvements for word embedding training and inference.
Checklist
Essentials
Changes
- TokenEmbedding.__setitem__ now allows setting vectors for new/unknown tokens by default. Can be disabled by setting allow_extend=False.
- TokenEmbedding supports unknown_lookup and unknown_autoextend arguments. If an unknown token is encountered and unknown_lookup is specified, unknown_lookup[tokens] will be called to obtain an embedding for the unknown token. If unknown_autoextend is True, a new index will be assigned to the token and the embedding will be saved in the TokenEmbedding.
- unknown_lookup can, for example, be a FasttextEmbeddingModel (renamed from load_fasttext_format), which supports reading a model.bin file created by the facebookresearch/fastText library.
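A hedged usage sketch of these changes (the keyword names follow the description above; the embedding source, vector size, and commented-out lookup setup are illustrative):

```python
import mxnet as mx
import gluonnlp as nlp

# __setitem__ may now extend the embedding with a previously unknown token.
embedding = nlp.embedding.create('fasttext', source='wiki.simple')
embedding['brandnewtoken'] = mx.nd.zeros(300)

# Passing allow_extend=False restores the old behavior (known tokens only).
# Setting unknown_lookup (e.g. to a FasttextEmbeddingModel built from a model.bin
# file) computes vectors for unknown tokens on the fly; with unknown_autoextend=True
# the computed vector is stored under a newly assigned index.
```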