Move corpus classes/utilities to gensim.corpora #2970
base: develop
Conversation
gensim/corpora/linesentence.py (Outdated)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2013 Zygmunt Zając <[email protected]>
@piskvorky Are these copyright statements legit?
I have taken it from here.
But, even if it was accurate there, this code is from many places & many authors, so it's not appropriate to attribute it all to a single past author.
In a few cases where many have contributed, I've changed the leading comment to read like the following (though it should probably say '2020' upon updates):
Still, it'd be good to have a project standard for how authorship/copyright is declared per file. Saying less makes more sense than saying more that isn't true.
Agreed with @gojomo. I plan to clarify the developer instructions as part of #2948.
TL;DR: authorship cannot be transferred in many jurisdictions (including the EU); it lasts for life. In order for us (maintainers) to be able to publish & modify the code, we need a full assignment of all IP rights from the contributor. That is what we've been doing implicitly, but perhaps we should make it more explicit as part of each contribution PR.
While I've seen more complicated processes, and the State of the Art may have advanced since I last looked, my non-lawyerly hunch is that a low-effort system could be: (1) There's a very-simple rights-assignment statement in the repo; (2) To accept someone's contributions, they must 1st (either before or in their 1st PR) include the addition of their Github username to a list of developers granting such assent, at the bottom of that statement. Thus the same mechanism of contribution collects overt/authenticated expressions of assignment, while building a master-list of contributors as a side-effect.
Sorry, I did not know about these things.
I have changed it to:
# Author: Gensim Contributors
# Copyright (C) 2020 RaRe Technologies s.r.o.
Sorry again.
gensim/corpora/_mmreader.c (Outdated)
@@ -0,0 +1,10902 @@
/* Generated by Cython 0.29.14 */
This and other .c/.cpp files created by Cython compilation should not be added to version control. (In fact, if we think no hand-authored .c/.cpp files will be part of the project, a pattern excluding them could be added to .gitignore.)
Should I also add header files (.h) to .gitignore?
No, some header files are real, like wrap_stdint.h.
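A possible shape for such .gitignore entries, as a sketch only (the exact patterns and paths are assumptions and would need to match the repo layout, and they presume no hand-authored .c/.cpp files exist):

# Exclude Cython-generated C/C++ sources from version control;
# hand-authored headers such as wrap_stdint.h remain tracked as usual.
gensim/**/*.c
gensim/**/*.cpp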
gensim/corpora/taggeddocument.py (Outdated)
from gensim import utils


class TaggedDocument(namedtuple('TaggedDocument', 'words tags')):
TaggedDocument is closely aligned with Doc2Vec, as a description of the 'expected shape' of its training items. It should stay in that file, as it's not a general helper applicable to many corpora/uses.
OTOH, the utility class TaggedLineDocument could stay grouped with other corpora-utility iterables.
I don't think all these small utility iterable classes each deserve their own files/modules. (I especially think this because I'm looking forward to unifying …) It'd make more sense for them to share a single module.
@gojomo I have moved all the classes into corpora.utils, which also includes TaggedDocument, because TaggedBrownCorpus and TaggedLineDocument need TaggedDocument. If I put TaggedDocument in doc2vec it causes an import error, since corpora.utils is imported by doc2vec (for TaggedBrownCorpus and TaggedLineDocument) and doc2vec is imported by corpora.utils (for TaggedDocument). This can be fixed by importing TaggedBrownCorpus and TaggedLineDocument after defining TaggedDocument, like:
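For illustration, a minimal sketch of that deferred-import workaround, assuming TaggedDocument were kept in gensim/models/doc2vec.py and the readers in gensim/corpora/utils.py (a hypothetical layout, not what this PR currently does):

# gensim/models/doc2vec.py (hypothetical sketch of the workaround described above)
from collections import namedtuple


class TaggedDocument(namedtuple('TaggedDocument', 'words tags')):
    """A single document: a list of `words` plus a list of `tags`."""


# Deferred import: placed only after TaggedDocument is defined, so that
# gensim.corpora.utils (which does `from gensim.models.doc2vec import TaggedDocument`)
# finds the name already bound when this line triggers its import.
# Note: this still depends on doc2vec being imported before corpora.utils;
# importing corpora.utils first would hit the partially-initialized module.
from gensim.corpora.utils import TaggedBrownCorpus, TaggedLineDocument  # noqa: E402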
Is there anything that I missed?
gensim/corpora/utils.py (Outdated)
from gensim import utils

try:
    from gensim.models.word2vec_inner import MAX_WORDS_IN_BATCH
There isn't truly any dependency of these utility classes on some other-module Cython optimizations, so this seems off in terms of direction-of-dependency, fragility, exception raised, etc.
Perhaps instead: a reference value is defined with a comment here, then used both below and imported from here into the other modules that want to be in sync on this arbitrary cap? E.g.:
#: Shared capped size, in count of words/tokens, for texts that should not be arbitrarily long
MAX_WORDS = 10000
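A small sketch of that direction of dependency, assuming the constant lives in gensim/corpora/utils.py under the name MAX_WORDS (both the location and the name are just the suggestion above, not settled API):

# In gensim/corpora/utils.py (the suggested single source of truth):
#: Shared capped size, in count of words/tokens, for texts that should not be arbitrarily long
MAX_WORDS = 10000

# In gensim/models/word2vec.py (hypothetical): import the cap from the plain-Python
# utility module, instead of the utility module importing it from a Cython-compiled one.
from gensim.corpora.utils import MAX_WORDS as MAX_WORDS_IN_BATCH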
Good point about the order-of-imports; I suppose that means TaggedDocument & TaggedLineDocument should stay together, here. I prefer this current grouping of these utility-readers/format-classes in corpora.utils. I would however prefer to re-order them so that they're in roughly-related groupings, and roughly in order of prominence. In my mind, that'd be:
Regarding LineSentence & Text8Corpus, something like:

class TokenLines(object):
    """
    Iterable over already-whitespace-tokenized texts in a line-oriented file.
    (Subsumes functionality of former LineSentence & Text8Corpus classes.)
    """
    def __init__(self, source, max_sentence_length=10000, limit=None, max_line_length=1024**2, encoding='utf-8'):
        self.source = source  # file or path
        self.max_sentence_length = max_sentence_length  # split any texts with more tokens
        self.limit = limit  # don't read more than this many lines
        self.max_line_length = max_line_length  # max readline; also once tokens collectively longer, stop extending
        self.encoding = encoding  # used when opening by path

    def __iter__(self):
        try:
            self.source.seek(0)  # if it seems to be a file, assume directly readable
            reader = self.source
        except AttributeError:
            reader = utils.open(self.source, 'r', encoding=self.encoding)
        with reader:
            line_no = 0
            tokens = []
            slop = ''
            line = reader.readline(self.max_line_length)
            while line or tokens:
                tokens += line.split()
                # detect if line may not be complete: mid-token or mid-line
                if len(line) >= self.max_line_length:
                    line_maybe_truncated = line[-1] != '\n'
                    token_maybe_truncated = tokens and line.endswith(tokens[-1])
                else:
                    line_maybe_truncated = token_maybe_truncated = False
                # while (1) tokens overlong; or (2) not at all truncated; or (3) chars overlong: emit text
                while (tokens
                        and ((len(tokens) > self.max_sentence_length)
                             or (not (line_maybe_truncated or token_maybe_truncated))
                             or (sum(len(t) for t in tokens) >= self.max_line_length))):
                    text, tokens = tokens[0:self.max_sentence_length], tokens[self.max_sentence_length:]
                    yield text
                line_no += 1
                if self.limit and line_no >= self.limit:
                    break
                # read next
                line = reader.readline(self.max_line_length)
                # sew up potential token fragment from the previous read
                if token_maybe_truncated:
                    tokens, slop = tokens[:-1], tokens[-1]
                    line = slop + line
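For context, a quick usage sketch of the proposed iterable (the file path is just an example; TokenLines itself is only a proposal in this thread, not merged code):

# Iterate over a whitespace-tokenized, line-oriented file with the sketched class above.
corpus = TokenLines('/tmp/corpus.txt', max_sentence_length=1000, limit=5)
for text in corpus:
    print(text[:10])  # each yielded item is a list of tokens from one (possibly split) line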
gensim/corpora/utils.py (Outdated)
logger = logging.getLogger(__name__)

# Shared capped size, in count of words/tokens, for texts that should not be arbitrarily long
Note that you may specifically need to start this comment with "#:" for it to be picked up by Sphinx/autodoc tools as a bit of documentation for the variable assignment. (See: https://stackoverflow.com/a/20227174/130288)
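For reference, a tiny example of the "#:" convention (the variable name and value here are placeholders):

#: Shared capped size, in count of words/tokens, for texts that should not be arbitrarily long.
MAX_WORDS = 10000  # Sphinx autodoc treats the "#:" comment above as this variable's documentation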
I have fixed that, thank you.
Solves #2956
I have moved the following classes from word2vec and doc2vec:
I have also updated the old locations of these classes referenced in comments (but not everywhere; e.g. the tests still import from the previous location).
I hope this is the correct way to do this.