Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move corpus classes/utilities to gensim.corpora #2970

Open
wants to merge 9 commits into
base: develop
Choose a base branch
from

Conversation

rishabhvarshney14
Copy link

@rishabhvarshney14 rishabhvarshney14 commented Oct 1, 2020

Solves #2956

I have moved the following classes from word2vec and doc2vec

  • BrownCorpus
  • TaggedBrownCorpus
  • LineSentence
  • PathLineSentences
  • TaggedDocument
  • TaggedLineDocument
  • Text8Corpus

I have also changed the old location of these files in the comments (but not everywhere like in tests which still import from the last location).

I hope this is the correct method of this.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2013 Zygmunt Zając <[email protected]>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piskvorky Are these copyright statements legit?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have taken it from here .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, even if it was accurate there, this code is from many places & many authors, so it's not appropriate to attribute it all to another person in past time.

In a few cases where many have contributed, I've changed the leading comment to read like (but it should probably say '2020' upon updates):

https://github.com/RaRe-Technologies/gensim/blob/683cebef27a0bf5e9af39d9b5b0a522e62f20862/gensim/models/doc2vec.py#L4-L5

Still, it'd be good to have a project standard for how authorship/copyright is declared per file. Saying less makes more sense than saying more that isn't true.

Copy link
Owner

@piskvorky piskvorky Oct 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @gojomo . I plan to clarify the developer instructions as part of #2948.

TL;DR: authorship cannot be transferred in many legislations (including EU), it's for life. In order for us (maintainers) to be able to publish & modify the code, we need full assignment of all IP rights from the contributor. Which is what we've been doing implicitly but perhaps should make more explicit, as part of each contribution PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I've seen more complicated processes, and the State of the Art may have advanced since I last looked, my non-lawyerly hunch is that a low-effort system could be: (1) There's a very-simple rights-assignment statement in the repo; (2) To accept someone's contributions, they must 1st (either before or in their 1st PR) include the addition of their Github username to a list of developers granting such assent, at the bottom of that statement. Thus the same mechanism of contribution collects overt/authenticated expressions of assignment, while building a master-list of contributors as a side-effect.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I did not know about these things.
I have changed it to

 # Author: Gensim Contributors 
 # Copyright (C) 2020 RaRe Technologies s.r.o.

Sorry again.

@gojomo gojomo changed the title Move copus to corpora Move corpus classes/utilities to gensim.corpora Oct 2, 2020
@@ -0,0 +1,10902 @@
/* Generated by Cython 0.29.14 */
Copy link
Collaborator

@gojomo gojomo Oct 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and other .c/.cpp files created by Cython compilation should not be added to version control. (In fact, if we think no hand-authored .c/.cpp files will be part of the project, a pattern excluding them could be added to .gitignore.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I also add header files(.h) in gitignore?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, some header files are real, like wrap_stdint.h.

from gensim import utils


class TaggedDocument(namedtuple('TaggedDocument', 'words tags')):
Copy link
Collaborator

@gojomo gojomo Oct 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TaggedDocument is closely-aligned with Doc2Vec as a description of the 'expected shape' of its training-items. It should stay in that file, as it's not a general-helper applicable to many corpora/uses.

OTOH, the utility class TaggedLineDocument could stay grouped with other corpora-utility iterables.

@gojomo
Copy link
Collaborator

gojomo commented Oct 3, 2020

I don't think all these small utility iterable classes each deserve their own files/modules. (I especially think this because I'm looking forward to unifying LineSentence & Text8Corpus into one more-generic class that does them both, per my comment on #2956.)

It'd make more sense for them to share a gensim.corpora.utils home, or perhaps tag onto gensim.corpora.textcorpus (if that weren't already 700 lines & focused on a slightly-different corpus style). The cleanup/movement here may need some organization preference judgement from @piskvorky.

@rishabhvarshney14
Copy link
Author

@gojomo I have moved all the classes in corpora.utils which also include TaggedDocument because TaggedBrownCorpus and TaggedLineDocument needed TaggedDocument.

If I put TaggedDocument in doc2vec it causes Importing error since corpora.utils imported by doc2vec (for TaggedBrownCorpus and TaggedLineDocument) and doc2vec imported by corpora.utils (for TaggedDocument).

This can be fixed by importing TaggedBrownCorpus and TaggedLineDocument after defining TaggedDocument like:

class TaggedDocument (...)
from corpora.utils import TaggedBrownCorpus, TaggedLineDocument

Is there anything that I missed?

from gensim import utils

try:
from gensim.models.word2vec_inner import MAX_WORDS_IN_BATCH
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't truly any dependency of these utility classes on some other-module Cython optimizations. sp this seems off in terms of direction-of-dependency, fragility, exception raised, etc.

Perhaps instead: a reference value is defined with a comment here, then used both below and imported from here into the other modules that want to be in-sync on this arbitrary cap? EG:

#: Shared capped size, in count of words/tokens, for texts that should not be arbitrarily long
MAX_WORDS=10000

@gojomo
Copy link
Collaborator

gojomo commented Oct 5, 2020

Good point about the order-of-imports, I suppose that means TaggedDocument & TaggedLineDocument should stay together, here.

I prefer this current grouping of these utility-readers/format-classes in gensim.corpora.utils to their previous locations, or individual files for each, but it would be good to hear others' confirmation that this fits their preferences, too.

I would however prefer to re-order them so that they're in roughly-related groupings, and roughly in order-of-prominence. In my mind, that'd be:

LineSentence
PathLineSentence
Text8Corpus  # (but hopefully subsumed into LineSentence someday soon)
TaggedDocument
TaggedLineSentence
BrownCorpus
TaggedBrownCorpus

@gojomo
Copy link
Collaborator

gojomo commented Oct 7, 2020

Regarding LineSentence, here's some code I was playing with that I think provides a superset of LineSentence & Text8Corpus functionality - in that it can iterate over an existing whitespace-tokenized file, and enforce a max-tokens per text, and will still use bounded memory (and emit max-tokens texts) from a single-giant-line file like text8 or text9. It doesn't necessarily need to be part of this PR, but could be considered afterwards:

class TokenLines(object):
    """
    Iterable over already-whitespace-tokenized texts in a line-oriented file.
    
    (Subsumes functionality of former LineSentence & Text8Corpus classes.)
    """
    def __init__(self, source, max_sentence_length=10000, limit=None, max_line_length=1024**2, encoding='utf-8'):
        self.source = source  # file or path
        self.max_sentence_length = max_sentence_length  # split any texts with more tokens
        self.limit = limit  # don't read more than this many lines
        self.max_line_length = max_line_length  # max readline; also once tokens collectively longer, stop extending
        
    def __iter__(self):
        try:
            self.source.seek(0)  # if it seems to be a file, assume directly readable
            reader = self.source
        except AttributeError:
            reader = utils.open(self.source, 'r', encoding=encoding)
            
        with reader:
            line_no = 0
            tokens = []
            slop= ''
            
            line = reader.readline(self.max_line_length)
            while line or tokens:
                tokens += line.split()
                
                # detect if line maybe not complete, mis-token or mid-line
                if len(line) >= self.max_line_length:
                    line_maybe_truncated = line[-1] != '\n'
                    token_maybe_truncated = tokens and line.endswith(tokens[-1])
                else:
                    line_maybe_truncated = token_maybe_truncated = False
                
                # while (1) tokens overlong; or (2) not at all truncated; or (3) chars overlong = emit text
                while (tokens 
                       and ((len(tokens) > self.max_sentence_length) 
                            or (not (line_maybe_truncated or token_maybe_truncated))
                            or (sum(len(t) for t in tokens) >= self.max_line_length))):
                    text, tokens = tokens[0:self.max_sentence_length], tokens[self.max_sentence_length:]
                    yield text
                    line_no += 1
                    
                if self.limit and line_no >= self.limit:
                    break

                # read next
                line = reader.readline(self.max_line_length)

                # sew-up potential token fragment
                if token_maybe_truncated:
                    tokens, slop = tokens[:-1], tokens[-1]
                    line = slop + line


logger = logging.getLogger(__name__)

# Shared capped size, in count of words/tokens, for texts that should not be arbitrarily long
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that you may specifically need to use a "#: ` to start this comment for it to be picked up by Sphinx/autodoc tools as a bit of documentation for the variable assignment. (See: https://stackoverflow.com/a/20227174/130288)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have fixed that thank you.

@mpenkov mpenkov added the stale Waiting for author to complete contribution, no recent effort label Jun 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Waiting for author to complete contribution, no recent effort
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants