Move corpus classes/utilities to gensim.corpora #2970

rishabhvarshney14 · 2020-10-01T09:19:15Z

I have moved the following classes from word2vec and doc2vec

BrownCorpus
TaggedBrownCorpus
LineSentence
PathLineSentences
TaggedDocument
TaggedLineDocument
Text8Corpus

I have also changed the old location of these files in the comments (but not everywhere like in tests which still import from the last location).

I hope this is the correct method of this.

mpenkov · 2020-10-01T13:54:07Z

gensim/corpora/linesentence.py

+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+#
+# Copyright (C) 2013 Zygmunt Zając <[email protected]>


@piskvorky Are these copyright statements legit?

I have taken it from here .

But, even if it was accurate there, this code is from many places & many authors, so it's not appropriate to attribute it all to another person in past time.

In a few cases where many have contributed, I've changed the leading comment to read like (but it should probably say '2020' upon updates):

https://github.com/RaRe-Technologies/gensim/blob/683cebef27a0bf5e9af39d9b5b0a522e62f20862/gensim/models/doc2vec.py#L4-L5

Still, it'd be good to have a project standard for how authorship/copyright is declared per file. Saying less makes more sense than saying more that isn't true.

Agreed with @gojomo . I plan to clarify the developer instructions as part of #2948.

TL;DR: authorship cannot be transferred in many legislations (including EU), it's for life. In order for us (maintainers) to be able to publish & modify the code, we need full assignment of all IP rights from the contributor. Which is what we've been doing implicitly but perhaps should make more explicit, as part of each contribution PR.

While I've seen more complicated processes, and the State of the Art may have advanced since I last looked, my non-lawyerly hunch is that a low-effort system could be: (1) There's a very-simple rights-assignment statement in the repo; (2) To accept someone's contributions, they must 1st (either before or in their 1st PR) include the addition of their Github username to a list of developers granting such assent, at the bottom of that statement. Thus the same mechanism of contribution collects overt/authenticated expressions of assignment, while building a master-list of contributors as a side-effect.

Sorry I did not know about these things.
I have changed it to

# Author: Gensim Contributors # Copyright (C) 2020 RaRe Technologies s.r.o.

Sorry again.

gojomo · 2020-10-03T02:09:33Z

gensim/corpora/_mmreader.c

@@ -0,0 +1,10902 @@
+/* Generated by Cython 0.29.14 */


This and other .c/.cpp files created by Cython compilation should not be added to version control. (In fact, if we think no hand-authored .c/.cpp files will be part of the project, a pattern excluding them could be added to .gitignore.)

Should I also add header files(.h) in gitignore?

No, some header files are real, like wrap_stdint.h.

gojomo · 2020-10-03T02:18:40Z

gensim/corpora/taggeddocument.py

+from gensim import utils
+
+
+class TaggedDocument(namedtuple('TaggedDocument', 'words tags')):


TaggedDocument is closely-aligned with Doc2Vec as a description of the 'expected shape' of its training-items. It should stay in that file, as it's not a general-helper applicable to many corpora/uses.

OTOH, the utility class TaggedLineDocument could stay grouped with other corpora-utility iterables.

gojomo · 2020-10-03T02:25:47Z

I don't think all these small utility iterable classes each deserve their own files/modules. (I especially think this because I'm looking forward to unifying LineSentence & Text8Corpus into one more-generic class that does them both, per my comment on #2956.)

It'd make more sense for them to share a gensim.corpora.utils home, or perhaps tag onto gensim.corpora.textcorpus (if that weren't already 700 lines & focused on a slightly-different corpus style). The cleanup/movement here may need some organization preference judgement from @piskvorky.

rishabhvarshney14 · 2020-10-05T12:43:29Z

@gojomo I have moved all the classes in corpora.utils which also include TaggedDocument because TaggedBrownCorpus and TaggedLineDocument needed TaggedDocument.

If I put TaggedDocument in doc2vec it causes Importing error since corpora.utils imported by doc2vec (for TaggedBrownCorpus and TaggedLineDocument) and doc2vec imported by corpora.utils (for TaggedDocument).

This can be fixed by importing TaggedBrownCorpus and TaggedLineDocument after defining TaggedDocument like:

class TaggedDocument (...)
from corpora.utils import TaggedBrownCorpus, TaggedLineDocument

Is there anything that I missed?

gojomo · 2020-10-05T15:20:56Z

gensim/corpora/utils.py

+from gensim import utils
+
+try:
+    from gensim.models.word2vec_inner import MAX_WORDS_IN_BATCH


There isn't truly any dependency of these utility classes on some other-module Cython optimizations. sp this seems off in terms of direction-of-dependency, fragility, exception raised, etc.

Perhaps instead: a reference value is defined with a comment here, then used both below and imported from here into the other modules that want to be in-sync on this arbitrary cap? EG:

#: Shared capped size, in count of words/tokens, for texts that should not be arbitrarily long MAX_WORDS=10000

gojomo · 2020-10-05T15:30:22Z

Good point about the order-of-imports, I suppose that means TaggedDocument & TaggedLineDocument should stay together, here.

I prefer this current grouping of these utility-readers/format-classes in gensim.corpora.utils to their previous locations, or individual files for each, but it would be good to hear others' confirmation that this fits their preferences, too.

I would however prefer to re-order them so that they're in roughly-related groupings, and roughly in order-of-prominence. In my mind, that'd be:

LineSentence
PathLineSentence
Text8Corpus  # (but hopefully subsumed into LineSentence someday soon)
TaggedDocument
TaggedLineSentence
BrownCorpus
TaggedBrownCorpus

gojomo · 2020-10-07T03:41:25Z

Regarding LineSentence, here's some code I was playing with that I think provides a superset of LineSentence & Text8Corpus functionality - in that it can iterate over an existing whitespace-tokenized file, and enforce a max-tokens per text, and will still use bounded memory (and emit max-tokens texts) from a single-giant-line file like text8 or text9. It doesn't necessarily need to be part of this PR, but could be considered afterwards:

class TokenLines(object):
    """
    Iterable over already-whitespace-tokenized texts in a line-oriented file.
    
    (Subsumes functionality of former LineSentence & Text8Corpus classes.)
    """
    def __init__(self, source, max_sentence_length=10000, limit=None, max_line_length=1024**2, encoding='utf-8'):
        self.source = source  # file or path
        self.max_sentence_length = max_sentence_length  # split any texts with more tokens
        self.limit = limit  # don't read more than this many lines
        self.max_line_length = max_line_length  # max readline; also once tokens collectively longer, stop extending
        
    def __iter__(self):
        try:
            self.source.seek(0)  # if it seems to be a file, assume directly readable
            reader = self.source
        except AttributeError:
            reader = utils.open(self.source, 'r', encoding=encoding)
            
        with reader:
            line_no = 0
            tokens = []
            slop= ''
            
            line = reader.readline(self.max_line_length)
            while line or tokens:
                tokens += line.split()
                
                # detect if line maybe not complete, mis-token or mid-line
                if len(line) >= self.max_line_length:
                    line_maybe_truncated = line[-1] != '\n'
                    token_maybe_truncated = tokens and line.endswith(tokens[-1])
                else:
                    line_maybe_truncated = token_maybe_truncated = False
                
                # while (1) tokens overlong; or (2) not at all truncated; or (3) chars overlong = emit text
                while (tokens 
                       and ((len(tokens) > self.max_sentence_length) 
                            or (not (line_maybe_truncated or token_maybe_truncated))
                            or (sum(len(t) for t in tokens) >= self.max_line_length))):
                    text, tokens = tokens[0:self.max_sentence_length], tokens[self.max_sentence_length:]
                    yield text
                    line_no += 1
                    
                if self.limit and line_no >= self.limit:
                    break

                # read next
                line = reader.readline(self.max_line_length)

                # sew-up potential token fragment
                if token_maybe_truncated:
                    tokens, slop = tokens[:-1], tokens[-1]
                    line = slop + line

gojomo · 2020-10-07T21:23:32Z

gensim/corpora/utils.py

+
+logger = logging.getLogger(__name__)
+
+# Shared capped size, in count of words/tokens, for texts that should not be arbitrarily long


Note that you may specifically need to use a "#: ` to start this comment for it to be picked up by Sphinx/autodoc tools as a bit of documentation for the variable assignment. (See: https://stackoverflow.com/a/20227174/130288)

I have fixed that thank you.

rishabhvarshney14 added 3 commits October 1, 2020 14:33

Moving corpus to corpora

d9ac923

Moving corpus to corpora

759fb33

corpustocorpora

521ba0f

mpenkov reviewed Oct 1, 2020

View reviewed changes

gojomo changed the title ~~Move copus to corpora~~ Move corpus classes/utilities to gensim.corpora Oct 2, 2020

rishabhvarshney14 added 2 commits October 2, 2020 17:15

movecorpus

2696640

fixed docs

f97c2c6

gojomo reviewed Oct 3, 2020

View reviewed changes

moved classes to corpora utils

4bc62d2

gojomo reviewed Oct 5, 2020

View reviewed changes

reorder classes

c009379

gojomo reviewed Oct 7, 2020

View reviewed changes

rishabhvarshney14 added 2 commits October 8, 2020 19:42

changed comment for autodoc tools

e937136

resolved conflicts

a67dcd8

mpenkov added the stale Waiting for author to complete contribution, no recent effort label Jun 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move corpus classes/utilities to gensim.corpora #2970

Move corpus classes/utilities to gensim.corpora #2970

rishabhvarshney14 commented Oct 1, 2020 •

edited

Loading

mpenkov Oct 1, 2020

rishabhvarshney14 Oct 1, 2020

gojomo Oct 3, 2020

piskvorky Oct 3, 2020 •

edited

Loading

gojomo Oct 3, 2020

rishabhvarshney14 Oct 3, 2020

gojomo Oct 3, 2020 •

edited

Loading

rishabhvarshney14 Oct 3, 2020

mpenkov Oct 3, 2020

gojomo Oct 3, 2020 •

edited

Loading

gojomo commented Oct 3, 2020

rishabhvarshney14 commented Oct 5, 2020

gojomo Oct 5, 2020

gojomo commented Oct 5, 2020

gojomo commented Oct 7, 2020

gojomo Oct 7, 2020

rishabhvarshney14 Oct 8, 2020

		from gensim import utils


		class TaggedDocument(namedtuple('TaggedDocument', 'words tags')):


		logger = logging.getLogger(__name__)

		# Shared capped size, in count of words/tokens, for texts that should not be arbitrarily long

Move corpus classes/utilities to gensim.corpora #2970

Are you sure you want to change the base?

Move corpus classes/utilities to gensim.corpora #2970

Conversation

rishabhvarshney14 commented Oct 1, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

piskvorky Oct 3, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gojomo Oct 3, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gojomo Oct 3, 2020 • edited Loading

Choose a reason for hiding this comment

gojomo commented Oct 3, 2020

rishabhvarshney14 commented Oct 5, 2020

Choose a reason for hiding this comment

gojomo commented Oct 5, 2020

gojomo commented Oct 7, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rishabhvarshney14 commented Oct 1, 2020 •

edited

Loading

piskvorky Oct 3, 2020 •

edited

Loading

gojomo Oct 3, 2020 •

edited

Loading

gojomo Oct 3, 2020 •

edited

Loading