Loss of token/document tensor at least with PDFMiner #9

Open
omarbenhamid opened this issue Apr 1, 2022 · 2 comments

@omarbenhamid

Hello,

Thank you for this useful library!

The issue

I ran into the following issue with this code:

import spacy
from spacypdfreader import pdf_reader

nlp = spacy.load("fr_core_news_sm")
doc = pdf_reader('9.PADD_SCOT RM.pdf', nlp)
doc.tensor

I get an empty tensor.

Whereas:

import spacy
from pdfminer import high_level

nlp = spacy.load("fr_dep_news_trf")
doc = nlp(high_level.extract_text(path))
doc.tensor

Returns the right tensor.

Reason

The issue seems to come from the fact that pdf_reader processes each page as a separate document and then merges them with Doc.from_docs. It turns out that Doc.from_docs does not preserve Doc.tensor (I could not find this documented anywhere).
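
A minimal sketch of the effect and of a manual workaround, assuming a blank English pipeline and hand-made placeholder tensors (neither comes from the report above). Whether Doc.from_docs carries the tensor over may depend on the spaCy version, so the sketch only restacks the per-page tensors when the merged tensor does not have one row per token.

```python
import numpy
import spacy
from spacy.tokens import Doc

# Assumption: a blank pipeline stands in for fr_core_news_sm, so no model download is needed.
nlp = spacy.blank("en")
pages = [nlp("First page text."), nlp("Second page.")]

# Assumption: placeholder tensors with 4 columns stand in for real tok2vec output.
for page in pages:
    page.tensor = numpy.ones((len(page), 4), dtype="float32")

merged = Doc.from_docs(pages)

# If the merged tensor does not have one row per token, rebuild it by hand.
if merged.tensor.shape[0] != len(merged):
    merged.tensor = numpy.vstack([page.tensor for page in pages])

print(merged.tensor.shape)
```

This only restores the token-level tensor; any other data that the merge drops would need the same kind of manual handling.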

@SamEdwardes
Owner

Hi omarbenhamid - thank you for creating this issue and looking into the problem. I have never encountered this use case, but your explanation makes sense.

The reason each page is processed as a document is so that spacypdfreader can create the page attributes:

  • token._.page_number
  • doc._.page_range
  • doc._.first_page
  • doc._.last_page
  • doc._.pdf_file_name
  • doc._.page(int)

In your use case - do you still require the page number attributes? I think there are a few options:

  1. Update spacypdfreader so that it re-runs at least some of the NLP pipeline after using Doc.from_docs so that the doc object has a tensor, but without overwriting the page number attribute (I am not sure yet how to actually do this, but I imagine it can be done)
  2. Add a parameter to spacypdfreader.pdf_reader that skips adding the page number attributes and instead runs the NLP pipeline on the entire text at once. This would give a result similar to your example above.
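
One possible middle ground between the two options, as a rough sketch only: run the pipeline once on the joined text, then derive each token's page number from its character offset. The helper names (join_pages, page_numbers_for) and the page separator are hypothetical and not part of spacypdfreader.

```python
from bisect import bisect_right

PAGE_SEPARATOR = "\n\n"  # hypothetical choice of separator between pages


def join_pages(page_texts):
    """Join per-page texts and return (full_text, start offset of each page)."""
    starts = []
    offset = 0
    for text in page_texts:
        starts.append(offset)
        offset += len(text) + len(PAGE_SEPARATOR)
    return PAGE_SEPARATOR.join(page_texts), starts


def page_numbers_for(doc, starts):
    """Return a 1-based page number per token, using each token's .idx offset."""
    return [bisect_right(starts, token.idx) for token in doc]


# Intended use with spaCy (not executed here):
#   full_text, starts = join_pages(page_texts)
#   doc = nlp(full_text)          # one pass, so doc.tensor comes from the pipeline
#   numbers = page_numbers_for(doc, starts)
```

The resulting list could then be copied into a custom token extension. Because the document is processed in a single pass, the tensor never goes through Doc.from_docs.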

Please let me know if you have any other ideas or suggestions.

SamEdwardes self-assigned this Apr 1, 2022
SamEdwardes added the enhancement label (New feature or request) Apr 1, 2022
omarbenhamid changed the title from "Loss of token/document at least with PDFMiner" to "Loss of token/document tensor at least with PDFMiner" Apr 5, 2022
@omarbenhamid
Author

Hello SamEdwardes,
I opened a discussion with the team at Explosion about the behaviour of Doc.from_docs. They are considering whether to fix it in spaCy directly.

The discussion is here: explosion/spaCy#10597

Let's wait and see if they come up with a solution.

On my side, I worked around the issue by using PDFMiner directly, but that means I lose the page information.
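
Some page information may still be recoverable with this workaround. The sketch below assumes that the text returned by high_level.extract_text ends each page with a form feed character, which is worth checking against the installed PDFMiner version. split_pages is a hypothetical helper, not part of PDFMiner.

```python
FORM_FEED = "\x0c"  # assumption: PDFMiner's text output ends each page with this


def split_pages(raw_text):
    """Split extracted text into a list of per-page strings."""
    pages = raw_text.split(FORM_FEED)
    # A trailing form feed leaves one empty string at the end; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


# Intended use (not executed here):
#   from pdfminer import high_level
#   pages = split_pages(high_level.extract_text(path))
```

The position of each page in the returned list gives its page number, counting from zero.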
