Hi omarbenhamid - thank you for creating this issue and looking into the problem. I have never encountered this use case, but your explanation makes sense.
The reason each page is processed as a document is so that spacypdfreader can create the page attributes:
token._.page_number
doc._.page_range
doc._.first_page
doc._.last_page
doc._.pdf_file_name
doc._.page(int)
In your use case - do you still require the page number attributes? I think there are a few options:
Update spacypdfreader so that it re-runs at least part of the NLP pipeline after calling Doc.from_docs, so that the resulting doc has a tensor, without overwriting the page number attributes. (I am not sure yet how to do this, but I imagine it can be done.)
Add a parameter to spacypdfreader.pdf_reader that skips the page number attributes and instead runs the NLP pipeline on the entire text at once. This would give a result similar to your example above.
Please let me know if you have any other ideas or suggestions.
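As a rough illustration of how the second option might still keep page information, here is a minimal sketch in plain Python. It is not how spacypdfreader works today: the helper names `page_start_offsets` and `page_number_for_offset`, and the choice of a newline as the page separator, are assumptions made for this example. The idea is to join the page texts, remember the character offset where each page starts, and later map a token's character offset (in spaCy, `token.idx`) back to a page number.

```python
from bisect import bisect_right
from itertools import accumulate


def page_start_offsets(page_texts, sep="\n"):
    """Return the character offset at which each page starts in sep.join(page_texts)."""
    # Each page occupies its own length plus one separator in the joined text.
    lengths = [len(text) + len(sep) for text in page_texts]
    # The first page starts at 0; each later page starts after all previous ones.
    return [0] + list(accumulate(lengths))[:-1]


def page_number_for_offset(starts, char_offset):
    """Map a character offset in the joined text to a 1-based page number."""
    # bisect_right counts how many pages start at or before this offset.
    return bisect_right(starts, char_offset)


# Hypothetical page texts standing in for text extracted from a PDF.
page_texts = ["First page.", "Second page here.", "Third."]
full_text = "\n".join(page_texts)
starts = page_start_offsets(page_texts)

# With spaCy, full_text would be passed to nlp() once, and token.idx
# (the token's character offset in doc.text) would be used as char_offset.
offset_of_second = full_text.index("Second")
print(page_number_for_offset(starts, offset_of_second))
```

Because the pipeline would run once over the full text, the resulting doc would never pass through Doc.from_docs, so the tensor question would not arise. The trade-off is that sentence or entity spans could cross page boundaries.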
Hello SamEdwardes
I opened a discussion with the team at Explosion about the behaviour of Doc.from_docs. They are considering whether to fix it in spaCy directly.
Hello,
Thank you for this useful library!
The issue
I ran into an issue with the following code:
I get an empty tensor.
Whereas:
Returns the right tensor.
Reason
The issue seems to come from the fact that pdf_reader processes each page as a separate document and then merges them with Doc.from_docs. It turns out that Doc.from_docs does not preserve Doc.tensor (but it is not found).