Releases: SamEdwardes/spacypdfreader

0.3.2

04 Oct 23:39
fa15bd5

Changes

  • Added support for Python 3.8 to 3.12 and all future 3.x versions of Python (#16, #21).
  • Added local testing to test matrix of supported Python versions.
  • Switched from Poetry to uv for managing project dependencies and building the project.
  • Updated dependencies.

Fixes

None

0.3.1

17 Oct 16:16
802ec31

Changes

  • Added support for the page_range argument (#16, #18). For example:

    import spacy
    from spacypdfreader import pdf_reader
    from spacypdfreader.parsers import pytesseract
    
    nlp = spacy.load("en_core_web_sm")
    doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, pytesseract.parser, n_processes=4, page_range=(2, 3))
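
As a rough illustration of what a page_range tuple selects, here is a minimal sketch. It assumes pages are numbered from 1 and that both ends of the tuple are included; expand_page_range is a hypothetical helper written for this note, not part of spacypdfreader.

```python
from typing import List, Optional, Tuple


def expand_page_range(
    page_range: Optional[Tuple[int, int]], total_pages: int
) -> List[int]:
    """Hypothetical helper: list the 1-based page numbers a page_range selects.

    Treating both ends as inclusive is an assumption, not documented
    spacypdfreader behaviour.
    """
    if page_range is None:
        # No range given: assume every page is read.
        return list(range(1, total_pages + 1))
    first, last = page_range
    if not 1 <= first <= last <= total_pages:
        raise ValueError(f"invalid page_range {page_range} for {total_pages} pages")
    return list(range(first, last + 1))


print(expand_page_range((2, 3), total_pages=4))
print(expand_page_range(None, total_pages=4))
```

Under these assumptions, page_range=(2, 3) on a four-page file would select pages 2 and 3 only.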

Fixes

  • Removed shed as a dependency. It was removing imports that it treated as unused but that were in fact required (#17).

0.3.0

18 May 03:32
24f9d86

Changes

  • Added support for multi-processing. For example:

    import spacy
    
    from spacypdfreader.parsers import pytesseract
    from spacypdfreader.spacypdfreader import pdf_reader
    
    nlp = spacy.load("en_core_web_sm")
    # Convert the PDF to a spaCy Doc, using 4 processes
    doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, pytesseract.parser, n_processes=4)
    print(doc._.first_page)
    print(doc._.last_page)
    print(doc[12].text)  # text of the token at index 12
    print(doc[12]._.page_number)  # page that token came from

  • Changed how parsers are implemented: each parser is now a function rather than a class. See https://github.com/SamEdwardes/spacypdfreader/tree/feature/multi-processing/spacypdfreader/parsers for examples.
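
To illustrate the move from classes to functions, here is a minimal sketch of what a function-based parser might look like. The signature (pdf_path, page_number, **kwargs) -> str and the names fake_parser, FAKE_PAGES, and read_pages are assumptions made for illustration; the linked parsers directory shows the real interface.

```python
from typing import Callable, Dict, List

# Stand-in for real PDF content so the sketch runs without any PDF library.
FAKE_PAGES: Dict[int, str] = {1: "First page text.", 2: "Second page text."}


def fake_parser(pdf_path: str, page_number: int, **kwargs) -> str:
    """Hypothetical parser: return the text of one 1-based page."""
    return FAKE_PAGES[page_number]


def read_pages(
    pdf_path: str, parser: Callable[..., str], pages: List[int]
) -> List[str]:
    """Call the parser once per page, as a caller of such a parser might."""
    return [parser(pdf_path, page_number) for page_number in pages]


texts = read_pages("example.pdf", fake_parser, [1, 2])
print(texts)
```

Because a parser in this style is a plain function, it can be passed directly as an argument, much as pytesseract.parser is passed to pdf_reader in the example above.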

Fixes

None

0.2.1

09 Jan 18:01
f995ea1
  • Added examples to the API docs.
  • Added deployment checklist to the docs.

0.2.0

30 Dec 19:12
ed830d4
  • Added support for additional PDF-to-text extraction engines:
  • Added the ability to bring your own PDF-to-text extraction engine.
  • Added new spaCy extension attributes and methods:
    • doc._.page_range
    • doc._.first_page
    • doc._.last_page
    • doc._.pdf_file_name
    • doc._.page(int)
  • Built a new documentation site: https://samedwardes.github.io/spaCyPDFreader/

0.1.1

10 Dec 19:10
ec3bb14

Full Changelog: https://github.com/SamEdwardes/spaCyPDFreader/commits/v0.1.1