Multi processing (#14)

SamEdwardes · May 18, 2023 · 24f9d86
1 parent 6ac15ac
commit 24f9d86

Showing 21 changed files with 759 additions and 321 deletions.
diff --git a/.github/workflows/pytest.yml b/.github/workflows/pytest.yml
@@ -9,14 +9,15 @@ on:
   push:
     branches: 
       - main
+  workflow_dispatch:
 
 jobs:
   build:
 
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.7", "3.8", "3.9"]
+        python-version: ["3.8", "3.9", "3.10", "3.11"]
 
     steps:
     - uses: actions/checkout@v2
@@ -26,16 +27,15 @@ jobs:
         python-version: ${{ matrix.python-version }}
     - name: Install OCR and PDF dependencies
       run: |
-        sudo apt-get install -y poppler-utils
-        sudo apt install tesseract-ocr -y
-        sudo apt install libtesseract-dev -y
-    - name: Install python dependencies
+        sudo apt-get update
+        sudo apt-get install -y poppler-utils tesseract-ocr libtesseract-dev
+    - name: Setup poetry
       run: |
+        python -m pip install --upgrade pip wheel setuptools
         curl -sSL https://install.python-poetry.org | python3 -
-        poetry export --without-hashes --dev --extras pytesseract --output requirements.txt
-        python -m pip install --upgrade pip
-        pip install wheel
-        pip install -r requirements.txt
+    - name: Install dependencies
+      run: |
+        poetry install --all-extras
     - name: Test with pytest
       run: |
-        pytest
+        poetry run pytest
diff --git a/README.md b/README.md
@@ -59,23 +59,24 @@ pip install 'spacypdfreader[pytesseract]'
 
 ```python
 import spacy
+
 from spacypdfreader import pdf_reader
 
 nlp = spacy.load("en_core_web_sm")
 doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
 
 # Get the page number of any token.
 print(doc[0]._.page_number)  # 1
-print(doc[-1]._.page_number) # 4
+print(doc[-1]._.page_number)  # 4
 
 # Get page meta data about the PDF document.
-print(doc._.pdf_file_name)   # "tests/data/test_pdf_01.pdf"
-print(doc._.page_range)      # (1, 4)
-print(doc._.first_page)      # 1
-print(doc._.last_page)       # 4
+print(doc._.pdf_file_name)  # "tests/data/test_pdf_01.pdf"
+print(doc._.page_range)  # (1, 4)
+print(doc._.first_page)  # 1
+print(doc._.last_page)  # 4
 
 # Get all of the text from a specific PDF page.
-print(doc._.page(4))         # "able to display the destination page (unless..."
+print(doc._.page(4))  # "able to display the destination page (unless..."
 ```
 
 ## What is *spaCy*?
@@ -95,15 +96,17 @@ import spacy
 from negspacy.negation import Negex
 
 nlp = spacy.load("en_core_web_sm")
-nlp.add_pipe("negex", config={"ent_types":["PERSON","ORG"]})
+nlp.add_pipe("negex", config={"ent_types": ["PERSON", "ORG"]})
 doc = nlp("She does not like Steve Jobs but likes Apple products.")
 ```
 
 Example of `spaCyPDFreader` usage:
 
 ```python
 import spacy
+
 from spacypdfreader import pdf_reader
+
 nlp = spacy.load("en_core_web_sm")
 
 doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)

diff --git a/docs/api/spacypdfreader.parsers.md b/docs/api/spacypdfreader.parsers.md
@@ -1,7 +1,5 @@
 # spacypdfreader.parsers
 
-::: spacypdfreader.parsers.base.BaseParser
+::: spacypdfreader.parsers.pdfminer
 
-::: spacypdfreader.parsers.pdfminer.PdfminerParser
-
-::: spacypdfreader.parsers.pytesseract.PytesseractParser
+::: spacypdfreader.parsers.pytesseract
diff --git a/docs/changelog.md b/docs/changelog.md
@@ -1,5 +1,31 @@
 # Changelog
 
+## 0.3.0 (2023-05-17)
+
+**Changes**
+
+- Added support for multi-processing. For example:
+
+    ```python
+    import spacy
+
+    from spacypdfreader.parsers import pytesseract
+    from spacypdfreader.spacypdfreader import pdf_reader
+
+    nlp = spacy.load("en_core_web_sm")
+    doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, pytesseract.parser, n_processes=4)
+    print(doc._.first_page)
+    print(doc._.last_page)
+    print(doc[12].text)
+    print(doc[12]._.page_number)
+    ```
+
+- Changed how parsers are implemented: each parser is now a function rather than a class. See <https://github.com/SamEdwardes/spacypdfreader/tree/feature/multi-processing/spacypdfreader/parsers> for examples.
+
+**Fixes**
+
+None
+
 ## 0.2.1 (2022-01-09)
 
 - Added examples to the API docs.

diff --git a/docs/contributing.md b/docs/contributing.md
@@ -5,22 +5,22 @@
 Before merging changes into main the following must be completed:
 
 - [ ] Bump the version number in *pyproject.toml* and *spacypdfreader.__init__.py*
-- [ ] Format the code: `black spacypdfreader`
-- [ ] Run pytest: `pytest`
-- [ ] Check the docs locally: `mkdocs serve`
+- [ ] Format the code: `just format`
+- [ ] Run pytest: `just test`
+- [ ] Check the docs locally: `just preview-docs`
 
 After merging the pull request:
 
 - [ ] Create a new release on GitHub
-- [ ] Publish latest docs to GitHub pages: `rm -rf site; mkdocs build; mkdocs gh-deploy;`
-- [ ] Publish latest package to PyPi: `poetry publish --build`
+- [ ] Publish latest docs to GitHub pages: `just publish-docs`
+- [ ] Publish latest package to PyPi: `just publish`
 
 ## Code style
 
 The black code formatter should be run against all code.
 
 ```bash
-black spacypdfreader
+just format
 ```
 
 ## Documentation
@@ -32,7 +32,7 @@ Documentation is built using [Material for mkdocs](https://squidfunk.github.io/m
 To test the docs locally run the following command:
 
 ```bash
-mkdocs serve
+just preview-docs
 ```
 
 ### Publish the docs
@@ -42,7 +42,5 @@ The docs are hosted on using GitHub pages at [https://samedwardes.github.io/spaC
 Run the following to update the docs:
 
 ```bash
-rm -rf site
-mkdocs build
-mkdocs gh-deploy
+just publish-docs
 ```
diff --git a/docs/hooks.py b/docs/hooks.py
@@ -1,4 +1,5 @@
 import shutil
 
+
 def copy_readme(*args, **kwargs):
-    shutil.copy("README.md", "docs/index.md")
+    shutil.copy("README.md", "docs/index.md")
diff --git a/docs/multiprocessing.md b/docs/multiprocessing.md
@@ -0,0 +1,74 @@
+# Multiprocessing
+
+As of version `0.3.0`, spacypdfreader has built-in support for multiprocessing. This can substantially reduce the time it takes to convert a PDF to text: in the benchmark below, parsing dropped from about 42 seconds to about 9 seconds with 8 processes.
+
+## Usage
+
+You can use multiprocessing with any parser.
+
+**pdfminer**
+
+```python
+import spacy
+
+from spacypdfreader.spacypdfreader import pdf_reader
+
+nlp = spacy.load("en_core_web_sm")
+doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, n_processes=4)
+```
+
+**pytesseract**
+
+```python
+import spacy
+
+from spacypdfreader.parsers import pytesseract
+from spacypdfreader.spacypdfreader import pdf_reader
+
+nlp = spacy.load("en_core_web_sm")
+doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, pytesseract.parser, n_processes=4)
+```
+
+## Benchmark
+
+```python
+import time
+from functools import wraps
+
+import spacy
+
+from spacypdfreader import pdf_reader
+from spacypdfreader.parsers import pytesseract
+
+
+def timeit(func):
+    @wraps(func)
+    def wrapper(*args, **kwargs):
+        start = time.perf_counter()
+        result = func(*args, **kwargs)
+        end = time.perf_counter()
+        print(f"Took {end - start:.6f} seconds to complete")
+        return result
+
+    return wrapper
+
+
+nlp = spacy.load("en_core_web_sm")
+file_name = "tests/data/wikipedia.pdf"
+
+
+@timeit
+def bench(n_processes):
+    doc = pdf_reader(file_name, nlp, pytesseract.parser, n_processes=n_processes)
+    return doc
+
+
+# With no multiprocessing
+bench(None)
+# Took 42.286371 seconds to complete
+
+# With multiprocessing
+bench(8)
+# Took 9.051591 seconds to complete
+```
+
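
Reviewer note: the two timings in the benchmark comments above can be turned into a speedup figure. The snippet below is a small worked calculation rather than part of the commit. It assumes the numbers come from a single run each, and "parallel efficiency" here is the usual definition (speedup divided by the number of processes), not a metric the docs themselves report.

```python
# Timings copied from the comments in the benchmark above.
serial_seconds = 42.286371   # bench(None), no multiprocessing
parallel_seconds = 9.051591  # bench(8), eight processes
n_processes = 8

# Speedup: how many times faster the parallel run finished.
speedup = serial_seconds / parallel_seconds

# Parallel efficiency: the fraction of the ideal 8x speedup actually achieved.
efficiency = speedup / n_processes

print(f"Speedup: {speedup:.2f}x")                 # Speedup: 4.67x
print(f"Parallel efficiency: {efficiency:.0%}")   # Parallel efficiency: 58%
```

The gap between 4.67x and the ideal 8x is expected for this kind of workload, since starting worker processes and assembling the final `Doc` add overhead; the exact figures will vary by machine and by PDF.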
diff --git a/docs/parsers.md b/docs/parsers.md
@@ -41,6 +41,7 @@ No action required, *pdfminer* will automatically be installed when you install
 
 ```python
 import spacy
+
 from spacypdfreader import pdf_reader
 
 nlp = spacy.load("en_core_web_sm")
@@ -51,13 +52,12 @@ You could also be more verbose and pass in additional parameters. For a list of
 
 ```python
 import spacy
+
 from spacypdfreader import pdf_reader
 from spacypdfreader.parsers.pdfminer import PdfminerParser
 
 nlp = spacy.load("en_core_web_sm")
-params = {
-  "caching": False
-}
+params = {"caching": False}
 doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, PdfminerParser, **params)
 ```
 
@@ -81,6 +81,7 @@ To use *pytesseract* you must pass the *pytesseract* parser into the `pdf_parser
 
 ```python
 import spacy
+
 from spacypdfreader import pdf_reader
 from spacypdfreader.parsers.pytesseract import PytesseractParser
 
@@ -90,42 +91,16 @@ doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, PytesseractParser)
 
 ## Bring your own parser
 
-*spacypdfreader* allows your to bring your custom PDF parser. The only requirement is that the parser must have a way for you to specify which page of the PDF document you would like to extract.
-
-The code below demonstrates the implementation of a new custom parser:
-
-```python
-from typing import Any
-
-import spacy
-from pdfminer.high_level import extract_text
-
-from spacypdfreader import pdf_reader
-from spacypdfreader.parsers.base import BaseParser # (1)
-
-
-class CustomParser(BaseParser): # (2)
-    name: str = "custom" # (3)
-
-    def pdf_to_text(self, **kwargs: Any) -> str: # (4)
-        # pdfminer uses zero indexed page numbers. Therefore need to remove 1
-        # from the page count.
-        self.page_number -= 1
-        text = extract_text(self.pdf_path, page_numbers=[self.page_number], **kwargs)
-        return text
-
-
-nlp = spacy.load("en_core_web_sm")
-doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, CustomParser)
-print(doc._.page_range)  # (1, 4)
-```
+*spacypdfreader* allows you to bring your own custom PDF parser. For examples of how to implement your own parser, refer to:
 
-1. `BaseParser` is the base class that all parsers inherit from in *spacypdfreader*.
-2. When creating a new class it must inherit from the `BaseParser` class.
-3. The new class must have a `name` attribute.
-4. The new class must have a method called `pdf_to_text`. This method should only convert one pdf page at a time.
+- <https://github.com/SamEdwardes/spacypdfreader/blob/main/spacypdfreader/parsers/pdfminer.py>, or
+- <https://github.com/SamEdwardes/spacypdfreader/blob/main/spacypdfreader/parsers/pytesseract.py>.
 
+To work with spacypdfreader a parser must be a function that:
 
-!!! note
+- Has an argument named `pdf_path`.
+- Has an argument named `page_number`. This argument should use *1 based indexing*. E.g. the value 1 refers to the first page of the PDF.
+- Returns the text of a single page of the PDF only. Returning one page at a time allows spacypdfreader to execute faster with multiprocessing.
 
-    *spacypdfreader* uses "1 based indexing". The first page of the PDF is considered page 1, as opposed to page 0.
+!!! warning
+    Version `0.3.0` changed how parsers are implemented. If you have created a custom parser that works with an older version of spacypdfreader, it will need to be reimplemented as a function.
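
Reviewer note: the parser requirements listed in the new docs/parsers.md text can be illustrated with a toy function. This is a minimal sketch under stated assumptions, not the library's own parser. `form_feed_parser` is a hypothetical name, the `**kwargs` catch-all is an assumption (the docs only name `pdf_path` and `page_number`), and the "document" is a plain-text file with pages separated by form-feed characters so that the example runs without a PDF backend.

```python
import tempfile
from pathlib import Path


def form_feed_parser(pdf_path: str, page_number: int, **kwargs) -> str:
    """Hypothetical parser: return the text of exactly one page.

    `page_number` uses 1-based indexing, as the docs require. The input is a
    plain-text stand-in for a PDF, with pages separated by form feeds.
    """
    if page_number < 1:
        raise ValueError("page_number uses 1-based indexing")
    pages = Path(pdf_path).read_text(encoding="utf-8").split("\f")
    # Convert the 1-based page number into a 0-based list index.
    return pages[page_number - 1]


# Build a three-page stand-in document and read it back one page at a time.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("first page\fsecond page\fthird page", encoding="utf-8")
    page_texts = [form_feed_parser(str(sample), n) for n in (1, 2, 3)]

print(page_texts)  # ['first page', 'second page', 'third page']
```

Because each call depends only on its own `pdf_path` and `page_number`, calls for different pages are independent of one another, which is the property the docs say makes multiprocessing effective. A real parser would replace the file-reading lines with a call into a PDF library; the linked `pdfminer.py` and `pytesseract.py` modules remain the authoritative examples.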
diff --git a/justfile b/justfile
@@ -0,0 +1,20 @@
+format:
+    poetry run shed
+
+test:
+    poetry run pytest
+    poetry run pytest --doctest-modules spacypdfreader/
+
+test-gha:
+    gh workflow run pytest.yml --ref $(git branch --show-current)
+
+preview-docs:
+    poetry run mkdocs serve
+
+publish-docs:
+    rm -rf site
+    mkdocs build
+    mkdocs gh-deploy
+
+publish:
+    poetry publish --build
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -68,6 +68,7 @@ extra:
 nav:
   - Home: 'index.md'
   - parsers.md
+  - multiprocessing.md
   - changelog.md
   - contributing.md
   - API reference: