Multi processing (#14)
SamEdwardes authored May 18, 2023
1 parent 6ac15ac commit 24f9d86
Showing 21 changed files with 759 additions and 321 deletions.
20 changes: 10 additions & 10 deletions .github/workflows/pytest.yml
Original file line number Diff line number Diff line change
@@ -9,14 +9,15 @@ on:
push:
branches:
- main
workflow_dispatch:

jobs:
build:

runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.7", "3.8", "3.9"]
python-version: ["3.8", "3.9", "3.10", "3.11"]

steps:
- uses: actions/checkout@v2
@@ -26,16 +27,15 @@ jobs:
python-version: ${{ matrix.python-version }}
- name: Install OCR and PDF dependencies
run: |
sudo apt-get install -y poppler-utils
sudo apt install tesseract-ocr -y
sudo apt install libtesseract-dev -y
- name: Install python dependencies
sudo apt-get update
sudo apt-get install -y poppler-utils tesseract-ocr libtesseract-dev
- name: Setup poetry
run: |
python -m pip install --upgrade pip wheel setuptools
curl -sSL https://install.python-poetry.org | python3 -
poetry export --without-hashes --dev --extras pytesseract --output requirements.txt
python -m pip install --upgrade pip
pip install wheel
pip install -r requirements.txt
- name: Install dependencies
run: |
poetry install --all-extras
- name: Test with pytest
run: |
pytest
poetry run pytest
17 changes: 10 additions & 7 deletions README.md
@@ -59,23 +59,24 @@ pip install 'spacypdfreader[pytesseract]'

```python
import spacy

from spacypdfreader import pdf_reader

nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)

# Get the page number of any token.
print(doc[0]._.page_number) # 1
print(doc[-1]._.page_number) # 4
print(doc[-1]._.page_number) # 4

# Get page meta data about the PDF document.
print(doc._.pdf_file_name) # "tests/data/test_pdf_01.pdf"
print(doc._.page_range) # (1, 4)
print(doc._.first_page) # 1
print(doc._.last_page) # 4
print(doc._.pdf_file_name) # "tests/data/test_pdf_01.pdf"
print(doc._.page_range) # (1, 4)
print(doc._.first_page) # 1
print(doc._.last_page) # 4

# Get all of the text from a specific PDF page.
print(doc._.page(4)) # "able to display the destination page (unless..."
print(doc._.page(4)) # "able to display the destination page (unless..."
```

## What is *spaCy*?
@@ -95,15 +96,17 @@ import spacy
from negspacy.negation import Negex

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("negex", config={"ent_types":["PERSON","ORG"]})
nlp.add_pipe("negex", config={"ent_types": ["PERSON", "ORG"]})
doc = nlp("She does not like Steve Jobs but likes Apple products.")
```

Example of `spaCyPDFreader` usage:

```python
import spacy

from spacypdfreader import pdf_reader

nlp = spacy.load("en_core_web_sm")

doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
6 changes: 2 additions & 4 deletions docs/api/spacypdfreader.parsers.md
@@ -1,7 +1,5 @@
# spacypdfreader.parsers

::: spacypdfreader.parsers.base.BaseParser
::: spacypdfreader.parsers.pdfminer

::: spacypdfreader.parsers.pdfminer.PdfminerParser

::: spacypdfreader.parsers.pytesseract.PytesseractParser
::: spacypdfreader.parsers.pytesseract
26 changes: 26 additions & 0 deletions docs/changelog.md
@@ -1,5 +1,31 @@
# Changelog

## 0.3.0 (2023-05-17)

**Changes**

- Added support for multi-processing. For example:

```python
import spacy

from spacypdfreader.parsers import pytesseract
from spacypdfreader.spacypdfreader import pdf_reader

nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, pytesseract.parser, n_processes=4)
print(doc._.first_page)
print(doc._.last_page)
print(doc[12].text)
print(doc[12]._.page_number)
```

- Changed the way in which parsers are implemented. They are now implemented with a function as opposed to a class. See <https://github.com/SamEdwardes/spacypdfreader/tree/feature/multi-processing/spacypdfreader/parsers> for examples.

**Fixes**

None

## 0.2.1 (2022-01-09)

- Added examples to the API docs.
18 changes: 8 additions & 10 deletions docs/contributing.md
@@ -5,22 +5,22 @@
Before merging changes into main the following must be completed:

- [ ] Bump the version number in *pyproject.toml* and *spacypdfreader.__init__.py*
- [ ] Format the code: `black spacypdfreader`
- [ ] Run pytest: `pytest`
- [ ] Check the docs locally: `mkdocs serve`
- [ ] Format the code: `just format`
- [ ] Run pytest: `just test`
- [ ] Check the docs locally: `just preview-docs`

After merging the pull request:

- [ ] Create a new release on GitHub
- [ ] Publish latest docs to GitHub pages: `rm -rf site; mkdocs build; mkdocs gh-deploy;`
- [ ] Publish latest package to PyPi: `poetry publish --build`
- [ ] Publish latest docs to GitHub pages: `just publish-docs`
- [ ] Publish latest package to PyPi: `just publish`

## Code style

The black code formatter should be run against all code.

```bash
black spacypdfreader
just format
```

## Documentation
@@ -32,7 +32,7 @@ Documentation is built using [Material for mkdocs](https://squidfunk.github.io/m
To test the docs locally run the following command:

```bash
mkdocs serve
just preview-docs
```

### Publish the docs
@@ -42,7 +42,5 @@ The docs are hosted on using GitHub pages at [https://samedwardes.github.io/spaC
Run the following to update the docs:

```bash
rm -rf site
mkdocs build
mkdocs gh-deploy
just publish-docs
```
3 changes: 2 additions & 1 deletion docs/hooks.py
@@ -1,4 +1,5 @@
import shutil


def copy_readme(*args, **kwargs):
    shutil.copy("README.md", "docs/index.md")
    shutil.copy("README.md", "docs/index.md")
74 changes: 74 additions & 0 deletions docs/multiprocessing.md
@@ -0,0 +1,74 @@
# Multiprocessing

As of version `0.3.0`, spacypdfreader has built-in support for multi-processing. This can dramatically reduce the time it takes to convert a PDF to text.

## Usage

You can use multiprocessing with any parser.

**pdfminer**

```python
import spacy

from spacypdfreader.spacypdfreader import pdf_reader

nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, n_processes=4)
```

**pytesseract**

```python
import spacy

from spacypdfreader.parsers import pytesseract
from spacypdfreader.spacypdfreader import pdf_reader

nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, pytesseract.parser, n_processes=4)
```

## Benchmark

```python
import time
from functools import wraps

import spacy

from spacypdfreader import pdf_reader
from spacypdfreader.parsers import pytesseract


def timeit(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"Took {end - start:.6f} seconds to complete")
        return result

    return wrapper


nlp = spacy.load("en_core_web_sm")
file_name = "tests/data/wikipedia.pdf"


@timeit
def bench(n_processes):
    doc = pdf_reader(file_name, nlp, pytesseract.parser, n_processes=n_processes)
    return doc


# With no multiprocessing
bench(None)
# Took 42.286371 seconds to complete

# With multiprocessing
bench(8)
# Took 9.051591 seconds to complete
```
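
To put the two timings reported above in perspective, the short sketch below computes the speedup and the per-process efficiency. It only redoes the arithmetic on the numbers printed in the benchmark; the variable names are illustrative, and results on other machines or PDFs will differ.

```python
# Timings copied from the benchmark output above (seconds).
serial_seconds = 42.286371
parallel_seconds = 9.051591
n_processes = 8

# Speedup: how many times faster the multi-process run was.
speedup = serial_seconds / parallel_seconds
# Efficiency: fraction of the ideal n_processes-fold speedup achieved.
efficiency = speedup / n_processes

print(f"Speedup: {speedup:.2f}x")  # Speedup: 4.67x
print(f"Parallel efficiency: {efficiency:.0%}")  # Parallel efficiency: 58%
```

In this run, 8 processes gave roughly a 4.7x speedup rather than 8x, which is typical once process start-up and per-page overhead are included.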

51 changes: 13 additions & 38 deletions docs/parsers.md
@@ -41,6 +41,7 @@ No action required, *pdfminer* will automatically be installed when you install

```python
import spacy

from spacypdfreader import pdf_reader

nlp = spacy.load("en_core_web_sm")
@@ -51,13 +52,12 @@ You could also be more verbose and pass in additional parameters. For a list of

```python
import spacy

from spacypdfreader import pdf_reader
from spacypdfreader.parsers.pdfminer import PdfminerParser

nlp = spacy.load("en_core_web_sm")
params = {
"caching": False
}
params = {"caching": False}
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, PdfminerParser, **params)
```

@@ -81,6 +81,7 @@ To use *pytesseract* you must pass the *pytesseract* parser into the `pdf_parser

```python
import spacy

from spacypdfreader import pdf_reader
from spacypdfreader.parsers.pytesseract import PytesseractParser

@@ -90,42 +91,16 @@ doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, PytesseractParser)

## Bring your own parser

*spacypdfreader* allows you to bring your own custom PDF parser. The only requirement is that the parser must have a way for you to specify which page of the PDF document you would like to extract.

The code below demonstrates the implementation of a new custom parser:

```python
from typing import Any

import spacy
from pdfminer.high_level import extract_text

from spacypdfreader import pdf_reader
from spacypdfreader.parsers.base import BaseParser # (1)


class CustomParser(BaseParser):  # (2)
    name: str = "custom"  # (3)

    def pdf_to_text(self, **kwargs: Any) -> str:  # (4)
        # pdfminer uses zero indexed page numbers. Therefore need to remove 1
        # from the page count.
        self.page_number -= 1
        text = extract_text(self.pdf_path, page_numbers=[self.page_number], **kwargs)
        return text


nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, CustomParser)
print(doc._.page_range) # (1, 4)
```
*spacypdfreader* allows you to bring your own custom PDF parser. For examples of how to implement one, refer to:

1. `BaseParser` is the base class that all parsers inherit from in *spacypdfreader*.
2. When creating a new class it must inherit from the `BaseParser` class.
3. The new class must have a `name` attribute.
4. The new class must have a method called `pdf_to_text`. This method should only convert one pdf page at a time.
- <https://github.com/SamEdwardes/spacypdfreader/blob/main/spacypdfreader/parsers/pdfminer.py>, or
- <https://github.com/SamEdwardes/spacypdfreader/blob/main/spacypdfreader/parsers/pytesseract.py>.

To work with spacypdfreader a parser must be a function that:

!!! note
- Has an argument named `pdf_path`.
- Has an argument named `page_number`. This argument should use *1 based indexing*. E.g. the value 1 refers to the first page of the PDF.
- Returns the text for only a single page of the PDF. This allows spacypdfreader to execute faster with multi-processing.

*spacypdfreader* uses "1 based indexing". The first page of the PDF is considered page 1, as opposed to page 0.
!!! warning
Version `0.3.0` changed how parsers are implemented. If you have created a custom parser that works with an older version of spacypdfreader it will need to be reimplemented.
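
The function-based parser requirements listed above can be sketched as follows. This is a minimal illustration, not the library's own implementation: `custom_parser` is a hypothetical name, and the body assumes pdfminer's `extract_text` (which uses zero-based page numbers) as the underlying extractor. The commented-out call assumes a custom function is passed to `pdf_reader` in the same position as `pytesseract.parser` in the examples.

```python
def custom_parser(pdf_path: str, page_number: int) -> str:
    """Hypothetical custom parser: return the text of one PDF page."""
    # spacypdfreader uses 1-based indexing, so page 1 is the first page.
    if page_number < 1:
        raise ValueError("page_number uses 1-based indexing and must be >= 1")

    # Imported lazily so that defining the function does not require pdfminer.
    from pdfminer.high_level import extract_text

    # pdfminer uses zero-based page numbers, so subtract 1.
    return extract_text(pdf_path, page_numbers=[page_number - 1])


# Assumed usage, mirroring how `pytesseract.parser` is passed in the docs:
# doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp, custom_parser, n_processes=4)
```

Defining the parser as a plain module-level function that handles one page per call is what lets the pages be distributed across processes.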
20 changes: 20 additions & 0 deletions justfile
@@ -0,0 +1,20 @@
format:
    poetry run shed

test:
    poetry run pytest
    poetry run pytest --doctest-modules spacypdfreader/

test-gha:
    gh workflow run pytest.yml --ref $(git branch --show-current)

preview-docs:
    poetry run mkdocs serve

publish-docs:
    rm -rf site
    mkdocs build
    mkdocs gh-deploy

publish:
    poetry publish --build
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -68,6 +68,7 @@ extra:
nav:
- Home: 'index.md'
- parsers.md
- multiprocessing.md
- changelog.md
- contributing.md
- API reference: