Version 0.2.0 (#4)

- Added support for additional pdf to text extraction engines:
    - [pytesseract](https://pypi.org/project/pytesseract/)
    - [textract](https://textract.readthedocs.io/en/stable/index.html)
- Added the ability to bring your own pdf to text extraction engine.
- Added new spacy extension attributes and methods:
    - `doc._.page_range`
    - `doc._.first_page`
    - `doc._.last_page`
    - `doc._.pdf_file_name`
    - `doc._.page(int)`
- Built a new documentation site: [https://samedwardes.github.io/spaCyPDFreader/](https://samedwardes.github.io/spaCyPDFreader/)
SamEdwardes · Dec 30, 2021 · ed830d4
1 parent 1b2b31f
commit ed830d4
Showing 26 changed files with 1,398 additions and 233 deletions.
diff --git a/.github/workflows/build-docs.yml b/.github/workflows/build-docs.yml
@@ -0,0 +1,19 @@
+name: Build documentation
+on:
+  push:
+    branches: 
+      - main
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - uses: actions/setup-python@v2
+        with:
+          python-version: 3.x
+      - run: curl -sSL https://install.python-poetry.org | python3 -
+      - run: poetry export --dev --without-hashes --extras pytesseract --output requirements.txt
+      - run: pip install wheel
+      - run: pip install -r requirements.txt
+      - run: mkdocs gh-deploy --force
diff --git a/.gitignore b/.gitignore
@@ -128,3 +128,7 @@ dmypy.json
 # Pyre type checker
 .pyre/
 .vscode/settings.json
+tmp.py
+
+# Added by Sam Edwardes
+docs/index.md
diff --git a/Makefile b/Makefile
diff --git a/README.md b/README.md
@@ -1,46 +1,81 @@
 # spacypdfreader
 
-Extract text from PDFs using spaCy and capture the page number as a spaCy extension.
+Easy PDF to text to *spaCy* text extraction in Python.
 
-**Links**
+<p>
+    <a href="https://pypi.org/project/spacypdfreader" target="_blank">
+        <img src="https://img.shields.io/pypi/v/spacypdfreader?color=%2334D058&label=pypi%20package" alt="Package version">
+    </a>
+    <a href="https://github.com/SamEdwardes/spaCyPDFreader/actions/workflows/pytest.yml" target="_blank">
+        <img src="https://github.com/SamEdwardes/spaCyPDFreader/actions/workflows/pytest.yml/badge.svg" alt="pytest">
+    </a>
+</p>
 
-- [GitHub](https://github.com/SamEdwardes/spaCyPDFreader)
-- [PyPi](https://pypi.org/project/spacypdfreader/)
+<hr></hr>
 
-**Table of Contents**
+**Documentation:** [https://samedwardes.github.io/spaCyPDFreader/](https://samedwardes.github.io/spaCyPDFreader/)
 
-- [Installation](#installation)
-- [Usage](#usage)
-- [Implementation Notes](#implementation-notes)
-- [API Reference](#api-reference)
+**Source code:** [https://github.com/SamEdwardes/spaCyPDFreader](https://github.com/SamEdwardes/spaCyPDFreader)
+
+**PyPi:** [https://pypi.org/project/spacypdfreader/](https://pypi.org/project/spacypdfreader/)
+
+<hr></hr>
+
+*spacypdfreader* is a Python library for extracting text from PDF documents into *spaCy* `Doc` objects. When you use *spacypdfreader*, the token and doc objects from spaCy are annotated with additional information about the PDF.
+
+The key features are:
+
+- **PDF to spaCy Doc object:** Convert a PDF document directly into a *spaCy* `Doc` object.
+- **Custom spaCy attributes and methods:**
+    - `token._.page_number`
+    - `doc._.page_range`
+    - `doc._.first_page`
+    - `doc._.last_page`
+    - `doc._.pdf_file_name`
+    - `doc._.page(int)`
+- **Multiple parsers:** Select between multiple built-in PDF to text parsers or bring your own PDF to text parser.
 
 ## Installation
 
+Install *spacypdfreader* using pip:
+
 ```bash
 pip install spacypdfreader
 ```
 
-## Usage
+To install with the required pytesseract dependencies:
 
-```python
->>> import spacy
->>> from spacypdfreader import pdf_reader
->>>
->>> nlp = spacy.load("en_core_web_sm")
->>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
-Extracting text from 4 pdf pages... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
+```bash
+pip install 'spacypdfreader[pytesseract]'
 ```
 
-Each token will now have an additional extension `._.page_number` that indicates the pdf page number the token came from.
+## Usage
 
 ```python
->>> [print(f"Token: `{token}`, page number  {token._.page_number}") for token in doc[0:3]]
-Token: `Test`, page number  1
-Token: `PDF`, page number  1
-Token: `01`, page number  1
-[None, None, None]
+import spacy
+from spacypdfreader import pdf_reader
+
+nlp = spacy.load("en_core_web_sm")
+doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
+
+# Get the page number of any token.
+print(doc[0]._.page_number)  # 1
+print(doc[-1]._.page_number) # 4
+
+# Get page metadata about the PDF document.
+print(doc._.pdf_file_name)   # "tests/data/test_pdf_01.pdf"
+print(doc._.page_range)      # (1, 4)
+print(doc._.first_page)      # 1
+print(doc._.last_page)       # 4
+
+# Get all of the text from a specific PDF page.
+print(doc._.page(4))         # "able to display the destination page (unless..."
 ```
 
+## What is *spaCy*?
+
+*spaCy* is a natural language processing (NLP) tool. It can be used to perform a variety of NLP tasks. For more information check out the excellent documentation at [https://spacy.io](https://spacy.io).
+
 ## Implementation Notes
 
 spaCyPDFreader behaves a little differently than your typical [spaCy custom component](https://spacy.io/usage/processing-pipelines#custom-components). Typically a spaCy component should receive and return a `spacy.tokens.Doc` object.
@@ -50,62 +85,22 @@ spaCyPDFreader breaks this convention because the text must first be extracted f
 Example of a "traditional" spaCy pipeline component [negspaCy](https://spacy.io/universe/project/negspacy):
 
 ```python
->>> import spacy
->>> from negspacy.negation import Negex
->>> 
->>> nlp = spacy.load("en_core_web_sm")
->>> nlp.add_pipe("negex", config={"ent_types":["PERSON","ORG"]})
->>> 
->>> doc = nlp("She does not like Steve Jobs but likes Apple products.")
+import spacy
+from negspacy.negation import Negex
+
+nlp = spacy.load("en_core_web_sm")
+nlp.add_pipe("negex", config={"ent_types":["PERSON","ORG"]})
+doc = nlp("She does not like Steve Jobs but likes Apple products.")
 ```
 
 Example of `spaCyPDFreader` usage:
 
 ```python
->>> import spacy
->>> from spacypdfreader import pdf_reader
->>>
->>> nlp = spacy.load("en_core_web_sm")
->>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
-Extracting text from 4 pdf pages... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
-```
-
-Note that the `nlp.add_pipe` is not used by spaCyPDFreader.
-
-## API Reference
-
-### Functions
+import spacy
+from spacypdfreader import pdf_reader
+nlp = spacy.load("en_core_web_sm")
 
-### `spacypdfreader.pdf_reader`
-
-Extract text from PDF files directly into a `spacy.Doc` object while capturing the page number of each token.
-
-| Name        | Type               | Description                                                                                |
-| ------------- | -------------------- | -------------------------------------------------------------------------------------------- |
-| `pdf_path`  | `str`              | Path to a PDF file.                                                                        |
-| `nlp`       | `spacy.Language`   | A spaCy Language object with a loaded pipeline. For example`spacy.load("en_core_web_sm")`. |
-| **RETURNS** | `spacy.tokens.Doc` | A spacy Doc object with the custom extension`._.page_number`.                              |
-
-**Example**
-
-```python
->>> import spacy
->>> from spacypdfreader import pdf_reader
->>>
->>> nlp = spacy.load("en_core_web_sm")
->>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
-Extracting text from 4 pdf pages... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
->>> [print(f"Token: `{token}`, page number  {token._.page_number}") for token in doc[0:3]]
-Token: `Test`, page number  1
-Token: `PDF`, page number  1
-Token: `01`, page number  1
-[None, None, None]
+doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
 ```
 
-### Extensions
-
-When using `spacypdfreader.pdf_reader` a `spacy.tokens.Doc` object with custom extensions is returned.
-
-| Extension   | Type   | Description   | Default   |
-| ------ | ------ | ------ | ------ |
-| token._.page_number |  int      | The PDF page number in which the token was extracted from. The first page is `1`.      |  `None`      |
+Note that the `nlp.add_pipe` is not used by spaCyPDFreader.
diff --git a/docs/api/spacy-extensions.md b/docs/api/spacy-extensions.md
@@ -0,0 +1,39 @@
+# spaCy custom extensions
+
+When using [spacypdfreader.spacypdfreader.pdf_reader][], custom attributes and methods are added to spaCy objects.
+
+## `spacy.Doc` 
+
+### Extension attributes
+
+| Extension   | Type   | Description   |
+| ------ | ------ | ------ |
+| `doc._.pdf_file_name` | `str` | The file name of the PDF document. |
+| `doc._.first_page` | `int` | The first page number of the PDF. |
+| `doc._.last_page` | `int` | The last page number of the PDF. |
+| `doc._.page_range` | `(int, int)` | The range of pages from the PDF. |
+| `doc._.page(int)` | `spacy.Span` | Return the span of text related to the page. |
+
+### Extension methods
+
+#### `Doc._.page`
+
+**Parameters:**
+
+| Name          | Type  | Description                                  | Default    |
+| ------------- | ----- | -------------------------------------------- | ---------- |
+| `page_number` | `int` | The PDF page number of the doc to filter on. | *required* |
+
+**Returns:**
+
+| Type         | Description                                              |
+| ------------ | -------------------------------------------------------- |
+| `spacy.Span` | The span of text from the corresponding PDF page number. |
+
+## `spacy.Token`
+
+### Extension attributes
+
+| Extension   | Type   | Description   |
+| ------ | ------ | ------ |
+| `token._.page_number` | `int` | The PDF page number from which the token was extracted. The first page is `1`. |
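
The tables above describe how the document-level extensions relate to the per-token page number. That relationship can be sketched in plain Python, without spaCy. This is an illustration of the described behavior only, not how spacypdfreader implements it: `PageToken`, `page_range`, and `page` are hypothetical names, and the real `doc._.page(int)` returns a span rather than a string.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PageToken:
    """Hypothetical stand-in for a spaCy token carrying `._.page_number`."""

    text: str
    page_number: int


def page_range(tokens: List[PageToken]) -> Tuple[int, int]:
    """Mimic `doc._.page_range`: (first page, last page)."""
    pages = [t.page_number for t in tokens]
    return (min(pages), max(pages))


def page(tokens: List[PageToken], page_number: int) -> str:
    """Mimic `doc._.page(int)`: join the text of every token on one page."""
    return " ".join(t.text for t in tokens if t.page_number == page_number)


# Made-up tokens spread over two pages.
tokens = [
    PageToken("Test", 1),
    PageToken("PDF", 1),
    PageToken("01", 1),
    PageToken("Second", 2),
    PageToken("page", 2),
]

print(page_range(tokens))  # (1, 2)
print(page(tokens, 2))     # Second page
```

Under this reading, `first_page` and `last_page` are simply the two elements of the page range.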
diff --git a/docs/api/spacypdfreader.parsers.md b/docs/api/spacypdfreader.parsers.md
@@ -0,0 +1,7 @@
+# spacypdfreader.parsers
+
+::: spacypdfreader.parsers.base
+
+::: spacypdfreader.parsers.pdfminer
+
+::: spacypdfreader.parsers.pytesseract
diff --git a/docs/api/spacypdfreader.spacypdfreader.md b/docs/api/spacypdfreader.spacypdfreader.md
@@ -0,0 +1,3 @@
+# spacypdfreader.spacypdfreader
+
+::: spacypdfreader.spacypdfreader
diff --git a/docs/changelog.md b/docs/changelog.md
@@ -0,0 +1,20 @@
+# Changelog
+
+## 0.2.0 (2021-12-10)
+
+- Added support for additional pdf to text extraction engines:
+    - [pytesseract](https://pypi.org/project/pytesseract/)
+    - [textract](https://textract.readthedocs.io/en/stable/index.html)
+- Added the ability to bring your own pdf to text extraction engine.
+- Added new spacy extension attributes and methods:
+    - `doc._.page_range`
+    - `doc._.first_page`
+    - `doc._.last_page`
+    - `doc._.pdf_file_name`
+    - `doc._.page(int)`
+- Built a new documentation site: [https://samedwardes.github.io/spaCyPDFreader/](https://samedwardes.github.io/spaCyPDFreader/)
+
+## 0.1.1 (2021-12-10)
+
+- 0.1.1 Python ^3.7 support by @SamEdwardes in [https://github.com/SamEdwardes/spaCyPDFreader/pull/2](https://github.com/SamEdwardes/spaCyPDFreader/pull/2)
+
diff --git a/docs/contributing.md b/docs/contributing.md
@@ -0,0 +1,25 @@
+# Contributing
+
+## Code style
+
+The black code formatter should be run against all code.
+
+```bash
+black spacypdfreader
+```
+
+## Documentation
+
+Documentation is built using [Material for mkdocs](https://squidfunk.github.io/mkdocs-material/). All of the documentation lives within the `docs/` directory.
+
+### Test the docs locally
+
+To test the docs locally run the following command:
+
+```bash
+mkdocs serve
+```
+
+### Publish the docs
+
+The docs are hosted using GitHub Pages at [https://samedwardes.github.io/spaCyPDFreader/contributing/](https://samedwardes.github.io/spaCyPDFreader/contributing/). Every push to the main branch, including a merged pull request, will trigger a build.
diff --git a/docs/hooks.py b/docs/hooks.py
@@ -0,0 +1,4 @@
+import shutil
+
+def copy_readme(*args, **kwargs):
+    shutil.copy("README.md", "docs/index.md")
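
The hook above uses paths relative to the current working directory, so it presumably has to run from the repository root, and `docs/` must already exist. A minimal sketch that exercises the same function in a throwaway directory; the temporary directory and the README contents are made up for illustration:

```python
import os
import shutil
import tempfile
from pathlib import Path


def copy_readme(*args, **kwargs):
    # Same body as docs/hooks.py: both paths are relative to the working directory.
    shutil.copy("README.md", "docs/index.md")


original_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)
    try:
        Path("README.md").write_text("# spacypdfreader\n")
        Path("docs").mkdir()  # shutil.copy does not create missing directories
        copy_readme()
        copied = Path("docs/index.md").read_text()
    finally:
        os.chdir(original_cwd)  # leave the directory before it is deleted

print(copied == "# spacypdfreader\n")  # True
```

Because `docs/index.md` is generated this way, it makes sense that the same commit adds it to `.gitignore`.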