Version 0.2.0 (#4)
- Added support for additional pdf to text extraction engines:
    - [pytesseract](https://pypi.org/project/pytesseract/)
    - [textract](https://textract.readthedocs.io/en/stable/index.html)
- Added the ability to bring your own pdf to text extraction engine.
- Added new spacy extension attributes and methods:
    - `doc._.page_range`
    - `doc._.first_page`
    - `doc._.last_page`
    - `doc._.pdf_file_name`
    - `doc._.page(int)`
- Built a new documentation site: [https://samedwardes.github.io/spaCyPDFreader/](https://samedwardes.github.io/spaCyPDFreader/)
SamEdwardes authored Dec 30, 2021
1 parent 1b2b31f commit ed830d4
Showing 26 changed files with 1,398 additions and 233 deletions.
19 changes: 19 additions & 0 deletions .github/workflows/build-docs.yml
@@ -0,0 +1,19 @@
name: Build documentation
on:
  push:
    branches:
      - main

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: 3.x
      - run: curl -sSL https://install.python-poetry.org | python3 -
      - run: poetry export --dev --without-hashes --extras pytesseract --output requirements.txt
      - run: pip install wheel
      - run: pip install -r requirements.txt
      - run: mkdocs gh-deploy --force
4 changes: 4 additions & 0 deletions .gitignore
@@ -128,3 +128,7 @@ dmypy.json
# Pyre type checker
.pyre/
.vscode/settings.json
tmp.py

# Added by Sam Edwardes
docs/index.md
3 changes: 0 additions & 3 deletions Makefile

This file was deleted.

143 changes: 69 additions & 74 deletions README.md
@@ -1,46 +1,81 @@
# spacypdfreader

Extract text from PDFs using spaCy and capture the page number as a spaCy extension.
Easy PDF to text to *spaCy* text extraction in Python.

**Links**
<p>
<a href="https://pypi.org/project/spacypdfreader" target="_blank">
<img src="https://img.shields.io/pypi/v/spacypdfreader?color=%2334D058&label=pypi%20package" alt="Package version">
</a>
<a href="https://github.com/SamEdwardes/spaCyPDFreader/actions/workflows/pytest.yml" target="_blank">
<img src="https://github.com/SamEdwardes/spaCyPDFreader/actions/workflows/pytest.yml/badge.svg" alt="pytest">
</a>
</p>

- [GitHub](https://github.com/SamEdwardes/spaCyPDFreader)
- [PyPi](https://pypi.org/project/spacypdfreader/)
<hr></hr>

**Table of Contents**
**Documentation:** [https://samedwardes.github.io/spaCyPDFreader/](https://samedwardes.github.io/spaCyPDFreader/)

- [Installation](#installation)
- [Usage](#usage)
- [Implementation Notes](#implementation-notes)
- [API Reference](#api-reference)
**Source code:** [https://github.com/SamEdwardes/spaCyPDFreader](https://github.com/SamEdwardes/spaCyPDFreader)

**PyPi:** [https://pypi.org/project/spacypdfreader/](https://pypi.org/project/spacypdfreader/)

<hr></hr>

*spacypdfreader* is a Python library for extracting text from PDF documents into *spaCy* `Doc` objects. When you use *spacypdfreader*, the spaCy `Token` and `Doc` objects are annotated with additional information about the PDF.

The key features are:

- **PDF to spaCy Doc object:** Convert a PDF document directly into a *spaCy* `Doc` object.
- **Custom spaCy attributes and methods:**
    - `token._.page_number`
    - `doc._.page_range`
    - `doc._.first_page`
    - `doc._.last_page`
    - `doc._.pdf_file_name`
    - `doc._.page(int)`
- **Multiple parsers:** Select between multiple built in PDF to text parsers or bring your own PDF to text parser.

## Installation

Install *spacypdfreader* using pip:

```bash
pip install spacypdfreader
```

## Usage
To install with the required pytesseract dependencies:

```python
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
Extracting text from 4 pdf pages... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
```bash
pip install 'spacypdfreader[pytesseract]'
```

Each token will now have an additional extension `._.page_number` that indicates the pdf page number the token came from.
## Usage

```python
>>> [print(f"Token: `{token}`, page number {token._.page_number}") for token in doc[0:3]]
Token: `Test`, page number 1
Token: `PDF`, page number 1
Token: `01`, page number 1
[None, None, None]
import spacy
from spacypdfreader import pdf_reader

nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)

# Get the page number of any token.
print(doc[0]._.page_number) # 1
print(doc[-1]._.page_number) # 4

# Get page meta data about the PDF document.
print(doc._.pdf_file_name) # "tests/data/test_pdf_01.pdf"
print(doc._.page_range) # (1, 4)
print(doc._.first_page) # 1
print(doc._.last_page) # 4

# Get all of the text from a specific PDF page.
print(doc._.page(4)) # "able to display the destination page (unless..."
```
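
As a small illustration of what the `token._.page_number` extension makes possible, the sketch below counts tokens per page. The helper name `tokens_per_page` is hypothetical and not part of *spacypdfreader*; stand-in token objects are used so the snippet runs without a PDF, and a real `doc` returned by `pdf_reader` could presumably be passed in the same way.

```python
from collections import Counter
from types import SimpleNamespace


def tokens_per_page(doc):
    """Count tokens on each page via the `token._.page_number` extension."""
    return Counter(token._.page_number for token in doc)


# Stand-in tokens (assumption): each object only mimics `token._.page_number`.
fake_doc = [
    SimpleNamespace(_=SimpleNamespace(page_number=page))
    for page in (1, 1, 2, 4, 4, 4)
]

print(dict(tokens_per_page(fake_doc)))  # {1: 2, 2: 1, 4: 3}
```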

## What is *spaCy*?

*spaCy* is a natural language processing (NLP) tool. It can be used to perform a variety of NLP tasks. For more information check out the excellent documentation at [https://spacy.io](https://spacy.io).

## Implementation Notes

spaCyPDFreader behaves a little differently from your typical [spaCy custom component](https://spacy.io/usage/processing-pipelines#custom-components). Typically a spaCy component should receive and return a `spacy.tokens.Doc` object.
@@ -50,62 +85,22 @@ spaCyPDFreader breaks this convention because the text must first be extracted f
Example of a "traditional" spaCy pipeline component [negspaCy](https://spacy.io/universe/project/negspacy):

```python
>>> import spacy
>>> from negspacy.negation import Negex
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.add_pipe("negex", config={"ent_types":["PERSON","ORG"]})
>>>
>>> doc = nlp("She does not like Steve Jobs but likes Apple products.")
import spacy
from negspacy.negation import Negex

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("negex", config={"ent_types":["PERSON","ORG"]})
doc = nlp("She does not like Steve Jobs but likes Apple products.")
```

Example of `spaCyPDFreader` usage:

```python
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
Extracting text from 4 pdf pages... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
```

Note that the `nlp.add_pipe` is not used by spaCyPDFreader.

## API Reference

### Functions
import spacy
from spacypdfreader import pdf_reader
nlp = spacy.load("en_core_web_sm")

### `spacypdfreader.pdf_reader`

Extract text from PDF files directly into a `spacy.Doc` object while capturing the page number of each token.

| Name | Type | Description |
| ------------- | -------------------- | -------------------------------------------------------------------------------------------- |
| `pdf_path` | `str` | Path to a PDF file. |
| `nlp` | `spacy.Language` | A spaCy Language object with a loaded pipeline. For example`spacy.load("en_core_web_sm")`. |
| **RETURNS** | `spacy.tokens.Doc` | A spacy Doc object with the custom extension`._.page_number`. |

**Example**

```python
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
Extracting text from 4 pdf pages... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
>>> [print(f"Token: `{token}`, page number {token._.page_number}") for token in doc[0:3]]
Token: `Test`, page number 1
Token: `PDF`, page number 1
Token: `01`, page number 1
[None, None, None]
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
```

### Extensions

When using `spacypdfreader.pdf_reader` a `spacy.tokens.Doc` object with custom extensions is returned.

| Extension | Type | Description | Default |
| ------ | ------ | ------ | ------ |
| token._.page_number | int | The PDF page number in which the token was extracted from. The first page is `1`. | `None` |
Note that the `nlp.add_pipe` is not used by spaCyPDFreader.
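
To make the note above more concrete, here is a rough sketch of why a PDF reader has to start from page text rather than from an existing `Doc`: the page number is only known while the text is still split by page. This is not the library's implementation. The function name `read_pages_with_numbers`, the whitespace tokenizer, and the sample page strings are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def read_pages_with_numbers(
    page_texts: List[str],
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[Tuple[str, int]]:
    """Tokenize each page's text and pair every token with its 1-based page number."""
    tagged = []
    for page_number, text in enumerate(page_texts, start=1):
        for token in tokenize(text):
            tagged.append((token, page_number))
    return tagged


# Hypothetical extracted text, one string per PDF page.
pages = ["Test PDF 01", "Second page"]
print(read_pages_with_numbers(pages))
# [('Test', 1), ('PDF', 1), ('01', 1), ('Second', 2), ('page', 2)]
```

A regular pipeline component only ever sees the finished `Doc`, by which point the page boundaries would already be lost.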
39 changes: 39 additions & 0 deletions docs/api/spacy-extensions.md
@@ -0,0 +1,39 @@
# spaCy custom extensions

When using [spacypdfreader.spacypdfreader.pdf_reader][], custom attributes and methods are added to spaCy objects.

## `spacy.Doc`

### Extension attributes

| Extension | Type | Description |
| ------ | ------ | ------ |
| `doc._.pdf_file_name` | `str` | The file name of the PDF document. |
| `doc._.first_page` | `int` | The first page number of the PDF. |
| `doc._.last_page` | `int` | The last page number of the PDF. |
| `doc._.page_range` | `(int, int)` | The range of pages from the PDF. |
| `doc._.page(int)` | `spacy.Span` | Return the span of text related to the page. |

### Extension methods

#### `Doc._.page`

**Parameters:**

| Name | Type | Description | Default |
| ------------- | ----- | -------------------------------------------- | ---------- |
| `page_number` | `int` | The PDF page number of the doc to filter on. | *required* |

**Returns:**

| Type | Description |
| ------------ | -------------------------------------------------------- |
| `spacy.Span` | The span of text from the corresponding PDF page number. |
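
One way to picture how a page lookup like this could work is to find the first and last token index carrying a given page number. The sketch below is an assumption for illustration only, not the code behind `Doc._.page`. The helper names `page_range` and `page_token_bounds` are hypothetical, and a plain list of integers stands in for `[token._.page_number for token in doc]`.

```python
from typing import List, Optional, Tuple


def page_range(page_numbers: List[int]) -> Tuple[int, int]:
    """Return (first_page, last_page) for a non-empty list of per-token page numbers."""
    return (min(page_numbers), max(page_numbers))


def page_token_bounds(page_numbers: List[int], page: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) token indices for `page` (end exclusive), or None if absent."""
    indices = [i for i, p in enumerate(page_numbers) if p == page]
    if not indices:
        return None
    return (indices[0], indices[-1] + 1)


# Stand-in for the per-token page numbers of a four-page document.
pages = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4]

print(page_range(pages))            # (1, 4)
print(page_token_bounds(pages, 3))  # (5, 9)
```

With a real `Doc`, bounds like these would correspond to a slice such as `doc[start:end]`, which is a `spacy.Span`.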

## `spacy.Token`

### Extension attributes

| Extension | Type | Description |
| ------ | ------ | ------ |
| `token._.page_number` | `int` | The PDF page number from which the token was extracted. The first page is `1`. |
7 changes: 7 additions & 0 deletions docs/api/spacypdfreader.parsers.md
@@ -0,0 +1,7 @@
# spacypdfreader.parsers

::: spacypdfreader.parsers.base

::: spacypdfreader.parsers.pdfminer

::: spacypdfreader.parsers.pytesseract
3 changes: 3 additions & 0 deletions docs/api/spacypdfreader.spacypdfreader.md
@@ -0,0 +1,3 @@
# spacypdfreader.spacypdfreader

::: spacypdfreader.spacypdfreader
20 changes: 20 additions & 0 deletions docs/changelog.md
@@ -0,0 +1,20 @@
# Changelog

## 0.2.0 (2021-12-10)

- Added support for additional pdf to text extraction engines:
    - [pytesseract](https://pypi.org/project/pytesseract/)
    - [textract](https://textract.readthedocs.io/en/stable/index.html)
- Added the ability to bring your own pdf to text extraction engine.
- Added new spacy extension attributes and methods:
    - `doc._.page_range`
    - `doc._.first_page`
    - `doc._.last_page`
    - `doc._.pdf_file_name`
    - `doc._.page(int)`
- Built a new documentation site: [https://samedwardes.github.io/spaCyPDFreader/](https://samedwardes.github.io/spaCyPDFreader/)

## 0.1.1 (2021-12-10)

- 0.1.1 Python ^3.7 support by @SamEdwardes in [https://github.com/SamEdwardes/spaCyPDFreader/pull/2](https://github.com/SamEdwardes/spaCyPDFreader/pull/2)

25 changes: 25 additions & 0 deletions docs/contributing.md
@@ -0,0 +1,25 @@
# Contributing

## Code style

The black code formatter should be run against all code.

```bash
black spacypdfreader
```

## Documentation

Documentation is built using [Material for MkDocs](https://squidfunk.github.io/mkdocs-material/). All of the documentation lives within the `docs/` directory.

### Test the docs locally

To test the docs locally run the following command:

```bash
mkdocs serve
```

### Publish the docs

The docs are hosted using GitHub Pages at [https://samedwardes.github.io/spaCyPDFreader/contributing/](https://samedwardes.github.io/spaCyPDFreader/contributing/). Every push to the main branch, including a merged pull request, triggers a build.
4 changes: 4 additions & 0 deletions docs/hooks.py
@@ -0,0 +1,4 @@
import shutil

def copy_readme(*args, **kwargs):
    shutil.copy("README.md", "docs/index.md")