- Added support for additional pdf to text extraction engines:
    - [pytesseract](https://pypi.org/project/pytesseract/)
    - [textract](https://textract.readthedocs.io/en/stable/index.html)
- Added the ability to bring your own pdf to text extraction engine.
- Added new spacy extension attributes and methods:
    - `doc._.page_range`
    - `doc._.first_page`
    - `doc._.last_page`
    - `doc._.pdf_file_name`
    - `doc._.page(int)`
- Built a new documentation site: [https://samedwardes.github.io/spaCyPDFreader/](https://samedwardes.github.io/spaCyPDFreader/)
1 parent 1b2b31f, commit ed830d4
Showing 26 changed files with 1,398 additions and 233 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
name: Build documentation | ||
on: | ||
  push: | ||
    branches: | ||
      - main | ||
|
||
jobs: | ||
  deploy: | ||
    runs-on: ubuntu-latest | ||
    steps: | ||
      - uses: actions/checkout@v2 | ||
      - uses: actions/setup-python@v2 | ||
        with: | ||
          python-version: 3.x | ||
      - run: curl -sSL https://install.python-poetry.org | python3 - | ||
      - run: poetry export --dev --without-hashes --extras pytesseract --output requirements.txt | ||
      - run: pip install wheel | ||
      - run: pip install -r requirements.txt | ||
      - run: mkdocs gh-deploy --force |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -128,3 +128,7 @@ dmypy.json | |
# Pyre type checker | ||
.pyre/ | ||
.vscode/settings.json | ||
tmp.py | ||
|
||
# Added by Sam Edwardes | ||
docs/index.md |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
# spaCy custom extensions | ||
|
||
When using [spacypdfreader.spacypdfreader.pdf_reader][], custom attributes and methods are added to spaCy objects. | ||
|
||
## `spacy.Doc` | ||
|
||
### Extension attributes | ||
|
||
| Extension | Type | Description | | ||
| ------ | ------ | ------ | | ||
| `doc._.pdf_file_name` | `str` | The file name of the PDF document. | | ||
| `doc._.first_page` | `int` | The first page number of the PDF. | | ||
| `doc._.last_page` | `int` | The last page number of the PDF. | | ||
| `doc._.page_range` | `(int, int)` | The range of pages from the PDF. | | ||
| `doc._.page(int)` | `spacy.Span` | Return the span of text from the given page. | | ||
|
||
### Extension methods | ||
|
||
#### `Doc._.page` | ||
|
||
**Parameters:** | ||
|
||
| Name | Type | Description | Default | | ||
| ------------- | ----- | -------------------------------------------- | ---------- | | ||
| `page_number` | `int` | The PDF page number of the doc to filter on. | *required* | | ||
|
||
**Returns:** | ||
|
||
| Type | Description | | ||
| ------------ | -------------------------------------------------------- | | ||
| `spacy.Span` | The span of text from the corresponding PDF page number. | | ||
|
||
## `spacy.Token` | ||
|
||
### Extension attributes | ||
|
||
| Extension | Type | Description | | ||
| ------ | ------ | ------ | | ||
| `token._.page_number` | `int` | The PDF page number from which the token was extracted. The first page is `1`. | |
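
The relationship between `token._.page_number`, `doc._.page_range`, and `doc._.page(int)` can be sketched in plain Python. This is a minimal illustration only: the `Token` dataclass and the `page_range` and `page` functions below are hypothetical stand-ins that do not use spaCy, and they are not the library's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    """Hypothetical stand-in for a spaCy token carrying `token._.page_number`."""

    text: str
    page_number: int  # 1-based: the first page is 1


def page_range(tokens: List[Token]) -> Tuple[int, int]:
    """Mimic `doc._.page_range`: the (first_page, last_page) pair."""
    return tokens[0].page_number, tokens[-1].page_number


def page(tokens: List[Token], page_number: int) -> List[Token]:
    """Mimic `doc._.page(int)`: keep only the tokens from one page."""
    return [t for t in tokens if t.page_number == page_number]


tokens = [Token("Hello", 1), Token("world", 1), Token("Second", 2), Token("page", 2)]
first_and_last = page_range(tokens)
second_page_text = [t.text for t in page(tokens, 2)]
print(first_and_last)
print(second_page_text)
```

In the real extension the return value is a `spacy.Span` rather than a list; the list here only keeps the sketch free of dependencies.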
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
# spacypdfreader.parsers | ||
|
||
::: spacypdfreader.parsers.base | ||
|
||
::: spacypdfreader.parsers.pdfminer | ||
|
||
::: spacypdfreader.parsers.pytesseract |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# spacypdfreader.spacypdfreader | ||
|
||
::: spacypdfreader.spacypdfreader |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
# Changelog | ||
|
||
## 0.2.0 (2021-12-10) | ||
|
||
- Added support for additional pdf to text extraction engines: | ||
    - [pytesseract](https://pypi.org/project/pytesseract/) | ||
    - [textract](https://textract.readthedocs.io/en/stable/index.html) | ||
- Added the ability to bring your own pdf to text extraction engine. | ||
- Added new spacy extension attributes and methods: | ||
    - `doc._.page_range` | ||
    - `doc._.first_page` | ||
    - `doc._.last_page` | ||
    - `doc._.pdf_file_name` | ||
    - `doc._.page(int)` | ||
- Built a new documentation site: [https://samedwardes.github.io/spaCyPDFreader/](https://samedwardes.github.io/spaCyPDFreader/) | ||
|
||
## 0.1.1 (2021-12-10) | ||
|
||
- 0.1.1 Python ^3.7 support by @SamEdwardes in [https://github.com/SamEdwardes/spaCyPDFreader/pull/2](https://github.com/SamEdwardes/spaCyPDFreader/pull/2) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
# Contributing | ||
|
||
## Code style | ||
|
||
The black code formatter should be run against all code. | ||
|
||
```bash | ||
black spacypdfreader | ||
``` | ||
|
||
## Documentation | ||
|
||
Documentation is built using [Material for MkDocs](https://squidfunk.github.io/mkdocs-material/). All of the documentation lives within the `docs/` directory. | ||
|
||
### Test the docs locally | ||
|
||
To test the docs locally, run the following command: | ||
|
||
```bash | ||
mkdocs serve | ||
``` | ||
|
||
### Publish the docs | ||
|
||
The docs are hosted using GitHub Pages at [https://samedwardes.github.io/spaCyPDFreader/contributing/](https://samedwardes.github.io/spaCyPDFreader/contributing/). Every push to the main branch, including a merged pull request, will trigger a build. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
import shutil | ||
|
||
def copy_readme(*args, **kwargs): | ||
    shutil.copy("README.md", "docs/index.md") |
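
The hook above copies `README.md` to `docs/index.md`, which matches the `docs/index.md` entry added to `.gitignore` in this commit. A minimal sketch of a more defensive variant follows; `copy_readme_to` and its parameters are hypothetical names that are not part of this commit, and the sketch runs in a temporary directory so it does not touch a real repository.

```python
import shutil
import tempfile
from pathlib import Path


def copy_readme_to(readme: Path, docs_dir: Path) -> Path:
    """Copy `readme` to `docs_dir/index.md`, creating `docs_dir` if it is missing."""
    docs_dir.mkdir(parents=True, exist_ok=True)
    target = docs_dir / "index.md"
    shutil.copy(readme, target)
    return target


# Exercise the helper in a throwaway directory rather than the real repository.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "README.md").write_text("# spaCyPDFreader\n", encoding="utf-8")
    copied = copy_readme_to(root / "README.md", root / "docs")
    copied_name = copied.name
    copied_text = copied.read_text(encoding="utf-8")

print(copied_name)
print(copied_text)
```

Taking the paths as parameters is a design choice for testability; the committed hook hard-codes them relative to the working directory.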