Make sure to dereference MediaBox in /Pages #1027

Status: Open (4 commits to merge into base: master)
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -22,6 +22,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - `RecursionError` when corrupt PDF specifies a recursive /Pages object ([#998](https://github.com/pdfminer/pdfminer.six/pull/998))
 - `TypeError` when corrupt PDF specifies text-positioning operators with invalid values ([#1000](https://github.com/pdfminer/pdfminer.six/pull/1000))
 - inline image parsing fails when stream data contains "EI\n" ([#1008](https://github.com/pdfminer/pdfminer.six/issues/1008))
+- `TypeError` when MediaBox is an indirect object reference ([#1004](https://github.com/pdfminer/pdfminer.six/issues/1004))
 
 ### Removed
 
29 changes: 18 additions & 11 deletions pdfminer/pdfpage.py
@@ -67,27 +67,34 @@ def __init__(
         self.resources: Dict[object, object] = resolve1(
             self.attrs.get("Resources", dict()),
         )
-        mediabox_params: List[Any] = [
-            resolve1(mediabox_param) for mediabox_param in self.attrs["MediaBox"]
-        ]
-        self.mediabox = parse_rect(resolve1(mediabox_params))
+        if "MediaBox" in self.attrs:
+            self.mediabox = parse_rect(
+                resolve1(val) for val in resolve1(self.attrs["MediaBox"])
+            )
+        else:
+            log.warning(
+                "MediaBox missing from /Page (and not inherited),"
+                " defaulting to US Letter (612x792)"
+            )
+            self.mediabox = (0, 0, 612, 792)
         self.cropbox = self.mediabox
         if "CropBox" in self.attrs:
             try:
-                self.cropbox = parse_rect(resolve1(self.attrs["CropBox"]))
+                self.cropbox = parse_rect(
+                    resolve1(val) for val in resolve1(self.attrs["CropBox"])
+                )
             except PDFValueError:
-                pass
+                log.warning("Invalid CropBox in /Page, defaulting to MediaBox")
 
         self.rotate = (int_value(self.attrs.get("Rotate", 0)) + 360) % 360
         self.annots = self.attrs.get("Annots")
         self.beads = self.attrs.get("B")
         if "Contents" in self.attrs:
-            contents = resolve1(self.attrs["Contents"])
+            self.contents: List[object] = resolve1(self.attrs["Contents"])
+            if not isinstance(self.contents, list):
+                self.contents = [self.contents]
         else:
-            contents = []
-        if not isinstance(contents, list):
-            contents = [contents]
-        self.contents: List[object] = contents
+            self.contents = []
 
     def __repr__(self) -> str:
         return f"<PDFPage: Resources={self.resources!r}, MediaBox={self.mediabox!r}>"
Binary file added samples/contrib/issue-1004-indirect-mediabox.pdf
Binary file not shown.
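
A minimal sketch of the failure mode the pdfpage.py change addresses. It is not pdfminer.six code: `Ref`, `resolve`, and `read_mediabox` are hypothetical stand-ins for an indirect object reference, `resolve1`, and the MediaBox handling in `PDFPage.__init__`. The assumption is that both the MediaBox array and any of its four entries may be references, so the array has to be resolved before it is iterated.

```python
from typing import Any, Dict, Tuple

Rect = Tuple[float, float, float, float]

# Same fallback size the diff uses when MediaBox is absent.
US_LETTER: Rect = (0.0, 0.0, 612.0, 792.0)


class Ref:
    """Hypothetical stand-in for an indirect object reference."""

    def __init__(self, target: Any) -> None:
        self.target = target


def resolve(obj: Any) -> Any:
    """Follow references until a direct object is reached."""
    while isinstance(obj, Ref):
        obj = obj.target
    return obj


def read_mediabox(attrs: Dict[str, Any]) -> Rect:
    """Return the MediaBox, resolving the array first and then each entry."""
    if "MediaBox" not in attrs:
        return US_LETTER
    box = resolve(attrs["MediaBox"])      # the array itself may be a reference
    values = [resolve(v) for v in box]    # each coordinate may be one as well
    if len(values) != 4:
        raise ValueError(f"expected 4 MediaBox entries, got {len(values)}")
    x0, y0, x1, y1 = (float(v) for v in values)
    return (x0, y0, x1, y1)


if __name__ == "__main__":
    indirect = {"MediaBox": Ref([0, 0, Ref(595), 842])}
    try:
        # Iterating the unresolved reference is what fails with TypeError.
        [resolve(v) for v in indirect["MediaBox"]]
    except TypeError as exc:
        print("unresolved array:", exc)
    print("resolved array:", read_mediabox(indirect))
```

Resolving only the elements, as the removed lines did, never reaches the point of iterating a usable list when the array itself is a reference; resolving the container first covers both cases.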
4 changes: 4 additions & 0 deletions tests/test_tools_pdf2txt.py
@@ -118,6 +118,10 @@ def test_html_simple1(self):
     def test_hocr_simple1(self):
         run("simple1.pdf", "-t hocr")
 
+    def test_contrib_issue_1004_mediabox(self):
+        """Verify that we do not crash when MediaBox is an object reference"""
+        run("contrib/issue-1004-indirect-mediabox.pdf")
+
 
 class TestDumpImages:
     @staticmethod