TypeError raised by extract_text method with compressed PDF file #886

Open
jbpenrath opened this issue May 22, 2023 · 2 comments · May be fixed by #1029
jbpenrath commented May 22, 2023

Bug report

Description

I'm generating PDF documents with Weasyprint. Since version 59.0 of that package, I can no longer extract text from the generated compressed PDF files with the pdfminer.high_level.extract_text method: it raises a TypeError (invalid length). The exception is raised from a utility function called nunpack.

I first opened an issue on the Weasyprint repository, but it appears the source of the problem could be pdfminer itself.

You can take a look at the Weasyprint maintainer's answer to understand how pdfminer is involved in this problem.

Steps to reproduce

from io import BytesIO
from pdfminer.high_level import extract_text
from weasyprint import HTML

html = HTML(string='<h1>Hello world</h1>')
document = html.write_pdf()
extract_text(BytesIO(document)) # 💥 TypeError: invalid length: 6

liZe commented May 23, 2023

Here’s a simple and uncompressed PDF to reproduce the problem, in case you’d like to avoid installing another tool 😄:
hello.pdf

The error is caused by the XRef table with /W [1 4 6]. The third field is encoded using 6 bytes, and it’s decoded here using nunpack, which is not designed to handle all integer sizes.
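
To illustrate the failure mode, here is a minimal sketch. It assumes a decoder that only knows the fixed widths struct supports; the name `fixed_width_unpack` and the sample entry bytes are made up for illustration and are not pdfminer's actual code or the contents of hello.pdf.

```python
import struct

# Hypothetical stand-in for a decoder limited to the widths struct supports.
_FORMATS = {1: ">B", 2: ">H", 4: ">L", 8: ">Q"}

def fixed_width_unpack(data: bytes) -> int:
    """Decode a big-endian unsigned integer, but only for 1, 2, 4 or 8 bytes."""
    fmt = _FORMATS.get(len(data))
    if fmt is None:
        raise TypeError("invalid length: %d" % len(data))
    return struct.unpack(fmt, data)[0]

# One made-up 11-byte entry laid out as /W [1 4 6]:
# a 1-byte field, a 4-byte field, then a 6-byte field.
widths = [1, 4, 6]
entry = bytes([1]) + (1234).to_bytes(4, "big") + (0).to_bytes(6, "big")

# Slice the entry into its three fields according to the widths.
fields, pos = [], 0
for w in widths:
    fields.append(entry[pos:pos + w])
    pos += w

print(fixed_width_unpack(fields[0]))  # 1
print(fixed_width_unpack(fields[1]))  # 1234
try:
    fixed_width_unpack(fields[2])
except TypeError as exc:
    print(exc)  # invalid length: 6
```

The first two fields decode fine; only the 6-byte field has no matching struct format, which is where the TypeError comes from in this sketch.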

Instead of using struct.unpack in nunpack, it may be useful to use int.from_bytes, which works for integers of any byte length.
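
A rough sketch of that idea, assuming the fields are unsigned and big-endian (high-order byte first). The name `nunpack_sketch` and the `default` parameter are illustrative assumptions and not necessarily how pdfminer implements or fixes it:

```python
def nunpack_sketch(data: bytes, default: int = 0) -> int:
    """Decode a big-endian unsigned integer of any byte length.

    Returns `default` for empty input (an assumption made for this sketch).
    """
    if not data:
        return default
    return int.from_bytes(data, byteorder="big", signed=False)


# Works for the 6-byte case that struct has no format for.
print(nunpack_sketch(b"\x00\x00\x00\x00\x01\x00"))  # 256
print(nunpack_sketch(b"\x01\x02\x03"))              # 66051
print(nunpack_sketch(b""))                          # 0
```

Because int.from_bytes takes the length from the bytes object itself, no per-width branching is needed.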

dhdaines commented Aug 1, 2024

fixed in #1029 (and thank you for weasyprint, it is very nice software!)
