Weasyprint 59.0 incompatibility with pdfminer.extract_text #1885

Closed
jbpenrath opened this issue May 22, 2023 · 3 comments


jbpenrath commented May 22, 2023

Bug Report

Problematic Behavior
Since Weasyprint 59.0, passing a generated PDF to pdfminer.high_level.extract_text raises a TypeError.
The exception is raised from a utility function called nunpack.
For example:

TypeError: invalid length: 5

P.S.: I don't really know whether the problem should be fixed in Weasyprint or in pdfminer, but as this issue was introduced by Weasyprint 59.0, I am posting it in this repository first.

Expected behavior/code
I should be able to use the extract_text method with a PDF generated through Weasyprint 59.0.

Steps to Reproduce

  1. Run the following script:
from io import BytesIO
from pdfminer.high_level import extract_text
from weasyprint import HTML

html = HTML(string='<h1>Hello world</h1>')
document = html.write_pdf()
extract_text(BytesIO(document)) # 💥 TypeError: invalid length: 6

Using html.write_pdf(uncompressed_pdf=True) currently works around the problem.

Environment

  • Platform: Docker - python:3.8-slim (Debian 11.7)
  • Python version : 3.8.16
  • Weasyprint version: Weasyprint 59.0

Possible Solution

  1. Disable PDF compression, which feels like an odd workaround
  2. Roll back to Weasyprint 58.1, which is also an odd workaround

liZe commented May 22, 2023

Hi!

Thanks for this detailed bug report.

It may be a problem in the PDFs we generate, but it’s unlikely as none of the PDF validators, converters and readers we’ve tried complain about these PDFs with object streams (and we’ve tried a lot 😄).

The problem is more probably caused by a bug in pdfminer in the way object streams are handled. Using write_pdf(pdf_version='1.4') works, and that’s the main (only?) difference between PDF 1.4 and 1.5 in the way we generate documents.

From what I understand from the pdfminer’s code, the problem is here:
https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/pdfdocument.py#L332-L334

nunpack seems to be designed to handle C-like integers (stored on 1, 2, 4 and 8 bytes), but the number of bytes allowed in the W value of XRef stream dictionaries is not limited to these numbers (as per Table 17 in section 7.5.8.2 of the PDF 2.0 specification).
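
To make the limitation concrete, here is a minimal sketch of a decoder restricted to C-like widths. The function `unpack_c_like` and the helper `try_unpack` are hypothetical stand-ins written for illustration, not pdfminer's actual `nunpack`; they only mirror the behaviour described above, reusing the `TypeError` message from the report.

```python
import struct

# Hypothetical stand-in for a decoder limited to C-like integer sizes.
# This is NOT pdfminer's code, only an illustration of the limitation.
_FORMATS = {1: ">B", 2: ">H", 4: ">L", 8: ">Q"}


def unpack_c_like(data: bytes, default: int = 0) -> int:
    """Decode a big-endian unsigned integer stored on 1, 2, 4 or 8 bytes."""
    if not data:
        return default
    fmt = _FORMATS.get(len(data))
    if fmt is None:
        # Any other width (3, 5, 6, 7, ...) is rejected.
        raise TypeError("invalid length: %d" % len(data))
    return struct.unpack(fmt, data)[0]


def try_unpack(data: bytes) -> str:
    """Return the decoded value as text, or the error message on failure."""
    try:
        return str(unpack_c_like(data))
    except TypeError as exc:
        return "TypeError: %s" % exc
```

With this sketch, a 2-byte field decodes normally, while a 5-byte field (a width the W array may legitimately contain) fails with the same kind of error as in the report.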

The fix is either to:

  • make nunpack allow all numbers of bytes,
  • use something different than nunpack.

A naive fix (that works for this case but may not for other cases, pdfminer’s developers will know better than me) is to change this:
https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/utils.py#L352-L363
into this:

return int.from_bytes(s, byteorder='big')
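
The one-line replacement above relies on `int.from_bytes`, which accepts byte strings of any length. Below is a rough sketch of how that could be applied to a single cross-reference stream entry. The helper names `unpack_any_width` and `split_xref_entry` and the example `W` value of `[1, 5, 2]` are assumptions made for illustration; none of them comes from pdfminer or from an actual Weasyprint file, and the defaults for zero-width fields are simplified.

```python
from typing import List, Sequence


def unpack_any_width(data: bytes, default: int = 0) -> int:
    """Decode a big-endian unsigned integer of any byte length."""
    if not data:
        return default
    return int.from_bytes(data, byteorder="big")


def split_xref_entry(entry: bytes, widths: Sequence[int]) -> List[int]:
    """Split one xref stream entry into its fields using the W widths.

    Hypothetical helper: a zero-width first field (the entry type) is
    assumed to default to 1, and other zero-width fields to 0.
    """
    if len(entry) != sum(widths):
        raise ValueError("entry length does not match the sum of widths")
    fields = []
    offset = 0
    for index, width in enumerate(widths):
        chunk = entry[offset:offset + width]
        default = 1 if index == 0 else 0
        fields.append(unpack_any_width(chunk, default))
        offset += width
    return fields


# Example entry: type 1, a 5-byte offset of 4096, generation 0.
example_entry = b"\x01" + (4096).to_bytes(5, "big") + b"\x00\x00"
example_fields = split_xref_entry(example_entry, [1, 5, 2])
```

Whether this is appropriate for every code path that calls `nunpack` is something pdfminer's maintainers would need to confirm.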

jbpenrath (author) commented

Many thanks for your reply @liZe. I opened an issue with pdfminer; let's see their point of view.


liZe commented Jul 12, 2023

Let’s continue the discussion in pdfminer/pdfminer.six#886, there’s nothing more we can do here!

liZe closed this as not planned on Jul 12, 2023