Weasyprint 59.0 incompatibility with `pdfminer.extract_text` #1885

jbpenrath · 2023-05-22T09:10:23Z

Bug Report

Problematic Behavior
Since Weasyprint 59.0, providing a generated file to pdfminer.high_level.extract_text raises a TypeError.
The exception is raised from a utility function called nunpack.
For example:

TypeError: invalid length: 5

P.S.: I don't really know whether the problem should be fixed in Weasyprint or in pdfminer, but as this issue was introduced by Weasyprint 59.0, I am posting it in this repository first.

Expected behavior/code
I should be able to use the extract_text method with a PDF generated through Weasyprint 59.0.

Steps to Reproduce

Run the following script:

from io import BytesIO
from pdfminer.high_level import extract_text
from weasyprint import HTML

html = HTML(string='<h1>Hello world</h1>')
document = html.write_pdf()
extract_text(BytesIO(document)) # 💥 TypeError: invalid length: 6

Using html.write_pdf(uncompressed_pdf=True) currently works around the problem.

Environment

Platform: Docker - python:3.8-slim (Debian 11.7)
Python version: 3.8.16
Weasyprint version: Weasyprint 59.0

Possible Solution

Disable PDF compression, which feels like an odd workaround
Roll back to Weasyprint 58.1, which is also an odd workaround

liZe · 2023-05-22T10:07:54Z

Hi!

Thanks for this detailed bug report.

It may be a problem in the PDFs we generate, but it’s unlikely as none of the PDF validators, converters and readers we’ve tried complain about these PDFs with object streams (and we’ve tried a lot 😄).

The problem is more likely caused by a bug in the way pdfminer handles object streams. Using write_pdf(pdf_version='1.4') works, and that’s the main (only?) difference between PDF 1.4 and 1.5 in the way we generate documents.

From what I understand of pdfminer’s code, the problem is here:
https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/pdfdocument.py#L332-L334

nunpack seems to be designed to handle C-like integers (stored on 1, 2, 4 and 8 bytes), but the number of bytes allowed in the W value of XRef stream dictionaries is not limited to these numbers (as per Table 17 in section 7.5.8.2 of the PDF 2.0 specification).
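To illustrate the mismatch, here is a minimal sketch, not pdfminer's actual code. The names `unpack_c_like` and `split_entry`, and the entry with W = [1, 5, 1], are hypothetical. It shows a decoder restricted to C-like sizes rejecting a 5-byte field that is otherwise a perfectly well-formed entry:

```python
import struct

# Fixed-size unsigned big-endian formats, mirroring C integer widths.
C_LIKE_FORMATS = {1: ">B", 2: ">H", 4: ">L", 8: ">Q"}


def unpack_c_like(data: bytes) -> int:
    """Decode a big-endian unsigned integer, C-like widths only (hypothetical)."""
    fmt = C_LIKE_FORMATS.get(len(data))
    if fmt is None:
        raise TypeError("invalid length: %d" % len(data))
    return struct.unpack(fmt, data)[0]


def split_entry(entry: bytes, widths):
    """Slice one cross-reference stream entry into its three raw fields."""
    w1, w2, w3 = widths
    return entry[:w1], entry[w1:w1 + w2], entry[w1 + w2:w1 + w2 + w3]


# Hypothetical entry with W = [1, 5, 1]: type 1, offset 4660, generation 0.
entry = bytes([1]) + (4660).to_bytes(5, "big") + bytes([0])
kind, offset, generation = split_entry(entry, (1, 5, 1))

print(unpack_c_like(kind))  # 1
try:
    unpack_c_like(offset)
except TypeError as exc:
    print(exc)  # invalid length: 5
```

The 1-byte type field decodes fine; only the 5-byte offset field trips the size check.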

The fix is either to:

make nunpack allow all numbers of bytes,
use something different than nunpack.

A naive fix (that works for this case but may not for other cases, pdfminer’s developers will know better than me) is to change this:
https://github.com/pdfminer/pdfminer.six/blob/5114acdda61205009221ce4ebf2c68c144fc4ee5/pdfminer/utils.py#L352-L363
into this:

return int.from_bytes(s, byteorder='big')
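As a rough sketch of what such a replacement could look like as a standalone function: the name `nunpack_any_width` and the `default` fallback for an empty field are assumptions for illustration, not pdfminer's API.

```python
def nunpack_any_width(data: bytes, default: int = 0) -> int:
    """Decode a big-endian unsigned integer of any byte length.

    Hypothetical stand-in for nunpack; an empty (zero-width) field
    falls back to `default`, which is an assumption of this sketch.
    """
    if not data:
        return default
    return int.from_bytes(data, byteorder="big")


print(nunpack_any_width(b"\x00\x00\x00\x12\x34"))      # 4660 (5 bytes)
print(nunpack_any_width(b"\x01\x00\x00\x00\x00\x00"))  # 1099511627776 (6 bytes)
print(nunpack_any_width(b"", default=1))               # 1
```

Since int.from_bytes accepts any length, the 5-byte and 6-byte fields from the error messages above decode without special cases.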

jbpenrath · 2023-05-23T14:51:48Z

Many thanks for your reply @liZe. I opened an issue on pdfminer, let's see what they think.

liZe · 2023-07-12T08:54:21Z

Let’s continue the discussion in pdfminer/pdfminer.six#886, there’s nothing more we can do here!

This was referenced May 22, 2023

Feat/create document with weasyprint options openfun/marion#163

Merged

TypeError raised by extract_text method with compressed PDF file pdfminer/pdfminer.six#886

Open

liZe closed this as not planned Jul 12, 2023