[Bug]: Large file size increases due to PDF/A font substitution #1369

Open
ferdiga opened this issue Aug 3, 2024 · 9 comments

@ferdiga

ferdiga commented Aug 3, 2024

Describe the bug

In my case, creating a PDF/A increased the file size by a factor of 500!

  • IMO, I identified the culprit: gs cannot handle mixed portrait and landscape pages well.
    After separating portrait and landscape pages into 2 separate files, ocrmypdf performed extremely well and reduced the file size of each file. THIS WAS TRUE FOR ONE SET OF TYPICAL FILES - BUT NOT FOR OTHERS

  • A solution could be to split each file into single page files, run ocrmypdf (and hence gs) on each and put these together again? - DOES NOT SOLVE THE PROBLEM

Steps to reproduce

1. Run ocrmypdf -v --skip-text input.pdf output.pdf
BTW I tried many other parameters - output all about the same size
gs took minutes to create the multi MB files.

Files

Here are the JSON representations produced with qpdf --json <>:
Monatsbericht zum 30.06.2023-json.pdf
Monatsbericht zum 30.06.2023-ocr-json.pdf

the log file
ocrmypdf.log
encrypted original file
test.zip

shows the size after ocrmypdf
20240803 100232 ocr_test

How did you download and install the software?

Homebrew

OCRmyPDF version

16.4.2

Relevant log output

No response

@ferdiga ferdiga added the triage Issue needs triage label Aug 3, 2024
@ferdiga
Author

ferdiga commented Aug 3, 2024

BTW, running ocrmypdf --force-ocr on the big (corrupted) file reduced the size significantly again, and OCR was available again.

@jbarlow83
Collaborator

jbarlow83 commented Aug 5, 2024

I ran your test file and ended up with an acceptable file size increase of 15.5% (66k -> 76k) instead of the dramatic increase you reported. I have the same dependency versions you do, although it looks like you are using macOS and I am using Linux + Homebrew, which should still be very close.

Would you mind running ocrmypdf -k -v --skip-text input.pdf output.pdf, zipping/encrypting the generated temporary folder, and uploading it here? Alternatively, if you can examine the temporary folder and identify which file "blew up" in file size, that might help me figure out what happened.
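For the second option, a minimal sketch of how the kept temporary folder could be inspected. The function name `largest_files` and its `top` parameter are illustrative and not part of OCRmyPDF; the only assumption is that the folder retained by `-k` is passed in as a path.

```python
from pathlib import Path


def largest_files(folder, top=10):
    """Return (size_in_bytes, path) pairs for the biggest files under folder.

    Hypothetical helper, not part of OCRmyPDF: it only walks the directory
    and sorts by file size, largest first.
    """
    sizes = [
        (path.stat().st_size, path)
        for path in Path(folder).rglob("*")
        if path.is_file()
    ]
    return sorted(sizes, key=lambda item: item[0], reverse=True)[:top]
```

Calling `largest_files()` on the temporary folder and printing the pairs should put the file that grew unexpectedly at the top of the list.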

@ferdiga
Author

ferdiga commented Aug 5, 2024

Darwin Ferdi-MacBook-Air.local 23.5.0 Darwin
Kernel Version 23.5.0: Wed May 1 20:16:51 PDT 2024;
root:xnu-10063.121.3~5/RELEASE_ARM64_T8103 arm64

gs --version
10.03.1

blow up: pdfa.ps (9KB) > pdfa.pdf (32.8MB)
whereas pdfa.ps seems too small to include all the information.

20240805 111738 ocrmypdf io 1vjrrij4

the debug.log
debug.log

and YES, CLI on Macbook is not trivial.

@gamer191

@ferdiga I obviously don't have access to the file, since it's encrypted. Regardless, may I suggest attempting to load pdfa.ps in aeroplane mode? If it's 9KB, I suspect it contains external assets (I don't know if that's possible for .ps files, but technically it could be a .pdf file with the wrong extension), which I think OCRmyPDF embeds if you use PDF/A.

@jbarlow83
Collaborator

pdfa.ps just provides a sRGB ICC profile and a little metadata which Ghostscript requires and does not provide for PDF/A conversion. It's not an issue.

@jbarlow83 jbarlow83 added need test file and removed triage Issue needs triage labels Aug 14, 2024
@jbarlow83
Collaborator

I confirmed that the issue is reproducible on macOS but not Linux. It's almost certainly a Ghostscript issue at this point -- some discrepancy between their Linux and macOS versions.

@jbarlow83 jbarlow83 added third party issue Problem with a third party dependency and removed need test file labels Aug 14, 2024
@jbarlow83
Collaborator

@ferdiga Do I have permission to share your encrypted file with Artifex (Ghostscript's maintainer), or do you want to report the issue to them yourself?

For your input file, Ghostscript on macOS causes a dramatic increase in file size while Ghostscript on Linux does not, both at version 10.03.1, with a command line like:

['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/var/folders/b7/yf4jd29d4qg_nr4cxmh__yrh0000gn/T/ocrmypdf.io.1vjrrij4/pdfa.ps', '/var/folders/b7/yf4jd29d4qg_nr4cxmh__yrh0000gn/T/ocrmypdf.io.1vjrrij4/fix_docinfo.pdf']

Provide pdfa.ps and fix_docinfo.pdf as generated by OCRmyPDF using the -k option.
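To compare the two platforms, the command could be replayed from a small script. This is a sketch under assumptions: the helper names `build_gs_args` and `convert_and_report` are made up for illustration, and the output goes to a file via `-o` instead of to stdout as in the logged command (so `-sstdout=%stderr` is dropped). All other switches are copied from the command above.

```python
import subprocess
from pathlib import Path


def build_gs_args(pdfa_ps, input_pdf, output_pdf):
    """Mirror the switches from the logged command, writing to output_pdf."""
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-dCompatibilityLevel=1.6", "-sDEVICE=pdfwrite",
        "-dAutoRotatePages=/None",
        "-sColorConversionStrategy=LeaveColorUnchanged",
        "-dPDFSTOPONERROR",
        "-dAutoFilterColorImages=true", "-dAutoFilterGrayImages=true",
        "-dJPEGQ=95", "-dPDFA=2", "-dPDFACompatibilityPolicy=1",
        # Assumption: write to a file rather than stdout for easy size checks.
        "-o", str(output_pdf),
        str(pdfa_ps), str(input_pdf),
    ]


def convert_and_report(pdfa_ps, input_pdf, output_pdf):
    """Run Ghostscript and return (input size, output size, growth ratio)."""
    subprocess.run(build_gs_args(pdfa_ps, input_pdf, output_pdf), check=True)
    before = Path(input_pdf).stat().st_size
    after = Path(output_pdf).stat().st_size
    return before, after, after / before
```

Running `convert_and_report("pdfa.ps", "fix_docinfo.pdf", "out.pdf")` on both macOS and Linux with the same two input files should show whether the growth ratio differs between platforms.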

@ferdiga
Author

ferdiga commented Aug 14, 2024

@ferdiga Do I have permission to share your encrypted file with Artifex (Ghostscript's maintainer), or do you want to report the issue to them yourself?

Yes, please

@jbarlow83
Collaborator

jbarlow83 commented Aug 14, 2024

I figured out what was going on. Ghostscript is working correctly under the circumstances - no need to report to Artifex.

The provided PDF does not include all of its fonts and PDF/A conversion requires all fonts to be provided (PDF/A must be fully self-describing). On macOS, this means ~35 MB of fonts get inserted into the file; on Linux and elsewhere, the fonts that happen to be picked for substitution are smaller. (The macOS fonts don't look any better, fwiw. I imagine it has more to do with bundling large Unicode character sets.)
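Whether a given input is affected can be checked before conversion by listing fonts that are not embedded. The sketch below assumes Poppler's `pdffonts` tool is installed and that its table output has the usual `emb sub uni object ID` columns on the right; the function names are illustrative only.

```python
import subprocess


def parse_pdffonts(output):
    """Return (font_name, is_embedded) pairs from pdffonts' table output."""
    fonts = []
    for line in output.splitlines()[2:]:  # skip the header and dashed rule
        parts = line.split()
        if len(parts) < 8:
            continue
        # The type column may contain spaces, so count from the right:
        # object ID (two fields), uni, sub, emb.
        fonts.append((parts[0], parts[-5] == "yes"))
    return fonts


def unembedded_fonts(pdf_path):
    """Names of fonts that the PDF references but does not embed."""
    result = subprocess.run(
        ["pdffonts", str(pdf_path)],
        capture_output=True, text=True, check=True,
    )
    return [name for name, embedded in parse_pdffonts(result.stdout)
            if not embedded]
```

A non-empty result from `unembedded_fonts()` means PDF/A conversion will have to pull in substitute fonts from somewhere, which is where the platform-dependent size increase comes from.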

Ghostscript has an option -dNONATIVEFONTMAP which causes Ghostscript to use its font library for font substitution, which means that the result would be consistent in file size and presentation (it will always substitute the same font), but this will reduce rendering quality when the system font is available for an exact rendering.
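As a sketch of what enabling that option might look like when building a Ghostscript command as an argument list, the hypothetical helper below (not an OCRmyPDF function) adds the switch to an existing list:

```python
def with_builtin_font_substitution(gs_args):
    """Return a copy of a Ghostscript argument list with -dNONATIVEFONTMAP.

    Hypothetical helper: assumes gs_args[0] is the executable name.
    """
    flag = "-dNONATIVEFONTMAP"
    if flag in gs_args:
        return list(gs_args)
    # Place the switch right after the executable so it precedes input files.
    return [gs_args[0], flag, *gs_args[1:]]
```

Applying it to the same command on macOS and Linux would be one way to test whether the output size becomes consistent across platforms.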

A further complication is that font substitution is something that I'd rather not have OCRmyPDF doing automatically, since it can permanently alter the presentation to an inferior rendering (as it does with the provided file, on both platforms).

I'll have to think about this issue some more but I suspect I will end up making "no font substitution" the default behavior (we abort if substitution happens) because users really should do a visual comparison of input/output, or try to install the missing font rather than rely on substitution, etc. Then I can add an option to switch on font substitution.

@jbarlow83 jbarlow83 added bug and removed third party issue Problem with a third party dependency labels Aug 14, 2024
@jbarlow83 jbarlow83 changed the title from "[Bug]: ocrmypdf increased the size by a multiple of 500 and no OCR in the result file, even the existing was lost." to "[Bug]: Large file size increases due to PDF/A font substitution" Aug 14, 2024