[Bug]: Large file size increases due to PDF/A font substitution #1369
Comments
BTW, running ocrmypdf --force-ocr on the big (corrupted) file reduced the size significantly again, and OCR was available again.
I ran your test file and ended up with an acceptable file size increase of 15.5% (66k -> 76k) instead of the dramatic increase you reported. I have the same dependency versions you do, although it looks like you are using macOS and I am using Linux + Homebrew, which should still be very close. Would you mind running
Darwin Ferdi-MacBook-Air.local 23.5.0 Darwin
gs --version
blow up: pdfa.ps (9KB) > pdfa.pdf (32.8MB)
the debug.log
and YES, CLI on MacBook is not trivial.
@ferdiga I obviously don't have access to the file, since it's encrypted. Regardless, may I suggest attempting to load pdfa.ps in aeroplane mode? If it's 9KB, I suspect it references external assets (I don't know if that's possible for .ps files, but technically it could be a .pdf file with the wrong extension), which I think OCRmyPDF embeds if you use PDF/A.
pdfa.ps just provides an sRGB ICC profile and a little metadata, which Ghostscript requires for PDF/A conversion but does not supply itself. It's not an issue.
I confirmed that the issue is reproducible on macOS but not Linux. It's almost certainly a Ghostscript issue at this point -- some discrepancy between their Linux and macOS versions.
@ferdiga Do I have permission to share your encrypted file with Artifex (Ghostscript's maintainer), or do you want to report the issue to them yourself? For your input file, Ghostscript on macOS causes a dramatic increase in file size while Ghostscript on Linux does not, both at version 10.3.1, with a command line like:
Provide pdfa.ps and fix_docinfo.pdf as generated by OCRmyPDF using the
Yes, please
I figured out what was going on. Ghostscript is working correctly under the circumstances - no need to report to Artifex.
The provided PDF does not include all of its fonts, and PDF/A conversion requires all fonts to be embedded (PDF/A must be fully self-describing). On macOS, this means ~35 MB of fonts get inserted into the file; on Linux and elsewhere, the fonts that happen to be picked for substitution are smaller. (The macOS fonts don't look any better, fwiw. I imagine it has more to do with bundling large Unicode character sets.) Ghostscript has an option
A further complication is that font substitution is something that I'd rather not have OCRmyPDF doing automatically, since it can permanently alter the presentation to an inferior rendering (as it does with the provided file, on both platforms). I'll have to think about this issue some more, but I suspect I will end up making "no font substitution" the default behavior (we abort if substitution happens), because users really should do a visual comparison of input/output, or try to install the missing font rather than rely on substitution, etc. Then I can add an option to switch on font substitution.
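For anyone who wants to check in advance whether a PDF is likely to trigger font substitution, here is a minimal sketch that lists fonts with no embedded font program. It is not part of OCRmyPDF; it assumes you have dumped the file with qpdf --json (as done for the files attached to this issue) and that PDF names appear in that JSON as plain strings such as "/FontDescriptor". The function names are hypothetical, and the check is a heuristic: it only looks for font descriptors without a /FontFile, /FontFile2 or /FontFile3 entry, and for simple fonts that have no descriptor at all.

```python
import json

# Keys under which a font descriptor points at an embedded font program.
FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")
# Simple font subtypes that need a descriptor (and font file) to be embedded.
SIMPLE_SUBTYPES = ("/Type1", "/MMType1", "/TrueType")


def walk(node):
    """Yield every dict nested anywhere inside a decoded JSON document."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for value in node:
            yield from walk(value)


def unembedded_fonts(doc):
    """Return sorted names of fonts that appear to lack an embedded font program."""
    names = set()
    for d in walk(doc):
        kind = d.get("/Type")
        if kind == "/FontDescriptor":
            # A descriptor without any font file entry means "not embedded".
            if not any(key in d for key in FONT_FILE_KEYS):
                names.add(d.get("/FontName", "(unnamed)"))
        elif kind == "/Font" and d.get("/Subtype") in SIMPLE_SUBTYPES:
            # A simple font with no descriptor relies on the viewer's own fonts.
            if "/FontDescriptor" not in d:
                names.add(d.get("/BaseFont", "(unnamed)"))
    return sorted(names)


def unembedded_fonts_in_file(json_path):
    """Convenience wrapper: load a qpdf JSON dump from disk and inspect it."""
    with open(json_path, encoding="utf-8") as fh:
        return unembedded_fonts(json.load(fh))
```

Usage would look like running qpdf --json input.pdf > input.json and then calling unembedded_fonts_in_file("input.json"). A non-empty result suggests that PDF/A conversion will have to substitute fonts, so the output deserves a visual comparison with the input.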
Describe the bug
In my case, creating a PDF/A increased the file size by a factor of 500!
IMO I identified the culprit: gs cannot handle mixed portrait and landscape pages well.
After separating the portrait and landscape pages into two separate files, ocrmypdf performed extremely well and reduced the file size of each file. THIS WAS TRUE FOR ONE SET OF TYPICAL FILES - BUT NOT FOR OTHERS
A solution could be to split each file into single-page files, run ocrmypdf (and hence gs) on each, and put them together again? - DOES NOT SOLVE THE PROBLEM
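To put a number on the growth described above, a small helper like the following could compare the input and output sizes. This is only a sketch; the function names are hypothetical and the file paths are whatever you passed to ocrmypdf.

```python
import os


def growth(before_bytes, after_bytes):
    """Return (factor, percent_increase) for a size change."""
    if before_bytes <= 0:
        raise ValueError("before_bytes must be positive")
    factor = after_bytes / before_bytes
    return factor, (factor - 1) * 100


def file_growth(input_pdf, output_pdf):
    """Compare the on-disk sizes of two files."""
    return growth(os.path.getsize(input_pdf), os.path.getsize(output_pdf))
```

Calling file_growth("input.pdf", "output.pdf") returns the growth factor and the percentage increase, which makes reports easier to compare across platforms.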
Steps to reproduce
Files
Here is the JSON representation, produced using qpdf --json <>
Monatsbericht zum 30.06.2023-json.pdf
Monatsbericht zum 30.06.2023-ocr-json.pdf
the log file
ocrmypdf.log
encrypted original file
test.zip
shows the size after ocrmypdf
How did you download and install the software?
Homebrew
OCRmyPDF version
16.4.2
Relevant log output
No response