Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set forgiving encoding fallback when parsing Po file #515

Closed
wants to merge 2 commits into from

Conversation

aleasto
Copy link

@aleasto aleasto commented Aug 20, 2024

Before reading the expected encoding from the file itself, we first try to read it as UTF-8 so we can at least get to read the line that specifies the encoding.

For some reason, $PerlIO::encoding::fallback is set to FB_CROAK when we reach this part of the code so the program would instead exit at the first unknown character.

@aleasto
Copy link
Author

aleasto commented Aug 20, 2024

I believe this is issue #495 errr maybe not

But i found it while building debian's pppconfig instead, which has an iso-8859-1 po file.
buildlog_ubuntu-oracular-amd64.pppconfig_2.3.28_BUILDING.txt.gz

@aleasto
Copy link
Author

aleasto commented Aug 20, 2024

FB_DEFAULT also gets rid of any warnings.

Printing warnings on the first pass, in case the encoding was not UTF-8, is a bit useless. However on the second pass, with the encoding parsed from the file itself it may be useful to print out some warnings (or even error-out). Thoughts?

@mquinson
Copy link
Owner

Hello,

thanks for your contribution. I would indeed like to display the errors on problem. Also, I think we should add a test enforcing that we do the right thing in case of problems. I will try to do the test, but I would appreciate if you could make sure that the encoding errors are still displayed on need, please.

Thanks,

@aleasto aleasto force-pushed the encoding branch 2 times, most recently from b00fa0b to 6caa868 Compare September 2, 2024 13:10
@aleasto
Copy link
Author

aleasto commented Sep 2, 2024

This revision will print warnings on the second pass only, when we request a specific charset.

It also means that we will run a second pass with the same UTF-8 encoding in case the file did not specify charset= because on the first pass a mismatch of encoding would have been silent otherwise. It's a small price to pay to keep the same logic structure, I don' know if you'd consider it acceptable.

@mquinson
Copy link
Owner

mquinson commented Sep 2, 2024

Hello @aleasto, thanks for your contribution. Could you please add a little test example to your PR? You don't need to integrate it to the testing infrastructure (even if it'd be great if you could), but a source document and the corresponding config file would be necessary to ensure that this feature keeps working as expected in the future.

Thanks in advance,
Mt

Before reading the expected encoding from the file itself, we first
try to read it as UTF-8 so we can at least get to read the line that
specifies the encoding.

For some reason, $PerlIO::encoding::fallback is set to FB_CROAK when
we reach this part of the code so the program would instead exit
at the first unknown character.

Unless we were requested a specific encoding, do not even print any
warning about it.
@aleasto
Copy link
Author

aleasto commented Sep 3, 2024

I couldn't quite figure out how to integrate the test, but you can verify that

cd t/charset/implicit-iso8859;
po4a po4a.conf --force

fails without the first commit.

@gemmaro
Copy link
Contributor

gemmaro commented Sep 11, 2024

The test seems to fail at the moment.
perl Build.PL to generate the Build file, and ./Build test --test_files=t/charset.t to run the added tests (or without the option for full cases.)

See also https://metacpan.org/pod/Module::Build#test.

@aleasto
Copy link
Author

aleasto commented Sep 11, 2024

Yea I'm aware.
I don't have any experience with both gettext and perl, so I'd appreciate help in writing a proper test based on this experiment, that fails in master and is meant to be fixed in this PR:

cd t/charset/implicit-iso8859;
po4a po4a.conf --force

@mquinson
Copy link
Owner

I'll integrate the test on it at some point in the future, but right now I do not have the time to work on it. Any help would be welcome, of course. Sorry for the delay

@mquinson
Copy link
Owner

If I'm seeing correctly, the test is failing because the produced file is still in UTF-8 and not in Latin-1 as advised by the PO file.

@aleasto, I'm a bit lost. What's the purpose of that test, actually? I don't see any Latin-1 char in the t/charset/implicit-iso8859/iso8859.pod input file. I would have thought that the implemented feature would be better tested with such a char?

@aleasto
Copy link
Author

aleasto commented Sep 15, 2024

implicit-iso8859/iso8859.nb.po uses \xA0 (non-breaking space, thus the "nb" locale id 😛). It's visible in some editors, it might be invisible in yours.

@mquinson
Copy link
Owner

That's a weird local name, in this case, because it already exists: https://www.localeplanet.com/icu/nb/index.html

I'd prefer a visible char, such as é or à, please.

Then, I had the feeling that the only broken part of your test is that the produced translation file is in UTF8 despite the fact that the input was in another encoding. I'm not sure whether that's the wanted behavior, but if so, the test should be further fixed.

Thanks a lot for your help, and sorry for being distant these days, real life is really packed.

@aleasto
Copy link
Author

aleasto commented Sep 25, 2024

I had the feeling that the only broken part of your test is that the produced translation file is in UTF8 despite the fact that the input was in another encoding. I'm not sure whether that's the wanted behavior, but if so, the test should be further fixed.

I'm not sure either. I'm replicating the setup of pppconfig assuming it is correct as it worked without complains before po4a 0.70.
How do you advise the test be fixed? And should pppconfig also be fixed? (pardon my ignorance around gettext 🫣)

@Fat-Zer
Copy link
Contributor

Fat-Zer commented Oct 28, 2024

Sorry if I'm interjecting, but why not to fix the underlying issue (that po4a tries to read the whole file in a wrong encoding) in the first place? i.e. read only up to the first msgid/msgstr and only then imbue the file with a correct encoding.

Besides that it feels like a setting to use relaxed attitude towards encoding should be global rather than local. And if it's local it should be reset after the function (i.e. set $PerlIO::encoding::fallback to a previous value).

@mquinson
Copy link
Owner

mquinson commented Nov 5, 2024

I think that this PR is superseeded by !538, which fixes the underlying problem. @aleasto could you please confirm that the current git version fixes the problem you had in the first place?

Thanks, Mt

@aleasto
Copy link
Author

aleasto commented Nov 5, 2024

Seems to work for the (manual) test case in this PR.
Thanks.

@aleasto aleasto closed this Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants