Set forgiving encoding fallback when parsing Po file #515

aleasto · 2024-08-20T13:10:27Z

Before reading the expected encoding from the file itself, we first try to read it as UTF-8 so we can at least get to read the line that specifies the encoding.

For some reason, $PerlIO::encoding::fallback is set to FB_CROAK when we reach this part of the code so the program would instead exit at the first unknown character.

aleasto · 2024-08-20T13:13:20Z

~~I believe this is issue #495~~ errr maybe not

But i found it while building debian's pppconfig instead, which has an iso-8859-1 po file.
buildlog_ubuntu-oracular-amd64.pppconfig_2.3.28_BUILDING.txt.gz

aleasto · 2024-08-20T14:22:32Z

FB_DEFAULT also gets rid of any warnings.

Printing warnings on the first pass, in case the encoding was not UTF-8, is a bit useless. However on the second pass, with the encoding parsed from the file itself it may be useful to print out some warnings (or even error-out). Thoughts?

mquinson · 2024-08-27T11:52:42Z

Hello,

thanks for your contribution. I would indeed like to display the errors on problem. Also, I think we should add a test enforcing that we do the right thing in case of problems. I will try to do the test, but I would appreciate if you could make sure that the encoding errors are still displayed on need, please.

Thanks,

aleasto · 2024-09-02T13:13:31Z

This revision will print warnings on the second pass only, when we request a specific charset.

It also means that we will run a second pass with the same UTF-8 encoding in case the file did not specify charset= because on the first pass a mismatch of encoding would have been silent otherwise. It's a small price to pay to keep the same logic structure, I don' know if you'd consider it acceptable.

mquinson · 2024-09-02T15:06:11Z

Hello @aleasto, thanks for your contribution. Could you please add a little test example to your PR? You don't need to integrate it to the testing infrastructure (even if it'd be great if you could), but a source document and the corresponding config file would be necessary to ensure that this feature keeps working as expected in the future.

Thanks in advance,
Mt

Before reading the expected encoding from the file itself, we first try to read it as UTF-8 so we can at least get to read the line that specifies the encoding. For some reason, $PerlIO::encoding::fallback is set to FB_CROAK when we reach this part of the code so the program would instead exit at the first unknown character. Unless we were requested a specific encoding, do not even print any warning about it.

aleasto · 2024-09-03T10:40:15Z

I couldn't quite figure out how to integrate the test, but you can verify that

cd t/charset/implicit-iso8859;
po4a po4a.conf --force

fails without the first commit.

gemmaro · 2024-09-11T11:34:28Z

The test seems to fail at the moment.
perl Build.PL to generate the Build file, and ./Build test --test_files=t/charset.t to run the added tests (or without the option for full cases.)

See also https://metacpan.org/pod/Module::Build#test.

aleasto · 2024-09-11T11:40:45Z

Yea I'm aware.
I don't have any experience with both gettext and perl, so I'd appreciate help in writing a proper test based on this experiment, that fails in master and is meant to be fixed in this PR:

cd t/charset/implicit-iso8859;
po4a po4a.conf --force

mquinson · 2024-09-11T19:34:49Z

I'll integrate the test on it at some point in the future, but right now I do not have the time to work on it. Any help would be welcome, of course. Sorry for the delay

mquinson · 2024-09-15T16:06:55Z

If I'm seeing correctly, the test is failing because the produced file is still in UTF-8 and not in Latin-1 as advised by the PO file.

@aleasto, I'm a bit lost. What's the purpose of that test, actually? I don't see any Latin-1 char in the t/charset/implicit-iso8859/iso8859.pod input file. I would have thought that the implemented feature would be better tested with such a char?

aleasto · 2024-09-15T16:12:31Z

implicit-iso8859/iso8859.nb.po uses \xA0 (non-breaking space, thus the "nb" locale id 😛). It's visible in some editors, it might be invisible in yours.

mquinson · 2024-09-22T23:06:32Z

That's a weird local name, in this case, because it already exists: https://www.localeplanet.com/icu/nb/index.html

I'd prefer a visible char, such as é or à, please.

Then, I had the feeling that the only broken part of your test is that the produced translation file is in UTF8 despite the fact that the input was in another encoding. I'm not sure whether that's the wanted behavior, but if so, the test should be further fixed.

Thanks a lot for your help, and sorry for being distant these days, real life is really packed.

aleasto · 2024-09-25T08:00:21Z

I had the feeling that the only broken part of your test is that the produced translation file is in UTF8 despite the fact that the input was in another encoding. I'm not sure whether that's the wanted behavior, but if so, the test should be further fixed.

I'm not sure either. I'm replicating the setup of pppconfig assuming it is correct as it worked without complains before po4a 0.70.
How do you advise the test be fixed? And should pppconfig also be fixed? (pardon my ignorance around gettext 🫣)

Fat-Zer · 2024-10-28T08:27:43Z

Sorry if I'm interjecting, but why not to fix the underlying issue (that po4a tries to read the whole file in a wrong encoding) in the first place? i.e. read only up to the first msgid/msgstr and only then imbue the file with a correct encoding.

Besides that it feels like a setting to use relaxed attitude towards encoding should be global rather than local. And if it's local it should be reset after the function (i.e. set $PerlIO::encoding::fallback to a previous value).

mquinson · 2024-11-05T07:22:58Z

I think that this PR is superseeded by !538, which fixes the underlying problem. @aleasto could you please confirm that the current git version fixes the problem you had in the first place?

Thanks, Mt

aleasto · 2024-11-05T08:07:44Z

Seems to work for the (manual) test case in this PR.
Thanks.

aleasto force-pushed the encoding branch 2 times, most recently from b00fa0b to 6caa868 Compare September 2, 2024 13:10

aleasto force-pushed the encoding branch from 6caa868 to 3ac43c8 Compare September 3, 2024 10:35

Add test case for charset=iso-8859-1

9f89cdd

aleasto force-pushed the encoding branch from 3ac43c8 to 9f89cdd Compare September 25, 2024 07:55

Fat-Zer mentioned this pull request Oct 29, 2024

Better accommodations for reading non-UTF-8 PO files #538

Merged

mquinson added the more info label Nov 5, 2024

aleasto closed this Nov 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Set forgiving encoding fallback when parsing Po file #515

Set forgiving encoding fallback when parsing Po file #515

aleasto commented Aug 20, 2024

aleasto commented Aug 20, 2024 •

edited

Loading

aleasto commented Aug 20, 2024

mquinson commented Aug 27, 2024

aleasto commented Sep 2, 2024

mquinson commented Sep 2, 2024

aleasto commented Sep 3, 2024

gemmaro commented Sep 11, 2024

aleasto commented Sep 11, 2024

mquinson commented Sep 11, 2024

mquinson commented Sep 15, 2024

aleasto commented Sep 15, 2024

mquinson commented Sep 22, 2024

aleasto commented Sep 25, 2024

Fat-Zer commented Oct 28, 2024

mquinson commented Nov 5, 2024

aleasto commented Nov 5, 2024

Set forgiving encoding fallback when parsing Po file #515

Set forgiving encoding fallback when parsing Po file #515

Conversation

aleasto commented Aug 20, 2024

aleasto commented Aug 20, 2024 • edited Loading

aleasto commented Aug 20, 2024

mquinson commented Aug 27, 2024

aleasto commented Sep 2, 2024

mquinson commented Sep 2, 2024

aleasto commented Sep 3, 2024

gemmaro commented Sep 11, 2024

aleasto commented Sep 11, 2024

mquinson commented Sep 11, 2024

mquinson commented Sep 15, 2024

aleasto commented Sep 15, 2024

mquinson commented Sep 22, 2024

aleasto commented Sep 25, 2024

Fat-Zer commented Oct 28, 2024

mquinson commented Nov 5, 2024

aleasto commented Nov 5, 2024

aleasto commented Aug 20, 2024 •

edited

Loading