Asciidoc / Keeping original formatting #416
Comments
Hello, I think that there are several points mixed together in this issue. Let me list them as I see them:
I have no problem with the segmentation into paragraphs. That's fine, that's exactly how I need it, and how it should be. I only want the original text layout to be preserved inside the message id, like this:
and not wrapped like this:
I don't want anything more! In fact, index terms are very difficult to work with. Personally, I only use them at the beginning or end of a sentence and put them on a separate line.

Now I have to go back a bit to how I use deepl. When I let deepl translate a message id, I convert it to XML. This means that I replace formatting such as "bold" or "italic", as well as links or references, with an XML element. I also wrap index terms in an XML element. In the REST request, I can configure which XML elements should be translated and which should not. At the moment, I translate index terms separately and mark their XML element as "don't translate". However, this only works to a limited extent, since deepl interprets "don't translate" XML elements as part of a sentence, so the sentence may be distorted after translation.

The deepl REST API offers the parameter "split_sentences" for XML content. Currently, I have to use the value "nonewlines" (= splits on punctuation only, ignoring newlines), but I would prefer to use the value "1" (= splits on punctuation and on newlines); see also https://www.deepl.com/docs-api/translate-text/translate-text/. If you preserve the formatting inside the message id, I can handle the index term as a separate sentence, and the following sentence starting with "Lorem ipsum dolor" would be translated correctly and not mixed up with the index term. That's all I want.

Until now, I have had no experience with Weblate and I don't know whether Weblate has similar options to the deepl API. Therefore, I cannot give you a more generic rule yet. If you know a webpage where these options are described for Weblate, I'm interested, and I will then think about a generic rule. I hope you can understand my wish better now. If something is still unclear, just ask again.

Side note: in fact, deepl is not very good at translating single sentences. Therefore, I always prepend a sentence to set the context. That improves the translation a lot, but it also consumes my contingent of translatable characters. I have proposed to deepl that they add a context parameter to their API so that context tags or references can be passed.
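For illustration, a minimal sketch of the conversion step described above could look like this (the element names and regexes are invented for this example, they are not my actual ones):

```python
import re

def asciidoc_to_xml(text):
    """Sketch: wrap AsciiDoc inline markup in XML elements so that the
    translation API can be told which elements to translate and which
    to leave alone. The element names (<idx>, <b>, <i>) are made up."""
    # concealed index terms (((term))) -> element that will be ignored
    text = re.sub(r"\(\(\((.+?)\)\)\)", r"<idx>\1</idx>", text)
    # bold *word* and italic _word_ -> elements that will be translated
    text = re.sub(r"\*(.+?)\*", r"<b>\1</b>", text)
    text = re.sub(r"_(.+?)_", r"<i>\1</i>", text)
    return text

print(asciidoc_to_xml("(((Primus Magnificus)))\n*Lorem* ipsum _dolor_ sit amet."))
# The index term stays on its own line as <idx>Primus Magnificus</idx>.
```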
The thing is that the two versions of the po file that you presented are totally identical from the semantic point of view of the po file format. From what I understand, you are processing the po file with a dedicated parser that adds requirements on top of the po file specification. The formatting of the po file is not managed by the source format handler (such as the asciidoc parser): all the format handlers call methods of the po file object, and it is up to the po file object to write the output file according to its own rules. So, to circumvent this, we would have to add a way for the format handler to impose some writing rules on the po file object, which is not a simple change, I presume. I still think that your issue can be solved in another way, because you are imposing a solution on a problem that does not require it. I mean that your main problem is the presence of the index term in the segment of the paragraph. If I were to try something concerning your issue, I would just remove all the index entries from all the segments and put them in their own segments. This would have two positive outcomes:
What do you think?
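As a rough sketch of that idea (a hypothetical helper, not the actual po4a implementation):

```python
import re

INDEX_RE = re.compile(r"\(\(\((.+?)\)\)\)")

def split_index_entries(paragraph):
    """Sketch of the proposed behaviour: pull the concealed index terms
    (((...))) out of a paragraph so that each of them becomes its own
    segment, leaving a clean paragraph for the translator."""
    entries = INDEX_RE.findall(paragraph)
    cleaned = INDEX_RE.sub("", paragraph).strip()
    return entries, cleaned

entries, text = split_index_entries(
    "(((Primus Magnificus)))Lorem ipsum dolor sit amet.")
print(entries)  # ['Primus Magnificus']
print(text)     # Lorem ipsum dolor sit amet.
```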
Yes, that's right, but I wouldn't call a set of eleven regular expressions a parser. But that's not relevant ;-) I like your approach of solving the problem in the asciidoc module, since the problem only affects asciidoc. I see the same advantages in that solution as you have already described, and I would prefer it. Furthermore, you could then also think about mapping table cells, for example, each as a separate message to be translated according to the same scheme. After thinking a bit more about the index terms, however, I still see two small detail problems:
I hope that my thoughts help you and I am looking forward to your feedback.
This patch addresses the issue of keeping transparent index entries in paragraphs. It does not address "visible" index entries, which are formatted with ((...)). This addresses GH mquinson#416.
@suddenfall I just pushed a tentative fix for your issue. Can you test it, give feedback on the feature, and point out the failing cases? I'm pretty sure I haven't covered everything. BTW, I'd also be interested in your regexps for transforming inline formats, if you can share them.
Hi, I have tested your fix with various patterns. It works very well. The observed behavior is that index terms are pushed to the beginning of a text segment. I like it. I have also tested my asciidoc project. Unfortunately, I ran into an issue. I wrote the following in the document:
{ntb} and {nte} are empty variables I use to mark text that shouldn't be translated (ntb => no translation begin; nte => no translation end). I got the following error message:
I have also re-read https://docs.asciidoctor.org/asciidoc/latest/sections/user-index/ and I have learned some new details. I didn't know the difference between flow and concealed index terms. Both types are handled correctly. But I have found two special cases that are not yet considered:
No problem. It's my first Python code, because bash doesn't support po files ;-). The code only meets my private requirements, is written quick and dirty, and will probably not be suitable for others.
Have a nice weekend :-)
This takes into account:
* variables in the index entry
* quoted index entries
Thank you for your code. Very instructive indeed.
One other surprising thing is that you are not completely concealing the refs, variable names and so on. For instance, for variables, I would rather have expected some regex in the line: txt = re.sub("{(.*?)}","<v name=\"\g<1>\"/>", txt, 0, re.DOTALL) so that the name of the variable is hidden from the translator, thus limiting the chances of errors.
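For instance, a complete round trip could look like this (the restore step is only an illustration of how the names could be put back after translation, not existing po4a behaviour):

```python
import re

txt = "See the {product-name} manual, chapter {chap}."

# Hide the variable names behind placeholder nodes, as suggested above.
hidden = re.sub(r"\{(.*?)\}", r'<v name="\g<1>"/>', txt)
print(hidden)
# See the <v name="product-name"/> manual, chapter <v name="chap"/>.

# After translation, turn the placeholders back into AsciiDoc attributes.
restored = re.sub(r'<v name="(.*?)"/>', r"{\g<1>}", hidden)
print(restored)
# See the {product-name} manual, chapter {chap}.
```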
Today, I have refactored some documents and verified the changes in the translated and generated pdfs. I have also verified your last fix. Overall it was a lot of work. Anyway, everything is fine now! The fix works perfectly.
I definitely have multiple nte blocks per sentence. So far, it works fine for me. I match the blocks non-greedily using "?". If I misunderstood you, feel free to ask again.
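For reference, a small illustration of why the non-greedy "?" matters when several {ntb}…{nte} blocks appear in one sentence (not taken from my actual script):

```python
import re

s = "Keep {ntb}foo{nte} and {ntb}bar{nte} untouched."

# Greedy: swallows everything between the first {ntb} and the last {nte}.
print(re.findall(r"\{ntb\}(.*)\{nte\}", s))   # ['foo{nte} and {ntb}bar']

# Non-greedy with "?": each block is matched separately.
print(re.findall(r"\{ntb\}(.*?)\{nte\}", s))  # ['foo', 'bar']
```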
As far as I can remember, the "::" is an issue when translating. Deepl "fixes" this misspelling to ":".
I'm not sure what you mean exactly here. Can you possibly give me an example?
I'm not sure what you mean. Maybe you mean (explicit) section IDs (https://docs.asciidoctor.org/asciidoc/latest/sections/custom-ids/)? If so, I solved the problem this way:
Thanks for the hint. That's exactly what I want. I have fixed my docs today.
Unfortunately, I can't remember exactly what the problem was. It must have been a problem with Deepl though. Deepl probably discards "\n" in certain situations, making this workaround necessary to preserve the original formatting.
Yes, that will also work, but it requires more characters. You can set the ignored tags on a request to Deepl; then the ignored tags are not translated. Here is the code snippet showing how I call the deepl translator:
I use the lib https://pypi.org/project/deepl/.
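Since the original snippet is not reproduced above, here is a minimal sketch of such a call with the deepl package; the auth key, tag names and language codes are placeholders, and the exact parameter values should be checked against the library documentation:

```python
import deepl  # https://pypi.org/project/deepl/

# Placeholder auth key; ignore_tags names depend on the XML conversion used.
translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")

xml_text = "<idx>Primus Magnificus</idx>\n<b>Lorem</b> ipsum <i>dolor</i> sit amet."

result = translator.translate_text(
    xml_text,
    source_lang="EN",
    target_lang="DE",
    tag_handling="xml",                                # input is XML
    ignore_tags=["idx"],                               # do not translate index terms
    split_sentences=deepl.SplitSentences.NO_NEWLINES,  # REST equivalent: split_sentences=nonewlines
)
print(result.text)
```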
At this point I would like to thank you for your great support. It was a lot of fun!
Delighted to read this!
My mistake, you're right.
I understand that you don't want Deepl to modify the string. What I don't understand is how po4a can produce such strings. Definition lists are supposed to be managed by po4a, which removes the "::". That's why I'd be thankful for an example of such a case.
The idea is this: normally, index terms which are the same appear on the same line in the index. For instance:
indicates that these terms are met at pages 34, 56 and 78. But for this to work correctly, the references at these pages need to hold exactly (((Mode,Windows))). Otherwise, the terms won't be sorted under the same entry. This relation must hold homomorphically during the translation, that is, the translated terms must be identical in all their places. This is manageable with concealed terms, because they are extracted and appear in the po file under a single po entry. Thus, they can only be translated to a unique term that will be copied to all the original locations. But for flow terms, the issue is that we are not sure that the translator will stick to the same terms where identical terms were appearing in the original document. In fact, it could even be impossible: think of German declensions that could change terms that were identical in English. To me, texts that are designed for internationalization should not use flow terms.
OK, that works as I thought. The anchors are kept away from translators. Still, you are using the regex txt = re.sub(r"\[\[(.*?)\]\]", r"\g<1>", txt, 0, re.DOTALL) which hides them. I was expecting po4a to do it for you.
I'm still surprised that you can make Deepl handle carriage returns just like another grammatical block in the sentence.
Ah, that makes sense; I didn't know Deepl was able to use XML nodes as placeholders. In the end, I think a big part of your script could be made into a special mode of po4a, if you allow. I'd like to experiment with how other tools may handle these transformations. Thanks for your support, it makes po4a more usable.
I have found two examples:
and
I agree with you. Concealed index terms should always be preferred to flow index terms. Bad luck that I have also used flow terms. Time for another refactoring (sometime).
I think inline anchors are comparable to flow index terms. Both expressions are used somewhere inside a sentence. The difference is that flow index terms should be translated (and then it's better to use concealed index terms, as you have mentioned), but inline anchors shouldn't be translated. As far as I understand you, you want to hide the inline anchor from the translator, because he doesn't have to translate the word. That's a nice idea, but the translator has to decide where the inline anchor should be placed inside the translated text. I don't think you can solve this with po4a. Moving the anchors to the beginning of a segment is, in the case of my documents, not a good idea, because the reference would no longer make sense. Therefore, I would keep it as it is.
Hm, you are right. Maybe I should add "n" to the ignore tags. But on the other hand, it currently works without trouble and I don't want to change a running system ;-).
Yes, of course. I would be happy if others could benefit from it too. If you like, I can upload the whole set of scripts as attachments. Then you can see my full workflow, and that may give you some inspiration.
You're welcome.
And you found two bugs! I'll generate a fix shortly.
OK. I was about to propose the algorithm that you just rejected. I wonder if an ID really creates an anchor at the place where it appears in the text. Thinking about it, I would expect the ID to be attached to the first enclosing block or the adjacent inline quote, because it has to be attached to an HTML node.
I think that all the fixable bugs were fixed in the PR. Thank you @suddenfall again for taking the time to describe the bugs. The XML conversion is on the TODO list.
I have a feature request for the asciidoc module that may overlap with:
#291
I use Asciidoctor, and for me it is also a problem that the content from my master documents is wrapped in the po files and thus reformatted. I propose introducing an option in the Asciidoc module so that the original formatting is preserved in the po files.
Motivation
Currently, I use the deepl API. Reformatting the content makes it harder to use the API optimally. In addition, reformatting makes it necessary for me to check the translated documents after rendering to PDF to see whether the original formatting from the master documents has been retained.
I have structured my master documents according to the principle: One line contains exactly one sentence.
That would fit perfectly with the deepl API, where a text line is interpreted as a sentence. Unfortunately, I have not found a way with po4a or the Asciidoc module to prevent the reformatting of the content.
Therefore I have to configure the deepl API so that sentences are recognized by punctuation marks only. This works, but it leads to problems with index terms, for example. If I write this content:
It results in the following message id:
The index term
(((Primus Magnificus)))
is prepended to the content on the same line. This can lead to upper and lower case being mixed up in the translation, and even to the sense of the entire sentence being distorted.
Plea for the feature
Keeping the original formatting from the master documents simplifies the whole process, as you only have to worry about the translation and not also about the formatting of the output, knowing that it will not be changed.
For AI translation tools such as deepl, a maximum line length is not relevant. On the contrary, limiting the line length is even counterproductive.
To ensure backwards compatibility, I suggest that this feature can be activated via an option.