
Asciidoc / Keeping original formating #416

Closed

suddenfall opened this issue May 22, 2023 · 15 comments

suddenfall commented May 22, 2023

I have a feature request for the asciidoc module that may overlap with:
#291

I use Asciidoctor, and for me it is also a problem that the content from my master documents is wrapped in the PO files and thus reformatted. I propose introducing an option in the Asciidoc module so that the original formatting is preserved in the PO files.

Motivation

Currently, I use the DeepL API. Reformatting the content makes it harder to use the API optimally. In addition, reformatting makes it necessary to check, after rendering the translated documents to PDF, whether the original formatting from the master documents has been retained.

I have structured my master documents according to the principle: One line contains exactly one sentence.

That would fit perfectly with the DeepL API, where each text line is interpreted as a sentence. Unfortunately, I have not found a way with po4a or the Asciidoc module to prevent the reformatting of the content.

Therefore I have to configure the DeepL API to recognize sentences by punctuation marks. This works, but leads to problems with index terms, for example. If I write this content:

== Maximus Primus
(((Primus Magnificus)))
Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. 
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. 
Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. 
Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

It results in the following message id:

"(((Primus Magnificus)))  Lorem ipsum dolor sit amet, consectetur adipisici "
"elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.  Ut enim "
"ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid "
"ex ea commodi consequat.  Quis aute iure reprehenderit in voluptate velit "
"esse cillum dolore eu fugiat nulla pariatur.  Excepteur sint obcaecat "
"cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id "
"est laborum."

The index term (((Primus Magnificus))) is prepended to the content on the same line. This can lead to upper and lower case being mixed up in the translation, and even to the meaning of the entire sentence being distorted.

Plea for the feature

Keeping the original formatting from the master documents simplifies the whole process: you only have to worry about the translation, not also about the formatting in the output, knowing that it will not be changed.

For AI translation tools such as DeepL, a maximum line length is irrelevant. On the contrary, limiting the line length is even counterproductive.

To ensure backwards compatibility, I suggest that this feature can be activated via an option.
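To make the difference concrete, here is a small Python sketch of the two serializations. `wrap_msgid` is a hypothetical stand-in for the gettext-style 77-column rewrapping, not po4a's actual writer:

```python
import textwrap

def wrap_msgid(paragraph, width=77):
    """Rewrap a paragraph the way gettext-style tools usually serialize
    msgids: collapse the source line breaks, then refill at a fixed width."""
    joined = " ".join(line.strip() for line in paragraph.splitlines())
    return textwrap.wrap(joined, width=width)

paragraph = (
    "(((Primus Magnificus)))\n"
    "Lorem ipsum dolor sit amet, consectetur adipisici elit, "
    "sed eiusmod tempor incidunt ut labore et dolore magna aliqua."
)

# Wrapped form: the index term and the first sentence share a line.
for line in wrap_msgid(paragraph):
    print('"%s"' % line)

# Preserved form (what this issue asks for): one source line per PO line.
for line in paragraph.splitlines():
    print('"%s"' % line)
```

In the wrapped form the index term is glued to the first sentence; in the preserved form it stays on its own line, exactly as in the master document.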


jnavila commented May 27, 2023

Hello

I think there are several points mixed together in this issue. Let me list them as I see them:

  • You are complaining that the default segmentation in the asciidoc module is the paragraph instead of the sentence. The reason is not clear, because it is mixed with the following points. Does DeepL really fail when translating a series of sentences instead of a single sentence at a time? My experience with DeepL in Weblate is rather the opposite: the paragraph level provides better context for translating. I'd like a clarification on whether it really makes sense to add an option for sentence-level segmentation.
  • You say that the paragraph level of segmentation makes you deal with the formatting of the output. The whole point of the asciidoc module, compared to the basic text module, is to know enough of asciidoc's block formatting syntax to extract it and spare translators from having to deal with it. The output of translated text is in a "normalized" form, safe for consumption by asciidoc formatters, and should not need to be reviewed by translators; if it does not compile to a correct PDF after a faithful translation of all the segments, then that is a bug in po4a.
  • You are complaining that index terms are spread across segments. I kind of agree that they should be a segment of their own and not pollute other segments, but because they can appear nearly anywhere, managing their position in translated content is a hard problem. Your use case is special with regard to the general rule. Or maybe you can propose a generic simplification on this matter.

@suddenfall

I have no problem with segmentation at the paragraph level. That's fine, that's exactly how I need it, and how it should be. I only want the original text layout to be preserved inside the message id, like this:

"(((Primus Magnificus)))"
"Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. "
"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. "
"Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. "
"Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

and not wrapped like this:

"(((Primus Magnificus)))  Lorem ipsum dolor sit amet, consectetur adipisici "
"elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.  Ut enim "
"ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid "
"ex ea commodi consequat.  Quis aute iure reprehenderit in voluptate velit "
"esse cillum dolore eu fugiat nulla pariatur.  Excepteur sint obcaecat "
"cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id "
"est laborum."

I don't want anything more!

In fact, index terms are very difficult to work with. Personally, I only use them at the beginning or end of a sentence and put them on a separate line. Now I have to back up a bit and explain how I use DeepL. When I let DeepL translate a message id, I convert it to XML. This means that I replace, for example, formatting such as bold or italic, as well as links and references, with an XML element. I also wrap index terms in an XML element. In the REST request, I can configure which XML elements should be translated and which should not.

At the moment, I'm translating index terms separately and marking the XML element as "don't translate". However, this only works to a limited extent, since DeepL interprets "don't translate" XML elements as part of a sentence, and thus the sentence may be distorted after translation.

The DeepL REST API offers the parameter "split_sentences" for XML content. Currently, I have to use the value "nonewlines" (= split on punctuation only, ignoring newlines), but I would prefer to use the value "1" (= split on punctuation and on newlines); see also https://www.deepl.com/docs-api/translate-text/translate-text/.
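The effect of the two values can be mimicked locally with a rough sketch. This is my own approximation of the behavior described in the DeepL docs, not DeepL's actual segmenter:

```python
import re

def split_sentences(text, mode):
    """Approximate DeepL's split_sentences parameter:
    "1"          -> split on punctuation and on newlines
    "nonewlines" -> split on punctuation only, newlines are ignored"""
    if mode == "nonewlines":
        text = text.replace("\n", " ")
        return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if mode == "1":
        parts = []
        for line in text.splitlines():
            parts.extend(s for s in re.split(r"(?<=[.!?])\s+", line) if s)
        return parts
    raise ValueError(mode)

text = "(((Primus Magnificus)))\nLorem ipsum dolor sit amet."

# "nonewlines": the index term is glued to the following sentence.
print(split_sentences(text, "nonewlines"))

# "1": the index term stays a segment of its own.
print(split_sentences(text, "1"))
```

With line-preserving PO output, mode "1" would keep the index term isolated from the translatable sentence.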

If you preserve the formatting inside the message id, I can handle the index term as a separate sentence, and the following sentence starting with "Lorem ipsum dolor" will be translated correctly without being mixed up with the index term. That's all I want.

Until now, I have had no experience with Weblate, and I don't know whether Weblate has options similar to the DeepL API. Therefore, I cannot give you a more generic rule yet. If you know a webpage where these options are described for Weblate, I'd be interested, and I will then think about a generic rule.

I hope you can understand my wish better. If something is still unclear, just ask again.

Side note: in fact, DeepL is not very good at translating single sentences. Therefore, I always prepend a sentence to set the context. That improves the translation a lot, but also consumes my quota of translatable characters. I have proposed that DeepL add a context parameter to their API for passing context tags or references.


jnavila commented May 31, 2023

The thing is that the two versions of the po file that you presented are totally identical from the semantic point of view in the po file format. From what I understand, you are processing the po file with a dedicated parser that adds requirements to the po file specification.

The formatting of the po file is not managed by the source format handler (such as the asciidoc parser): all the format handlers call methods of the po file object, and it is up to the po file object to write the output file according to its own rules. So, to circumvent this, we would have to add a way for the format handler to impose some writing rules on the po file object; not a simple change, I presume.

I still think that your issue can be solved another way, because you are imposing a solution on a problem that does not require it. I mean that your main problem is the presence of the index term in the paragraph's segment. If I were to try something concerning your issue, I would just remove all the index entries from all the segments and put them in their own segments. This would have two positive outcomes:

  • the paragraph would be free of disturbing chunks of words that make DeepL misinterpret it.
  • the index terms would be correctly tagged in the po file, grouped together, and translated at once when they appear in different places in the document. They need to be translated anyway, but they need to be kept identical. They should even be split into their subterms so that the hierarchy of terms is preserved.

What do you think?
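The extraction idea can be sketched in a few lines (a hypothetical helper, not the patch that was later merged in po4a):

```python
import re

# Concealed index terms: (((...))), optionally followed by their own newline.
CONCEALED = re.compile(r"\(\(\((.+?)\)\)\)\n?")

def extract_index_terms(paragraph):
    """Pull concealed index terms out of a paragraph and return
    (clean_paragraph, terms) so each term can become its own segment."""
    terms = CONCEALED.findall(paragraph)
    clean = CONCEALED.sub("", paragraph)
    return clean, terms

clean, terms = extract_index_terms(
    "(((Primus Magnificus)))\nLorem ipsum dolor sit amet."
)
print(terms)   # the index terms, ready for their own PO entry
print(clean)   # the paragraph, free of index markup
```

Identical terms extracted this way would deduplicate into a single PO entry, which also guarantees they are translated consistently.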

@suddenfall

The thing is that the two versions of the po file that you presented are totally identical from the semantic point of view in the po file format. From what I understand, you are processing the po file with a dedicated parser that adds requirements to the po file specification.

Yes, that's right, but I wouldn't call a set of 11 regular expressions a parser. But that's not relevant ;-)

I like your approach of solving the problem in the asciidoc module, since the problem only affects asciidoc. I see the same advantages in the solution as you described, and I would prefer your solution.

Furthermore, you could then also think about, for example, mapping each table cell as a separate message to be translated, according to the same scheme.

After a bit of thinking about the index terms, however, I still see two small detail problems:

  • How should index terms be handled that are somewhere in the middle of a sentence, not at the beginning or end? I suspect this problem is rather academic. And even if the index term is not at the beginning of the sentence, I haven't (yet) thought of a use case where it would be a problem to move the index term to the beginning of the sentence after translation.

  • The next academic detail problem could be if several sentences, each with one index term, are put together into one message to be translated. Is it then better to map the individual sentences as separate messages in order to be able to correctly reassign the index terms later?

I hope my thoughts help you, and I am looking forward to your feedback.

jnavila added a commit to jnavila/po4a that referenced this issue Jun 7, 2023
This patch addresses the issue of keeping transparent index entries in
paragraphs. It does not address "visible" index entries, which are
formatted with ((...)).

This addresses GH mquinson#416.

jnavila commented Jun 7, 2023

@suddenfall I just pushed a tentative fix for your issue. Can you test it, give feedback on the feature, and point out the failing cases? I'm pretty sure I haven't covered everything.

BTW, I'd also be interested in your regexps for transforming inline formats if you can share them.

mquinson pushed a commit that referenced this issue Jun 7, 2023
This patch addresses the issue of keeping transparent index entries in
paragraphs. It does not address "visible" index entries, which are
formatted with ((...)).

This addresses GH #416.
@suddenfall

Hi,

I have tested your fix with various patterns. It works very well. The observed behavior is that index terms are pushed to the beginning of a text segment. I like it.

I have also tested my asciidoc project. Unfortunately, I ran into an issue. I wrote the following in the document:

(((Bedienelement, {ntb}Input Capture Source{nte})))

{ntb} and {nte} are empty variables I use to mark text that shouldn't be translated (ntb => no translation begin; nte => no translation end). I got the following error message:

Unescaped left brace in regex is passed through in regex; marked by <-- HERE in m/\(\(\(Bedienelement, { <-- HERE ntb}Onboard DAC{nte}\)\)\)\n?/ at /home/ds/perl5/lib/perl5/Locale/Po4a/AsciiDoc.pm line 1186.

I have also re-read https://docs.asciidoctor.org/asciidoc/latest/sections/user-index/ and learned some new details. I didn't know the difference between flow and concealed index terms. Both types are handled correctly.

But I have found 2 special cases that are not considered yet:

  1. indexterm:[...] is an alternative syntax to (((...))) and is not recognized during parsing.
  2. (((knight, "Arthur, King"))) should result in knight and Arthur, King, but currently it results in knight, "Arthur and King".
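The second case looks like a classic quote-aware split. A sketch of how it could be handled, using the standard library's CSV parser (not the fix that eventually landed in po4a):

```python
import csv
import io

def split_subterms(term_body):
    """Split the body of (((...))) into subterms, honoring quoted
    subterms such as "Arthur, King" which contain commas."""
    return next(csv.reader(io.StringIO(term_body), skipinitialspace=True))

print(split_subterms('knight, "Arthur, King"'))
# -> ['knight', 'Arthur, King']
```

A naive split on "," would instead yield the wrong 'knight', '"Arthur', and 'King"' pieces described above.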

BTW, I'd also be interested in your regexps for transforming inline formats if you can share them.

No problem. It's my first Python code, because bash doesn't support po files ;-). The code meets only my private requirements, is written quick and dirty, and will probably not suit others.

import re
import sys

def encode(txt, test=False):
    # Replace bare < and > with entities
    txt = re.sub(r"([^<])<([^<])", r"\g<1>&lt;\g<2>", txt, flags=re.DOTALL)
    txt = re.sub(r"([^>])>([^>])", r"\g<1>&gt;\g<2>", txt, flags=re.DOTALL)
    txt = txt.replace("<<<", "&lt;&lt;&lt;")

    # Replace xml entities with <e/>
    txt = re.sub(r"&([^ ^;]*?);", r"<e>\g<1></e>", txt, flags=re.DOTALL)

    # Replace "no translate blocks" with <x/>
    txt = re.sub(r"{ntb}(.*?){nte}", r"<x>\g<1></x>", txt, flags=re.DOTALL)
    if "{ntb}" in txt:
        print("=== encode error {ntb}:", txt)
        sys.exit(1)
    if "{nte}" in txt:
        print("=== encode error {nte}:", txt)
        sys.exit(1)
    if test:
        txt = re.sub(r"<x>.*?</x>", "", txt, flags=re.DOTALL)

    # Replace lines ending with :: with <y/>
    txt = re.sub(r"(.*?)::( *)\n", r"\g<1><y>::\g<2></y>\n", txt, flags=re.DOTALL)
    if test:
        txt = re.sub(r"<y>.*?</y>", "", txt, flags=re.DOTALL)

    # https://docs.asciidoctor.org/asciidoc/latest/sections/user-index/#placement-of-hidden-index-terms
    # Replace index terms with <t/> (concealed) and <t2/> (flow)
    txt = txt.replace("(((", "<t>").replace(")))", "</t>")
    txt = txt.replace("((", "<t2>").replace("))", "</t2>")

    # https://docs.asciidoctor.org/asciidoc/latest/macros/xref-validate/
    # Replace references with <a/> or <d/>
    txt = re.sub(r"<<([^,^>]*?)>>", r"<a>\g<1></a>", txt, flags=re.DOTALL)
    txt = re.sub(r"<<([^,^>]+?),([^,^>]+?)>>", r"<d href='\g<1>'>\g<2></d>", txt, flags=re.DOTALL)

    # Replace inline anchors [[...]] with <r/>
    txt = re.sub(r"\[\[(.*?)\]\]", r"<r>\g<1></r>", txt, flags=re.DOTALL)
    if test:
        txt = re.sub(r"<r>.*?</r>", "", txt, flags=re.DOTALL)

    # Replace section ids [#...] with <c/>
    txt = re.sub(r"\[#(.*?)\]", r"<c>\g<1></c>", txt, flags=re.DOTALL)
    if test:
        txt = re.sub(r"<c>.*?</c>", "", txt, flags=re.DOTALL)

    # Replace underline with <u/>
    txt = re.sub(r"\[underline\]#([^#]*?)#", r"<u>\g<1></u>", txt, flags=re.DOTALL)
    if test:
        # bug fix: this previously stripped <r>...</r> instead of <u>...</u>
        txt = re.sub(r"<u>.*?</u>", "", txt, flags=re.DOTALL)

    # Replace "|" with <z/>
    txt = re.sub(r"\|", "<z/>", txt, flags=re.DOTALL)

    # https://docs.asciidoctor.org/asciidoc/latest/text/bold/#bold-syntax
    # Replace bold words/phrases with <b/>
    txt = re.sub(r"\*(.*?)\*", r"<b>\g<1></b>", txt, flags=re.DOTALL)

    # Replace variables with <v/>
    txt = re.sub(r"{(.*?)}", r"<v>\g<1></v>", txt, flags=re.DOTALL)
    if test:
        txt = re.sub(r"<v>.*?</v>", "", txt, flags=re.DOTALL)

    # Replace newline with <n/>
    txt = txt.replace("\n", "<n/>")

    # Replace some stuff in test mode
    if test:
        for my_char in range(ord('a'), ord('z') + 1):
            c = chr(my_char)
            txt = re.sub("<" + c + ">(.*?)</" + c + ">",
                         "??" + c + r"??\g<1>??" + c + "??", txt, flags=re.DOTALL)

        txt = re.sub(r"<(.*?)>(.*?)</(.*?)>", r"??\g<1>??\g<2>??\g<3>??", txt, flags=re.DOTALL)
        txt = re.sub(r"<(.*?)/>", "", txt, flags=re.DOTALL)
        txt = re.sub(r"( .*?)='(.*?)'", r"\g<1>?'\g<2>'", txt, flags=re.DOTALL)
        txt = re.sub(r"\((.*?)\)", r"\g<1>", txt, flags=re.DOTALL)
        txt = re.sub(r"([^-])-([^-])", r"\g<1>\g<2>", txt, flags=re.DOTALL)
        txt = re.sub(r"([^\+])\+([^\+])", r"\g<1>\g<2>", txt, flags=re.DOTALL)
        txt = re.sub(r"([^=])=([^=])", r"\g<1>\g<2>", txt, flags=re.DOTALL)
        txt = re.sub(r"---\n", "", txt, flags=re.DOTALL)
        txt = txt.replace("<<<", "")
        txt = re.sub(r"\|::", "", txt, flags=re.DOTALL)
        txt = re.sub(r"\[ \]", "", txt, flags=re.DOTALL)
        txt = re.sub(r"\+$", "", txt, flags=re.DOTALL)
        txt = re.sub(r"[\s$]{1}\\[\s$]{1}", "", txt, flags=re.DOTALL)
        txt = re.sub(r" \& ", "", txt, flags=re.DOTALL)
        txt = txt.replace("&nbsp;", "")
        txt = re.sub(r"\\$", "", txt, flags=re.DOTALL)

    return txt

def decode(txt):
    # Processing <e/>
    txt = re.sub(r"<e>([^w]*?)</e>", r"&\g<1>;", txt, flags=re.DOTALL)

    # Processing &lt; and &gt;
    txt = txt.replace("&lt;", "<").replace("&gt;", ">")

    # Processing <x/>
    txt = re.sub(r"<x>(.*?)</x>", r"{ntb}\g<1>{nte}", txt, flags=re.DOTALL)

    # Processing <y/>
    txt = txt.replace("<y>", "").replace("</y>", "")

    # Processing <t/> and <t2/>
    txt = txt.replace("<t>", "(((").replace("</t>", ")))")
    txt = txt.replace("<t2>", "((").replace("</t2>", "))")

    # Processing <a/>
    txt = txt.replace("<a>", "<<").replace("</a>", ">>")

    # Processing <d/>
    txt = re.sub(r"<d href='(.*?)'>(.*?)</d>", r"<<\g<1>,\g<2>>>", txt, flags=re.DOTALL)

    # Processing <r/>
    txt = txt.replace("<r>", "[[").replace("</r>", "]]")

    # Processing <c/>
    txt = txt.replace("<c>", "[#").replace("</c>", "]")

    # Processing <u/>
    txt = txt.replace("<u>", "[underline]#").replace("</u>", "#")

    # Processing <z/>
    txt = txt.replace("<z/>", "|")

    # Processing <b/>
    txt = txt.replace("<b>", "*").replace("</b>", "*")

    # Processing <v/>
    txt = txt.replace("<v>", "{").replace("</v>", "}")

    # Processing <n/>
    txt = txt.replace("<n/>", "\n")

    return txt
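The masking scheme only round-trips if every substitution in encode has an exact inverse in decode. A minimal self-contained check of the {ntb}/{nte} masking illustrates the invariant:

```python
import re

def mask_ntb(txt):
    # {ntb}...{nte} becomes <x>...</x> so DeepL can be told to ignore it
    return re.sub(r"{ntb}(.*?){nte}", r"<x>\g<1></x>", txt, flags=re.DOTALL)

def unmask_ntb(txt):
    return re.sub(r"<x>(.*?)</x>", r"{ntb}\g<1>{nte}", txt, flags=re.DOTALL)

src = "See {ntb}<<connectors>>{nte} and {ntb}DSP{nte} for details."
assert unmask_ntb(mask_ntb(src)) == src  # round trip is lossless
```

The same property should hold for every other tag pair in the script above (a, b, c, d, e, r, t, u, v, y, z).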

Have a nice weekend :-)

jnavila added a commit to jnavila/po4a that referenced this issue Jun 10, 2023
This takes into account:

 * variables in the index entry
 * quoted index entries
jnavila added a commit to jnavila/po4a that referenced this issue Jun 10, 2023
This takes into account:

 * variables in the index entry
 * quoted index entries
jnavila added a commit to jnavila/po4a that referenced this issue Jun 10, 2023
This takes into account:

 * variables in the index entry
 * quoted index entries
@jnavila
Copy link
Collaborator

jnavila commented Jun 11, 2023

Thank you for your code. Very instructive indeed.

  • You expect to have only one nte block per segment. I guess your formalism could be replaced by an asciidoc macro.
  • The line "# Replace lines ending with :: with <y/>" is surprising. In asciidoc, this is a definition list and it should be covered. Do you have a sample of such an occurrence?
  • Concealed index terms will be managed, but flow ones are not. We should at least check that identical flow terms in English are translated identically in the target language. I don't know how to do it.
  • Are section references in the translatable content? I would expect writers to use completely symbolic anchors and not rely on translatable text, which is very brittle.
  • Is replacing pipes with '<z/>' for tables? The asciidoc module has a tablecells option that already splits table cells into their own segments.
  • What is the rationale behind the replacement of carriage returns? If the segment is "no-wrap", it is already stated that carriage returns must be preserved.

One other surprising thing is that you are not completely concealing the refs, variable names, and so on. For instance, for variables, I would have expected some regex along the lines of:

txt = re.sub("{(.*?)}","<v name=\"\g<1>\"/>", txt, 0, re.DOTALL)

so that the name of the variable is hidden from the translator, thus limiting the chances of errors.
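For completeness, a sketch of this fully-concealed variant together with its inverse (hypothetical helper names, not part of the script above):

```python
import re

def conceal_vars(txt):
    # {varname} -> <v name="varname"/>, hiding the name from the translator
    return re.sub(r"{(.*?)}", r'<v name="\g<1>"/>', txt)

def restore_vars(txt):
    # <v name="varname"/> -> {varname}
    return re.sub(r'<v name="(.*?)"/>', r"{\g<1>}", txt)

s = "Press {btn_ok} to continue."
assert restore_vars(conceal_vars(s)) == s  # lossless round trip
```

With a self-closing placeholder, the translator (human or machine) sees no variable name at all, at the cost of a few extra characters per occurrence.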

@suddenfall

Today, I refactored some documents and verified the changes in the translated and generated PDFs. I have also verified your last fix. Overall it was a lot of work. Anyway, everything is fine now! The fix works perfectly.

You expect to have only one nte block per segment. I guess your formalism could be replaced by an asciidoc macro.

I definitely have multiple nte blocks per sentence. So far, it works fine for me. I grab the blocks non-greedily using "?". If I misunderstood you, feel free to ask again.

The line "# Replace lines ending with :: with <y/>" is surprising. In asciidoc, this is a definition list and it should be covered. Do you have a sample of such an occurrence?

As far as I can remember, the "::" is an issue when translating. DeepL fixes this "misspelling" to ":".

Concealed index terms will be managed, but flow ones are not. We should at least check that identical flow terms in English are translated identically in the target language. I don't know how to do it.

I'm not sure what you mean exactly here. Can you possibly give me an example?
BTW, today I rewrote some section titles as flow index terms and it works perfectly, as expected.

Are section references in the translatable content? I would expect writers to use completely symbolic anchors and not rely on translatable text, which is very brittle.

I'm not sure what you mean. Maybe you mean (explicit) section ids (https://docs.asciidoctor.org/asciidoc/latest/sections/custom-ids/)? If so, I solved the problem this way:

[#connectors]
== Connectors
Some text.

== Other title
See the section {ntb}<<connectors>>{nte} for more information. 

Is replacing pipes with '<z/>' for tables? The asciidoc module has a tablecells option that already splits table cells into their own segments.

Thanks for the hint. That's exactly what I want. I fixed my docs today.

What is the rationale behind the replacement of carriage returns? If the segment is "no-wrap", it is already stated that carriage returns must be preserved.

Unfortunately, I can't remember exactly what the problem was. It must have been a problem with DeepL, though. DeepL probably discards "\n" in certain situations, making this workaround necessary to preserve the original formatting.

One other surprising thing is that you are not completely concealing the refs, variable names, and so on. For instance, for variables, I would have expected some regex along the lines of:
txt = re.sub("{(.*?)}","<v name=\"\g<1>\"/>", txt, 0, re.DOTALL)
so that the name of the variable is hidden from the translator, thus limiting the chances of errors.

Yes, that would also work, but it requires more characters. You can set the ignored tags in a request to DeepL; the ignored tags are then not translated. Here is the code snippet showing how I call the DeepL translator:

result = translator.translate_text(
    txt,
    tag_handling="xml",
    preserve_formatting=True,
    outline_detection=False,
    source_lang=src_lang,
    target_lang=dst_lang,
    ignore_tags="c,e,r,t,v,x,y",
    split_sentences="nonewlines",
    non_splitting_tags="a,b,d,u",
)

I use the library https://pypi.org/project/deepl/.

@suddenfall

At this point I would like to thank you for your great support. It was a lot of fun!


jnavila commented Jun 12, 2023

Today, I refactored some documents and verified the changes in the translated and generated PDFs. I have also verified your last fix. Overall it was a lot of work. Anyway, everything is fine now! The fix works perfectly.

Delighted to read this!

You expect to have only one nte block per segment. I guess your formalism could be replaced by an asciidoc macro.

I definitely have multiple nte blocks per sentence. So far, it works fine for me. I grab the blocks non-greedily using "?". If I misunderstood you, feel free to ask again.

My mistake, you're right.

The line "# Replace lines ending with :: with <y/>" is surprising. In asciidoc, this is a definition list and it should be covered. Do you have a sample of such an occurrence?

As far as I can remember, the "::" is an issue when translating. DeepL fixes this "misspelling" to ":".

I understand that you don't want DeepL to modify the string. What I don't understand is how po4a can produce such strings. Definition lists are supposed to be managed by po4a, which removes the '::'. That's why I'd be thankful for an example of such a case.

Concealed index terms will be managed, but flow ones are not. We should at least check that identical flow terms in English are translated identically in the target language. I don't know how to do it.

I'm not sure what you mean exactly here. Can you possibly give me an example? BTW, today I rewrote some section titles as flow index terms and it works perfectly, as expected.

The idea is this: normally, index terms that are the same appear on the same line in the index. For instance:

Mode
  Windows
    34, 56, 78

indicates that these terms are found on pages 34, 56, and 78. But for this to work correctly, the references on these pages need to contain exactly (((Mode,Windows))). Otherwise, the terms won't be sorted under the same entry. This relation must hold homomorphically during the translation, that is, the translated terms must be identical in all their places.

This is managed for concealed terms, because now they are extracted and appear in the po file under a single po entry. Thus, they can only be translated to a unique term that will be copied into all the original locations.

But for flow terms, the issue is that we are not sure that the translator will stick to the same terms where identical terms appeared in the original document. In fact, it could even be impossible: think of German declensions that could change terms that were identical in English. To me, texts that are designed for internationalization should not use flow terms.
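One way such a consistency check could look, sketched over a plain msgid-to-msgstr mapping instead of a real PO parser (the helper and the flow-term regex are my own approximation):

```python
import re
from collections import defaultdict

# ((term)) but not (((term))): flow index terms only.
FLOW = re.compile(r"(?<!\()\(\((?!\()(.+?)\)\)")

def flow_term_translations(catalog):
    """Map each source flow index term to the set of translations it
    received; more than one translation per term signals an index that
    will no longer group correctly in the target language."""
    seen = defaultdict(set)
    for msgid, msgstr in catalog.items():
        for src, dst in zip(FLOW.findall(msgid), FLOW.findall(msgstr)):
            seen[src].add(dst)
    return {term: dsts for term, dsts in seen.items() if len(dsts) > 1}

catalog = {
    "A ((mode)) of operation.": "Ein ((Modus)) des Betriebs.",
    "Choose the ((mode)).": "Wählen Sie die ((Betriebsart)).",
}

# Reports terms that received more than one translation.
print(flow_term_translations(catalog))
```

Such a checker can only warn, not fix: as noted above, declensions may make a single consistent translation impossible.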

Are section references in the translatable content? I would expect writers to use completely symbolic anchors and not rely on translatable text, which is very brittle.

I'm not sure what you mean. Maybe you mean (explicit) section ids (https://docs.asciidoctor.org/asciidoc/latest/sections/custom-ids/)? If so, I solved the problem this way:

[#connectors]
== Connectors
Some text.

== Other title
See the section {ntb}<<connectors>>{nte} for more information. 

OK, that works as I thought. The anchors are kept away from translators. Still, you are using the regex "txt = re.sub("\[\[(.*?)\]\]","<r>\g<1></r>", txt, 0, re.DOTALL)" which hides them. I was expecting po4a to do it for you.

What is the rationale behind the replacement of carriage returns? If the segment is "no-wrap", it is already stated that carriage returns must be preserved.

Unfortunately, I can't remember exactly what the problem was. It must have been a problem with DeepL, though. DeepL probably discards "\n" in certain situations, making this workaround necessary to preserve the original formatting.

I'm still surprised that you can make Deepl handle carriage returns just like another grammatical block in the sentence.

One other surprising thing is that you are not completely concealing the refs, variable names, and so on. For instance, for variables, I would have expected some regex along the lines of:
txt = re.sub("{(.*?)}","<v name=\"\g<1>\"/>", txt, 0, re.DOTALL)
so that the name of the variable is hidden from the translator, thus limiting the chances of errors.

Yes, that would also work, but it requires more characters. You can set the ignored tags in a request to DeepL; the ignored tags are then not translated. Here is the code snippet showing how I call the DeepL translator:

result = translator.translate_text(
    txt,
    tag_handling="xml",
    preserve_formatting=True,
    outline_detection=False,
    source_lang=src_lang,
    target_lang=dst_lang,
    ignore_tags="c,e,r,t,v,x,y",
    split_sentences="nonewlines",
    non_splitting_tags="a,b,d,u",
)

I use the library https://pypi.org/project/deepl/.

Ah, that makes sense; I didn't know DeepL could use XML nodes as placeholders.

In the end, I think a big part of your script could be made into a special mode of po4a, if you allow. I'd like to experiment with how other tools may handle these transformations.

Thanks for your support, it makes po4a more usable.

@suddenfall

I understand that you don't want DeepL to modify the string. What I don't understand is how po4a can produce such strings. Definition lists are supposed to be managed by po4a, which removes the '::'. That's why I'd be thankful for an example of such a case.

I have found two examples:

|::
         A vertical bar in angle or square brackets indicates that the user can choose between alternative inputs or values.

and

{ntb}DSP{nte}::
        Is the abbreviation for {ntb}<<dsp,**[underline]##D##**igital **[underline]##S##**ound **[underline]##P##**rocessor>>{nte}.

The idea is this: normally, index terms that are the same appear on the same line in the index. For instance:

Mode
  Windows
    34, 56, 78

indicates that these terms are found on pages 34, 56, and 78. But for this to work correctly, the references on these pages need to contain exactly (((Mode,Windows))). Otherwise, the terms won't be sorted under the same entry. This relation must hold homomorphically during the translation, that is, the translated terms must be identical in all their places.

This is managed for concealed terms, because now they are extracted and appear in the po file under a single po entry. Thus, they can only be translated to a unique term that will be copied into all the original locations.

But for flow terms, the issue is that we are not sure that the translator will stick to the same terms where identical terms appeared in the original document. In fact, it could even be impossible: think of German declensions that could change terms that were identical in English. To me, texts that are designed for internationalization should not use flow terms.

I agree with you. Concealed index terms should always be preferred over flow index terms. Bad luck that I have also used flow terms. Time for another refactoring (sometime).

OK, that works as I thought. The anchors are kept away from translators. Still, you are using the regex "txt = re.sub("\[\[(.*?)\]\]","<r>\g<1></r>", txt, 0, re.DOTALL)" which hides them. I was expecting po4a to do it for you.

I think inline anchors are comparable to flow index terms: both are used somewhere inside a sentence. The difference is that flow index terms should be translated (and then it's better to use concealed index terms, as you mentioned), while inline anchors shouldn't be translated. As far as I understand you, you want to hide the inline anchor from the translator, because they don't have to translate the word. That's a nice idea, but the translator has to decide where the inline anchor should be placed inside the translated text. I don't think you can solve this with po4a. Moving the anchors to the beginning of a segment is, in the case of my documents, not a good idea, because the reference would no longer make sense. Therefore, I would keep it as it is.

I'm still surprised that you can make Deepl handle carriage returns just like another grammatical block in the sentence.

Hm, you are right. Maybe I should add "n" to the ignore tags. But on the other hand, it currently works without trouble and I don't want to change a running system ;-).

In the end, I think a big part of your script could be made into a special mode of po4a, if you allow. I'd like to experiment with how other tools may handle these transformations.

Yes, of course. I would be happy if others could benefit from it too. If you like, I can upload the whole scripts as attachments. Then you'd see my full workflow, which may give you some inspiration.

Thanks for your support, it makes po4a more usable.

You're welcome.


jnavila commented Jun 13, 2023

I understand that you don't want DeepL to modify the string. What I don't understand is how po4a can produce such strings. Definition lists are supposed to be managed by po4a, which removes the '::'. That's why I'd be thankful for an example of such a case.

I have found two examples:

|::
         A vertical bar in angle or square brackets indicates that the user can choose between alternative inputs or values.

and

{ntb}DSP{nte}::
        Is the abbreviation for {ntb}<<dsp,**[underline]##D##**igital **[underline]##S##**ound **[underline]##P##**rocessor>>{nte}.

And you found two bugs! I'll generate a fix shortly.

OK, that works as I thought. The anchors are kept away from translators. Still, you are using the regex "txt = re.sub("\[\[(.*?)\]\]","<r>\g<1></r>", txt, 0, re.DOTALL)" which hides them. I was expecting po4a to do it for you.

I think inline anchors are comparable to flow index terms: both are used somewhere inside a sentence. The difference is that flow index terms should be translated (and then it's better to use concealed index terms, as you mentioned), while inline anchors shouldn't be translated. As far as I understand you, you want to hide the inline anchor from the translator, because they don't have to translate the word. That's a nice idea, but the translator has to decide where the inline anchor should be placed inside the translated text. I don't think you can solve this with po4a. Moving the anchors to the beginning of a segment is, in the case of my documents, not a good idea, because the reference would no longer make sense. Therefore, I would keep it as it is.

OK. I was about to propose the algorithm that you just rejected. I wonder whether an ID really creates an anchor at the place where it appears in the text. Thinking about it, I would expect the ID to be attached to the first enclosing block or the adjacent inline quote, because it has to be attached to an HTML node.

@suddenfall

I have just tested it. In both PDF and HTML, the anchor is at the exact position in the text where it was written in AsciiDoc. In HTML, for example, an anchor element is inserted at this point.


jnavila commented Jun 17, 2023

I think that all the fixable bugs were fixed in the PR. Thank you @suddenfall again for taking the time to describe the bugs. The xml conversion is on the TODO list.

@suddenfall

<badge style="platin">Tested & Approved</badge>

mquinson pushed a commit that referenced this issue Jun 19, 2023
This takes into account:

 * variables in the index entry
 * quoted index entries