
Asciidoc / Keeping original formating #416

Closed

suddenfall opened this issue May 22, 2023 · 15 comments

suddenfall commented May 22, 2023

I have a feature request for the asciidoc module that may overlap with:
#291

I use Asciidoctor, and for me it is also a problem that the content from my master documents is wrapped in the PO files and thus reformatted. I propose introducing an option in the Asciidoc module so that the original formatting is preserved in the PO files.

Motivation

Currently, I use the DeepL API. Reformatting the content makes it harder to use the API optimally. In addition, reformatting makes it necessary to check, after rendering the translated documents to PDF, whether the original formatting from the master documents has been retained.

I have structured my master documents according to the principle: One line contains exactly one sentence.

That would fit perfectly with the DeepL API, where each text line is interpreted as a sentence. Unfortunately, I have not found a way with po4a or the Asciidoc module to prevent the reformatting of the content.

Therefore I have to configure the DeepL API to recognize sentences by punctuation marks. This works, but leads to problems with index terms, for example. If I write this content:

== Maximus Primus
(((Primus Magnificus)))
Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. 
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. 
Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. 
Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

It results in the following message id:

"(((Primus Magnificus)))  Lorem ipsum dolor sit amet, consectetur adipisici "
"elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.  Ut enim "
"ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid "
"ex ea commodi consequat.  Quis aute iure reprehenderit in voluptate velit "
"esse cillum dolore eu fugiat nulla pariatur.  Excepteur sint obcaecat "
"cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id "
"est laborum."

The index term (((Primus Magnificus))) is prepended to the content on the same line. This can lead to upper and lower case being mixed up in the translation, and even to the meaning of the entire sentence being distorted.

Plea for the feature

Keeping the original formatting from the master documents simplifies the whole process: you only have to worry about the translation, not also about the formatting in the output, knowing that it will not be changed.

For AI translation tools such as DeepL, a maximum line length is irrelevant. On the contrary, limiting the line length is even counterproductive.

To ensure backwards compatibility, I suggest that this feature can be activated via an option.
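To make the difference concrete, here is a small Python sketch of the two serializations. `wrap_msgid` is a hypothetical stand-in for the gettext-style 77-column rewrapping, not po4a's actual writer:

```python
import textwrap

def wrap_msgid(paragraph, width=77):
    """Rewrap a paragraph the way gettext-style tools usually serialize
    msgids: collapse the source line breaks, then refill at a fixed width."""
    joined = " ".join(line.strip() for line in paragraph.splitlines())
    return textwrap.wrap(joined, width=width)

paragraph = (
    "(((Primus Magnificus)))\n"
    "Lorem ipsum dolor sit amet, consectetur adipisici elit, "
    "sed eiusmod tempor incidunt ut labore et dolore magna aliqua."
)

# Wrapped form: the index term and the first sentence share a line.
for line in wrap_msgid(paragraph):
    print('"%s"' % line)

# Preserved form (what this issue asks for): one source line per PO line.
for line in paragraph.splitlines():
    print('"%s"' % line)
```

In the wrapped form the index term is glued to the first sentence; in the preserved form it stays on its own line, exactly as in the master document.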


jnavila commented May 27, 2023

Hello

I think there are several points mixed together in this issue. Let me list them as I see them:

  • You are complaining that the default segmentation in the asciidoc module is the paragraph instead of the sentence. The reason is not clear, because it is mixed with the following points. Does DeepL really fail when translating a series of sentences instead of a single sentence at a time? My experience with DeepL in Weblate is rather the opposite: the paragraph level provides better context for translating. I'd like a clarification on whether it really makes sense to add an option for sentence-level segmentation.
  • You say that the paragraph level of segmentation makes you deal with the formatting of the output. The whole point of the asciidoc module, compared to the basic text module, is to know enough of asciidoc's block formatting syntax to extract it and spare translators from having to deal with it. The output of translated text is in a "normalized" form, safe for consumption by asciidoc formatters, and should not need to be reviewed by translators; if it does not compile to a correct PDF after a faithful translation of all the segments, then that is a bug in po4a.
  • You are complaining that index terms are spread across segments. I kind of agree that they should be a segment of their own and not pollute other segments, but because they can appear nearly anywhere, managing their position in translated content is a hard problem. Your use case is special with regard to the general rule. Or maybe you can propose a generic simplification on this matter.

@suddenfall

I have no problem with segmentation at the paragraph level. That's fine, that's exactly how I need it, and how it should be. I only want the original text layout to be preserved inside the message id, like this:

"(((Primus Magnificus)))"
"Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. "
"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. "
"Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. "
"Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

and not wrapped like this:

"(((Primus Magnificus)))  Lorem ipsum dolor sit amet, consectetur adipisici "
"elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua.  Ut enim "
"ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid "
"ex ea commodi consequat.  Quis aute iure reprehenderit in voluptate velit "
"esse cillum dolore eu fugiat nulla pariatur.  Excepteur sint obcaecat "
"cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id "
"est laborum."

I don't want anything more!

In fact, index terms are very difficult to work with. Personally, I only use them at the beginning or end of a sentence and put them on a separate line. Now I have to back up a bit and explain how I use DeepL. When I let DeepL translate a message id, I convert it to XML. This means that I replace, for example, formatting such as bold or italic, as well as links and references, with an XML element. I also wrap index terms in an XML element. In the REST request, I can configure which XML elements should be translated and which should not.

At the moment, I'm translating index terms separately and marking the XML element as "don't translate". However, this only works to a limited extent, since DeepL interprets "don't translate" XML elements as part of a sentence, and thus the sentence may be distorted after translation.

The DeepL REST API offers the parameter "split_sentences" for XML content. Currently, I have to use the value "nonewlines" (= split on punctuation only, ignoring newlines), but I would prefer to use the value "1" (= split on punctuation and on newlines); see also https://www.deepl.com/docs-api/translate-text/translate-text/.
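The effect of the two values can be mimicked locally with a rough sketch. This is my own approximation of the behavior described in the DeepL docs, not DeepL's actual segmenter:

```python
import re

def split_sentences(text, mode):
    """Approximate DeepL's split_sentences parameter:
    "1"          -> split on punctuation and on newlines
    "nonewlines" -> split on punctuation only, newlines are ignored"""
    if mode == "nonewlines":
        text = text.replace("\n", " ")
        return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if mode == "1":
        parts = []
        for line in text.splitlines():
            parts.extend(s for s in re.split(r"(?<=[.!?])\s+", line) if s)
        return parts
    raise ValueError(mode)

text = "(((Primus Magnificus)))\nLorem ipsum dolor sit amet."

# "nonewlines": the index term is glued to the following sentence.
print(split_sentences(text, "nonewlines"))

# "1": the index term stays a segment of its own.
print(split_sentences(text, "1"))
```

With line-preserving PO output, mode "1" would keep the index term isolated from the translatable sentence.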

If you preserve the formatting inside the message id, I can handle the index term as a separate sentence, and the following sentence starting with "Lorem ipsum dolor" will be translated correctly without being mixed up with the index term. That's all I want.

Until now, I have had no experience with Weblate, and I don't know whether Weblate has options similar to the DeepL API. Therefore, I cannot give you a more generic rule yet. If you know a webpage where these options are described for Weblate, I'd be interested, and I will then think about a generic rule.

I hope you can understand my wish better. If something is still unclear, just ask again.

Side note: in fact, DeepL is not very good at translating single sentences. Therefore, I always prepend a sentence to set the context. That improves the translation a lot, but also consumes my quota of translatable characters. I have proposed that DeepL add a context parameter to their API for passing context tags or references.


jnavila commented May 31, 2023

The thing is that the two versions of the po file that you presented are totally identical from the semantic point of view in the po file format. From what I understand, you are processing the po file with a dedicated parser that adds requirements to the po file specification.

The formatting of the po file is not managed by the source format handler (such as the asciidoc parser): all the format handlers call methods of the po file object, and it is up to the po file object to write the output file according to its own rules. So, to circumvent this, we would have to add a way for the format handler to impose some writing rules on the po file object; not a simple change, I presume.

I still think that your issue can be solved another way, because you are imposing a solution on a problem that does not require it. I mean that your main problem is the presence of the index term in the paragraph's segment. If I were to try something concerning your issue, I would just remove all the index entries from all the segments and put them in their own segments. This would have two positive outcomes:

  • the paragraph would be free of disturbing chunks of words that make DeepL misinterpret it.
  • the index terms would be correctly tagged in the po file, grouped together, and translated at once when they appear in different places in the document. They need to be translated anyway, but they need to be kept identical. They should even be split into their subterms so that the hierarchy of terms is preserved.

What do you think?
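The extraction idea can be sketched in a few lines (a hypothetical helper, not the patch that was later merged in po4a):

```python
import re

# Concealed index terms: (((...))), optionally followed by their own newline.
CONCEALED = re.compile(r"\(\(\((.+?)\)\)\)\n?")

def extract_index_terms(paragraph):
    """Pull concealed index terms out of a paragraph and return
    (clean_paragraph, terms) so each term can become its own segment."""
    terms = CONCEALED.findall(paragraph)
    clean = CONCEALED.sub("", paragraph)
    return clean, terms

clean, terms = extract_index_terms(
    "(((Primus Magnificus)))\nLorem ipsum dolor sit amet."
)
print(terms)   # the index terms, ready for their own PO entry
print(clean)   # the paragraph, free of index markup
```

Identical terms extracted this way would deduplicate into a single PO entry, which also guarantees they are translated consistently.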

@suddenfall

The thing is that the two versions of the po file that you presented are totally identical from the semantic point of view in the po file format. From what I understand, you are processing the po file with a dedicated parser that adds requirements to the po file specification.

Yes, that's right, but I wouldn't call a set of 11 regular expressions a parser. But that's not relevant ;-)

I like your approach of solving the problem in the asciidoc module, since the problem only affects asciidoc. I see the same advantages in the solution as you described, and I would prefer your solution.

Furthermore, you could then also think about, for example, mapping each table cell as a separate message to be translated, according to the same scheme.

After a bit of thinking about the index terms, however, I still see two small detail problems:

  • How should index terms be handled that are somewhere in the middle of a sentence, not at the beginning or end? I suspect this problem is rather academic. And even if the index term is not at the beginning of the sentence, I haven't (yet) thought of a use case where it would be a problem to move the index term to the beginning of the sentence after translation.

  • The next academic detail problem could be if several sentences, each with one index term, are put together into one message to be translated. Is it then better to map the individual sentences as separate messages in order to be able to correctly reassign the index terms later?

I hope my thoughts help you, and I am looking forward to your feedback.

jnavila added a commit to jnavila/po4a that referenced this issue Jun 7, 2023
This patch addresses the issue of keeping transparent index entries in
paragraphs. It does not address "visible" index entries, which are
formatted with ((...)).

This addresses GH mquinson#416.

jnavila commented Jun 7, 2023

@suddenfall I just pushed a tentative fix for your issue. Can you test it, give feedback on the feature, and point out the failing cases? I'm pretty sure I haven't covered everything.

BTW, I'd also be interested in your regexps for transforming inline formats if you can share them.

mquinson pushed a commit that referenced this issue Jun 7, 2023
This patch addresses the issue of keeping transparent index entries in
paragraphs. It does not address "visible" index entries, which are
formatted with ((...)).

This addresses GH #416.
@suddenfall

Hi,

I have tested your fix with various patterns. It works very well. The observed behavior is that index terms are pushed to the beginning of a text segment. I like it.

I have also tested my asciidoc project. Unfortunately, I ran into an issue. I wrote the following in the document:

(((Bedienelement, {ntb}Input Capture Source{nte})))

{ntb} and {nte} are empty variables I use to mark text that shouldn't be translated (ntb => no translation begin; nte => no translation end). I got the following error message:

Unescaped left brace in regex is passed through in regex; marked by <-- HERE in m/\(\(\(Bedienelement, { <-- HERE ntb}Onboard DAC{nte}\)\)\)\n?/ at /home/ds/perl5/lib/perl5/Locale/Po4a/AsciiDoc.pm line 1186.

I have also re-read https://docs.asciidoctor.org/asciidoc/latest/sections/user-index/ and learned some new details. I didn't know the difference between flow and concealed index terms. Both types are handled correctly.

But I have found 2 special cases that are not considered yet:

  1. indexterm:[...] is an alternative syntax to (((...))) and is not recognized during parsing.
  2. (((knight, "Arthur, King"))) should result in knight and Arthur, King, but currently it results in knight, "Arthur and King".
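The second case looks like a classic quote-aware split. A sketch of how it could be handled, using the standard library's CSV parser (not the fix that eventually landed in po4a):

```python
import csv
import io

def split_subterms(term_body):
    """Split the body of (((...))) into subterms, honoring quoted
    subterms such as "Arthur, King" which contain commas."""
    return next(csv.reader(io.StringIO(term_body), skipinitialspace=True))

print(split_subterms('knight, "Arthur, King"'))
# -> ['knight', 'Arthur, King']
```

A naive split on "," would instead yield the wrong 'knight', '"Arthur', and 'King"' pieces described above.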

BTW, I'd also be interested in your regexps for transforming inline formats if you can share them.

No problem. It's my first Python code, because bash doesn't support po files ;-). The code meets only my private requirements, is written quick and dirty, and will probably not suit others.

import re
import sys

def encode(txt, test=False):
    # Replace bare < and > with entities
    txt = re.sub(r"([^<])<([^<])", r"\g<1>&lt;\g<2>", txt, flags=re.DOTALL)
    txt = re.sub(r"([^>])>([^>])", r"\g<1>&gt;\g<2>", txt, flags=re.DOTALL)
    txt = txt.replace("<<<", "&lt;&lt;&lt;")

    # Replace xml entities with <e/>
    txt = re.sub(r"&([^ ^;]*?);", r"<e>\g<1></e>", txt, flags=re.DOTALL)

    # Replace "no translate blocks" with <x/>
    txt = re.sub(r"{ntb}(.*?){nte}", r"<x>\g<1></x>", txt, flags=re.DOTALL)
    if "{ntb}" in txt:
        print("=== encode error {ntb}:", txt)
        sys.exit(1)
    if "{nte}" in txt:
        print("=== encode error {nte}:", txt)
        sys.exit(1)
    if test:
        txt = re.sub(r"<x>.*?</x>", "", txt, flags=re.DOTALL)

    # Replace lines ending with :: with <y/>
    txt = re.sub(r"(.*?)::( *)\n", r"\g<1><y>::\g<2></y>\n", txt, flags=re.DOTALL)
    if test:
        txt = re.sub(r"<y>.*?</y>", "", txt, flags=re.DOTALL)

    # https://docs.asciidoctor.org/asciidoc/latest/sections/user-index/#placement-of-hidden-index-terms
    # Replace index terms with <t/> (concealed) and <t2/> (flow)
    txt = txt.replace("(((", "<t>").replace(")))", "</t>")
    txt = txt.replace("((", "<t2>").replace("))", "</t2>")

    # https://docs.asciidoctor.org/asciidoc/latest/macros/xref-validate/
    # Replace references with <a/> or <d/>
    txt = re.sub(r"<<([^,^>]*?)>>", r"<a>\g<1></a>", txt, flags=re.DOTALL)
    txt = re.sub(r"<<([^,^>]+?),([^,^>]+?)>>", r"<d href='\g<1>'>\g<2></d>", txt, flags=re.DOTALL)

    # Replace inline anchors [[...]] with <r/>
    txt = re.sub(r"\[\[(.*?)\]\]", r"<r>\g<1></r>", txt, flags=re.DOTALL)
    if test:
        txt = re.sub(r"<r>.*?</r>", "", txt, flags=re.DOTALL)

    # Replace section ids [#...] with <c/>
    txt = re.sub(r"\[#(.*?)\]", r"<c>\g<1></c>", txt, flags=re.DOTALL)
    if test:
        txt = re.sub(r"<c>.*?</c>", "", txt, flags=re.DOTALL)

    # Replace underline with <u/>
    txt = re.sub(r"\[underline\]#([^#]*?)#", r"<u>\g<1></u>", txt, flags=re.DOTALL)
    if test:
        # bug fix: this previously stripped <r>...</r> instead of <u>...</u>
        txt = re.sub(r"<u>.*?</u>", "", txt, flags=re.DOTALL)

    # Replace "|" with <z/>
    txt = re.sub(r"\|", "<z/>", txt, flags=re.DOTALL)

    # https://docs.asciidoctor.org/asciidoc/latest/text/bold/#bold-syntax
    # Replace bold words/phrases with <b/>
    txt = re.sub(r"\*(.*?)\*", r"<b>\g<1></b>", txt, flags=re.DOTALL)

    # Replace variables with <v/>
    txt = re.sub(r"{(.*?)}", r"<v>\g<1></v>", txt, flags=re.DOTALL)
    if test:
        txt = re.sub(r"<v>.*?</v>", "", txt, flags=re.DOTALL)

    # Replace newline with <n/>
    txt = txt.replace("\n", "<n/>")

    # Replace some stuff in test mode
    if test:
        for my_char in range(ord('a'), ord('z') + 1):
            c = chr(my_char)
            txt = re.sub("<" + c + ">(.*?)</" + c + ">",
                         "??" + c + r"??\g<1>??" + c + "??", txt, flags=re.DOTALL)

        txt = re.sub(r"<(.*?)>(.*?)</(.*?)>", r"??\g<1>??\g<2>??\g<3>??", txt, flags=re.DOTALL)
        txt = re.sub(r"<(.*?)/>", "", txt, flags=re.DOTALL)
        txt = re.sub(r"( .*?)='(.*?)'", r"\g<1>?'\g<2>'", txt, flags=re.DOTALL)
        txt = re.sub(r"\((.*?)\)", r"\g<1>", txt, flags=re.DOTALL)
        txt = re.sub(r"([^-])-([^-])", r"\g<1>\g<2>", txt, flags=re.DOTALL)
        txt = re.sub(r"([^\+])\+([^\+])", r"\g<1>\g<2>", txt, flags=re.DOTALL)
        txt = re.sub(r"([^=])=([^=])", r"\g<1>\g<2>", txt, flags=re.DOTALL)
        txt = re.sub(r"---\n", "", txt, flags=re.DOTALL)
        txt = txt.replace("<<<", "")
        txt = re.sub(r"\|::", "", txt, flags=re.DOTALL)
        txt = re.sub(r"\[ \]", "", txt, flags=re.DOTALL)
        txt = re.sub(r"\+$", "", txt, flags=re.DOTALL)
        txt = re.sub(r"[\s$]{1}\\[\s$]{1}", "", txt, flags=re.DOTALL)
        txt = re.sub(r" \& ", "", txt, flags=re.DOTALL)
        txt = txt.replace("&nbsp;", "")
        txt = re.sub(r"\\$", "", txt, flags=re.DOTALL)

    return txt

def decode(txt):
    # Processing <e/>
    txt = re.sub(r"<e>([^w]*?)</e>", r"&\g<1>;", txt, flags=re.DOTALL)

    # Processing &lt; and &gt;
    txt = txt.replace("&lt;", "<").replace("&gt;", ">")

    # Processing <x/>
    txt = re.sub(r"<x>(.*?)</x>", r"{ntb}\g<1>{nte}", txt, flags=re.DOTALL)

    # Processing <y/>
    txt = txt.replace("<y>", "").replace("</y>", "")

    # Processing <t/> and <t2/>
    txt = txt.replace("<t>", "(((").replace("</t>", ")))")
    txt = txt.replace("<t2>", "((").replace("</t2>", "))")

    # Processing <a/>
    txt = txt.replace("<a>", "<<").replace("</a>", ">>")

    # Processing <d/>
    txt = re.sub(r"<d href='(.*?)'>(.*?)</d>", r"<<\g<1>,\g<2>>>", txt, flags=re.DOTALL)

    # Processing <r/>
    txt = txt.replace("<r>", "[[").replace("</r>", "]]")

    # Processing <c/>
    txt = txt.replace("<c>", "[#").replace("</c>", "]")

    # Processing <u/>
    txt = txt.replace("<u>", "[underline]#").replace("</u>", "#")

    # Processing <z/>
    txt = txt.replace("<z/>", "|")

    # Processing <b/>
    txt = txt.replace("<b>", "*").replace("</b>", "*")

    # Processing <v/>
    txt = txt.replace("<v>", "{").replace("</v>", "}")

    # Processing <n/>
    txt = txt.replace("<n/>", "\n")

    return txt
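The masking scheme only round-trips if every substitution in encode has an exact inverse in decode. A minimal self-contained check of the {ntb}/{nte} masking illustrates the invariant:

```python
import re

def mask_ntb(txt):
    # {ntb}...{nte} becomes <x>...</x> so DeepL can be told to ignore it
    return re.sub(r"{ntb}(.*?){nte}", r"<x>\g<1></x>", txt, flags=re.DOTALL)

def unmask_ntb(txt):
    return re.sub(r"<x>(.*?)</x>", r"{ntb}\g<1>{nte}", txt, flags=re.DOTALL)

src = "See {ntb}<<connectors>>{nte} and {ntb}DSP{nte} for details."
assert unmask_ntb(mask_ntb(src)) == src  # round trip is lossless
```

The same property should hold for every other tag pair in the script above (a, b, c, d, e, r, t, u, v, y, z).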

Have a nice weekend :-)

jnavila added a commit to jnavila/po4a that referenced this issue Jun 10, 2023
This takes into account:

 * variables in the index entry
 * quoted index entries
jnavila added a commit to jnavila/po4a that referenced this issue Jun 10, 2023
This takes into account:

 * variables in the index entry
 * quoted index entries
jnavila added a commit to jnavila/po4a that referenced this issue Jun 10, 2023
This takes into account:

 * variables in the index entry
 * quoted index entries
@jnavila
Copy link
Collaborator

jnavila commented Jun 11, 2023

Thank you for your code. Very instructive indeed.

  • You expect to have only one nte block per segment. I guess your formalism could be replaced by an asciidoc macro.
  • The line "# Replace lines ending with :: with <y/>" is surprising. In asciidoc, this is a definition list and it should be covered. Do you have a sample of such an occurrence?
  • Concealed index terms will be managed, but flow ones are not. We should at least check that identical flow terms in English are translated identically in the target language. I don't know how to do it.
  • Are section references in the translatable content? I would expect writers to use completely symbolic anchors and not rely on translatable text, which is very brittle.
  • Is replacing pipes with '<z/>' for tables? The asciidoc module has a tablecells option that already splits table cells into their own segments.
  • What is the rationale behind the replacement of carriage returns? If the segment is "no-wrap", it is already stated that carriage returns must be preserved.

One other surprising thing is that you are not completely concealing the refs, variable names, and so on. For instance, for variables, I would have expected some regex along the lines of:

txt = re.sub("{(.*?)}","<v name=\"\g<1>\"/>", txt, 0, re.DOTALL)

so that the name of the variable is hidden from the translator, thus limiting the chances of errors.
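For completeness, a sketch of this fully-concealed variant together with its inverse (hypothetical helper names, not part of the script above):

```python
import re

def conceal_vars(txt):
    # {varname} -> <v name="varname"/>, hiding the name from the translator
    return re.sub(r"{(.*?)}", r'<v name="\g<1>"/>', txt)

def restore_vars(txt):
    # <v name="varname"/> -> {varname}
    return re.sub(r'<v name="(.*?)"/>', r"{\g<1>}", txt)

s = "Press {btn_ok} to continue."
assert restore_vars(conceal_vars(s)) == s  # lossless round trip
```

With a self-closing placeholder, the translator (human or machine) sees no variable name at all, at the cost of a few extra characters per occurrence.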

@suddenfall

Today, I refactored some documents and verified the changes in the translated and generated PDFs. I have also verified your last fix. Overall it was a lot of work. Anyway, everything is fine now! The fix works perfectly.

You expect to have only one nte block per segment. I guess your formalism could be replaced by an asciidoc macro.

I definitely have multiple nte blocks per sentence. So far, it works fine for me. I grab the blocks non-greedily using "?". If I misunderstood you, feel free to ask again.

The line "# Replace lines ending with :: with <y/>" is surprising. In asciidoc, this is a definition list and it should be covered. Do you have a sample of such an occurrence?

As far as I can remember, the "::" is an issue when translating. DeepL fixes this "misspelling" to ":".

Concealed index terms will be managed, but flow ones are not. We should at least check that identical flow terms in English are translated identically in the target language. I don't know how to do it.

I'm not sure what you mean exactly here. Can you possibly give me an example?
BTW, today I rewrote some section titles as flow index terms and it works perfectly, as expected.

Are section references in the translatable content? I would expect writers to use completely symbolic anchors and not rely on translatable text, which is very brittle.

I'm not sure what you mean. Maybe you mean (explicit) section ids (https://docs.asciidoctor.org/asciidoc/latest/sections/custom-ids/)? If so, I solved the problem this way:

[#connectors]
== Connectors
Some text.

== Other title
See the section {ntb}<<connectors>>{nte} for more information. 

Is replacing pipes with '<z/>' for tables? The asciidoc module has a tablecells option that already splits table cells into their own segments.

Thanks for the hint. That's exactly what I want. I fixed my docs today.

What is the rationale behind the replacement of carriage returns? If the segment is "no-wrap", it is already stated that carriage returns must be preserved.

Unfortunately, I can't remember exactly what the problem was. It must have been a problem with DeepL, though. DeepL probably discards "\n" in certain situations, making this workaround necessary to preserve the original formatting.

One other surprising thing is that you are not completely concealing the refs, variable names, and so on. For instance, for variables, I would have expected some regex along the lines of:
txt = re.sub("{(.*?)}","<v name=\"\g<1>\"/>", txt, 0, re.DOTALL)
so that the name of the variable is hidden from the translator, thus limiting the chances of errors.

Yes, that would also work, but it requires more characters. You can set the ignored tags in a request to DeepL; the ignored tags are then not translated. Here is the code snippet showing how I call the DeepL translator:

result = translator.translate_text(
    txt,
    tag_handling="xml",
    preserve_formatting=True,
    outline_detection=False,
    source_lang=src_lang,
    target_lang=dst_lang,
    ignore_tags="c,e,r,t,v,x,y",
    split_sentences="nonewlines",
    non_splitting_tags="a,b,d,u",
)

I use the library https://pypi.org/project/deepl/.

@suddenfall

At this point I would like to thank you for your great support. It was a lot of fun!


jnavila commented Jun 12, 2023

Today, I refactored some documents and verified the changes in the translated and generated PDFs. I have also verified your last fix. Overall it was a lot of work. Anyway, everything is fine now! The fix works perfectly.

Delighted to read this!

You expect to have only one nte block per segment. I guess your formalism could be replaced by an asciidoc macro.

I definitely have multiple nte blocks per sentence. So far, it works fine for me. I grab the blocks non-greedily using "?". If I misunderstood you, feel free to ask again.

My mistake, you're right.

The line "# Replace lines ending with :: with <y/>" is surprising. In asciidoc, this is a definition list and it should be covered. Do you have a sample of such an occurrence?

As far as I can remember, the "::" is an issue when translating. DeepL fixes this "misspelling" to ":".

I understand that you don't want DeepL to modify the string. What I don't understand is how po4a can produce such strings. Definition lists are supposed to be managed by po4a, which removes the '::'. That's why I'd be thankful for an example of such a case.

Concealed index terms will be managed, but flow ones are not. We should at least check that identical flow terms in English are translated identically in the target language. I don't know how to do it.

I'm not sure what you mean exactly here. Can you possibly give me an example? BTW, today I rewrote some section titles as flow index terms and it works perfectly, as expected.

The idea is this: normally, index terms that are the same appear on the same line in the index. For instance:

Mode
  Windows
    34, 56, 78

indicates that these terms are found on pages 34, 56, and 78. But for this to work correctly, the references on these pages need to contain exactly (((Mode,Windows))). Otherwise, the terms won't be sorted under the same entry. This relation must hold homomorphically during the translation, that is, the translated terms must be identical in all their places.

This is managed for concealed terms, because now they are extracted and appear in the po file under a single po entry. Thus, they can only be translated to a unique term that will be copied into all the original locations.

But for flow terms, the issue is that we are not sure that the translator will stick to the same terms where identical terms appeared in the original document. In fact, it could even be impossible: think of German declensions that could change terms that were identical in English. To me, texts that are designed for internationalization should not use flow terms.
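One way such a consistency check could look, sketched over a plain msgid-to-msgstr mapping instead of a real PO parser (the helper and the flow-term regex are my own approximation):

```python
import re
from collections import defaultdict

# ((term)) but not (((term))): flow index terms only.
FLOW = re.compile(r"(?<!\()\(\((?!\()(.+?)\)\)")

def flow_term_translations(catalog):
    """Map each source flow index term to the set of translations it
    received; more than one translation per term signals an index that
    will no longer group correctly in the target language."""
    seen = defaultdict(set)
    for msgid, msgstr in catalog.items():
        for src, dst in zip(FLOW.findall(msgid), FLOW.findall(msgstr)):
            seen[src].add(dst)
    return {term: dsts for term, dsts in seen.items() if len(dsts) > 1}

catalog = {
    "A ((mode)) of operation.": "Ein ((Modus)) des Betriebs.",
    "Choose the ((mode)).": "Wählen Sie die ((Betriebsart)).",
}

# Reports terms that received more than one translation.
print(flow_term_translations(catalog))
```

Such a checker can only warn, not fix: as noted above, declensions may make a single consistent translation impossible.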

Are section references in the translatable content? I would expect writers to use completely symbolic anchors and not rely on translatable text, which is very brittle.

I'm not sure what you mean. Maybe you mean (explicit) section ids (https://docs.asciidoctor.org/asciidoc/latest/sections/custom-ids/)? If so, I solved the problem this way:

[#connectors]
== Connectors
Some text.

== Other title
See the section {ntb}<<connectors>>{nte} for more information. 

OK, that works as I thought. The anchors are kept away from translators. Still, you are using the regex "txt = re.sub("\[\[(.*?)\]\]","<r>\g<1></r>", txt, 0, re.DOTALL)" which hides them. I was expecting po4a to do it for you.

What is the rationale behind the replacement of carriage returns? If the segment is "no-wrap", it is already stated that carriage returns must be preserved.

Unfortunately, I can't remember exactly what the problem was. It must have been a problem with DeepL, though. DeepL probably discards "\n" in certain situations, making this workaround necessary to preserve the original formatting.

I'm still surprised that you can make Deepl handle carriage returns just like another grammatical block in the sentence.

One other surprising thing is that you are not completely concealing the refs, variable names, and so on. For instance, for variables, I would have expected some regex along the lines of:
txt = re.sub("{(.*?)}","<v name=\"\g<1>\"/>", txt, 0, re.DOTALL)
so that the name of the variable is hidden from the translator, thus limiting the chances of errors.

Yes, that would also work, but it requires more characters. You can set the ignored tags in a request to DeepL; the ignored tags are then not translated. Here is the code snippet showing how I call the DeepL translator:

result = translator.translate_text(
    txt,
    tag_handling="xml",
    preserve_formatting=True,
    outline_detection=False,
    source_lang=src_lang,
    target_lang=dst_lang,
    ignore_tags="c,e,r,t,v,x,y",
    split_sentences="nonewlines",
    non_splitting_tags="a,b,d,u",
)

I use the library https://pypi.org/project/deepl/.

Ah, that makes sense; I didn't know DeepL could use XML nodes as placeholders.

In the end, I think a big part of your script could be made into a special mode of po4a, if you allow. I'd like to experiment with how other tools may handle these transformations.

Thanks for your support, it makes po4a more usable.

@suddenfall

I understand that you don't want DeepL to modify the string. What I don't understand is how po4a can produce such strings. Definition lists are supposed to be managed by po4a, which removes the '::'. That's why I'd be thankful for an example of such a case.

I have found two examples:

|::
         A vertical bar in angle or square brackets indicates that the user can choose between alternative inputs or values.

and

{ntb}DSP{nte}::
        Is the abbreviation for {ntb}<<dsp,**[underline]##D##**igital **[underline]##S##**ound **[underline]##P##**rocessor>>{nte}.

The idea is this: normally, index terms that are the same appear on the same line in the index. For instance:

Mode
  Windows
    34, 56, 78

indicates that these terms are found on pages 34, 56, and 78. But for this to work correctly, the references on these pages need to contain exactly (((Mode,Windows))). Otherwise, the terms won't be sorted under the same entry. This relation must hold homomorphically during the translation, that is, the translated terms must be identical in all their places.

This is managed for concealed terms, because now they are extracted and appear in the po file under a single po entry. Thus, they can only be translated to a unique term that will be copied into all the original locations.

But for flow terms, the issue is that we are not sure that the translator will stick to the same terms where identical terms appeared in the original document. In fact, it could even be impossible: think of German declensions that could change terms that were identical in English. To me, texts that are designed for internationalization should not use flow terms.

I agree with you. Concealed index terms should always be preferred over flow index terms. Bad luck that I have also used flow terms. Time for another refactoring (sometime).

OK, that works as I thought. The anchors are kept away from translators. Still, you are using the regex "txt = re.sub("\[\[(.*?)\]\]","<r>\g<1></r>", txt, 0, re.DOTALL)" which hides them. I was expecting po4a to do it for you.

I think inline anchors are comparable to flow index terms: both are used somewhere inside a sentence. The difference is that flow index terms should be translated (and then it's better to use concealed index terms, as you mentioned), while inline anchors shouldn't be translated. As far as I understand you, you want to hide the inline anchor from the translator, because they don't have to translate the word. That's a nice idea, but the translator has to decide where the inline anchor should be placed inside the translated text. I don't think you can solve this with po4a. Moving the anchors to the beginning of a segment is, in the case of my documents, not a good idea, because the reference would no longer make sense. Therefore, I would keep it as it is.

I'm still surprised that you can make Deepl handle carriage returns just like another grammatical block in the sentence.

Hm, you are right. Maybe I should add "n" to the ignore tags. But on the other hand, it currently works without trouble and I don't want to change a running system ;-).

In the end, I think a big part of your script could be made into a special mode of po4a, if you allow. I'd like to experiment with how other tools may handle these transformations.

Yes, of course. I would be happy if others could benefit from it too. If you like, I can upload the whole scripts as attachments. Then you'd see my full workflow, which may give you some inspiration.

Thanks for your support, it makes po4a more usable.

You're welcome.


jnavila commented Jun 13, 2023

I understand that you don't want DeepL to modify the string. What I don't understand is how po4a can produce such strings. Definition lists are supposed to be managed by po4a, which removes the '::'. That's why I'd be thankful for an example of such a case.

I have found two examples:

|::
         A vertical bar in angle or square brackets indicates that the user can choose between alternative inputs or values.

and

{ntb}DSP{nte}::
        Is the abbreviation for {ntb}<<dsp,**[underline]##D##**igital **[underline]##S##**ound **[underline]##P##**rocessor>>{nte}.

And you found two bugs! I'll generate a fix shortly.

OK, that works as I thought. The anchors are kept away from translators. Still, you are using the regex "txt = re.sub("\[\[(.*?)\]\]","<r>\g<1></r>", txt, 0, re.DOTALL)" which hides them. I was expecting po4a to do it for you.

I think inline anchors are comparable to flow index terms: both are used somewhere inside a sentence. The difference is that flow index terms should be translated (and then it's better to use concealed index terms, as you mentioned), while inline anchors shouldn't be translated. As far as I understand you, you want to hide the inline anchor from the translator, because they don't have to translate the word. That's a nice idea, but the translator has to decide where the inline anchor should be placed inside the translated text. I don't think you can solve this with po4a. Moving the anchors to the beginning of a segment is, in the case of my documents, not a good idea, because the reference would no longer make sense. Therefore, I would keep it as it is.

OK. I was about to propose the algorithm that you just rejected. I wonder whether an ID really creates an anchor at the place where it appears in the text. Thinking about it, I would expect the ID to be attached to the first enclosing block or the adjacent inline quote, because it has to be attached to an HTML node.

@suddenfall

I have just tested it. In both PDF and HTML, the anchor is at the exact position in the text where it was written in AsciiDoc. In HTML, for example, an anchor element is inserted at this point.


jnavila commented Jun 17, 2023

I think that all the fixable bugs were fixed in the PR. Thank you @suddenfall again for taking the time to describe the bugs. The xml conversion is on the TODO list.

@suddenfall

<badge style="platin">Tested & Approved</badge>

mquinson pushed a commit that referenced this issue Jun 19, 2023
This takes into account:

 * variables in the index entry
 * quoted index entries