Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Illegal control characters in XML output #365

Open
jeroen opened this issue Oct 19, 2020 · 12 comments
Open

Illegal control characters in XML output #365

jeroen opened this issue Oct 19, 2020 · 12 comments

Comments

@jeroen
Copy link
Contributor

jeroen commented Oct 19, 2020

Hi, I maintain the R bindings for cmark. One popular use case is converting commonmark to xml for processing the AST.

We are running into a problem when input markdown contains control characters (often captured from a tty), which makes xml output invalid. For example if the markdown text contains \033 and we convert that to xml, we get:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <text xml:space="preserve">�</text>
  </paragraph>
</document>

However, trying to parse this with libxml2 fails:

 Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
  PCDATA invalid Char value 27 [9] 

A real world example is this readme file. This was done with the gfm fork, but I think the problem appears the same.

Is this a bug in cmark, or is markdown text not supposed to contain c0 characters in the first place?

cc @nwellnhof

@nwellnhof
Copy link
Contributor

That's bascially a bug in cmark's XML renderer. Markdown allows C0 control characters, but XML 1.0 doesn't (except some whitespace). The only option I see is to encode control characters with an extra element like:

<text xml:space="preserve">Text containing control charater: <control code="27"/>. More text...</text>

But even this won't work with all XML workflows. XSLT stylesheets like tools/xml2md.xsl simply cannot output C0 control characters.

@jeroen
Copy link
Contributor Author

jeroen commented Oct 19, 2020

Would it be a good idea to introduce a new option, either for cmark_parser_new or cmark_render_xml, to strip C0 characters such that we do never end up with invalid xml?

@jeroen
Copy link
Contributor Author

jeroen commented Oct 19, 2020

Or alternatively, would you be willing to help me some example C code that allows me to substitute those characters from the xml before feeding into xmlReadMemory ? It is not completely clear to me which characters I should be removing, that are not escaped by cmark, but also not supported by the libxml2 parser.

@nwellnhof
Copy link
Contributor

You'd have to strip all control codes between 0x00 and 0x1F, except 0x09, 0x0A and 0x0D. For completeness, you should also strip Unicode code points 0xFFFE and 0xFFFF: https://www.w3.org/TR/REC-xml/#charsets

@jgm
Copy link
Member

jgm commented Oct 19, 2020

I'm no XML expert: would encoding with numeric character references suffice? That would be an easy fix.

@nwellnhof
Copy link
Contributor

No, this doesn't work. From the spec:

Characters referred to using character references MUST match the production for Char.

@jgm
Copy link
Member

jgm commented Oct 19, 2020

If the point of the XML renderer is to faithfully represent the AST, then I see no real alternative to using <control> or something like that. (That isn't ideal, because it gives the impression that these characters are special nodes in the AST, but is there a better way? Dropping the characters seems less faithful.)

@jeroen
Copy link
Contributor Author

jeroen commented Oct 19, 2020

In practice, at least in my cases, control characters in markdown are very rare and usually end up in the test by accident. In the real cases that I encountered, this happend when capturing stdout from a command line tool. An option to simply strip or substitute those works for me, a <control> tag introduce more complexity than it solves...

@jgm
Copy link
Member

jgm commented Oct 19, 2020

Well, for your purposes it might be fine to strip them. But we don't want to have two distinct ASTs represented by the same XML, if the XML purports to be the most accurate and direct representation of the AST. (There has been discussion of using the XML for tests, for example, and that wouldn't be possible if we stripped content.)

@jeroen
Copy link
Contributor Author

jeroen commented Oct 20, 2020

Well, yes and no, I think you can also argue that control C0 characters aren't part of the content in the first place. They have no textual meaning, just artifacts not for human consumption, like a BOM or EOF. For example, the json specification also states that a BOM character may be ignored by the parser.

Either way, any solution that allows me to reliable parse cmark_render_xml output in libxml2 works for me :-)

@jgm
Copy link
Member

jgm commented Oct 21, 2020

I think you can also argue that control C0 characters aren't part of the content in the first place

I think that would be a reasonable decision to make, but currently the spec doesn't say this.
Perhaps a change to the spec would be wise.

nwellnhof added a commit to nwellnhof/cmark that referenced this issue Feb 3, 2021
Control characters, U+FFFE and U+FFFF aren't allowed in XML 1.0, so
replace them with U+FFFD (replacement character). This doesn't solve
the problem how to roundtrip these characters, but at least we don't
produce invalid XML. See commonmark#365.
nwellnhof added a commit to nwellnhof/cmark that referenced this issue Feb 3, 2021
Control characters, U+FFFE and U+FFFF aren't allowed in XML 1.0, so
replace them with U+FFFD (replacement character). This doesn't solve
the problem how to roundtrip these characters, but at least we don't
produce invalid XML. See commonmark#365.
@nwellnhof
Copy link
Contributor

The pull request above should fix the issue for @jeroen.

We could also use a textual escape mechanism like \uXXXXXX which would also work with XML attributes. But for most practical purposes, simply replacing control and non-characters shouldn't be a problem.

jgm pushed a commit that referenced this issue Mar 18, 2021
Control characters, U+FFFE and U+FFFF aren't allowed in XML 1.0, so
replace them with U+FFFD (replacement character). This doesn't solve
the problem how to roundtrip these characters, but at least we don't
produce invalid XML. See #365.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants