Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement parsing extensions #123

Open
wants to merge 20 commits into
base: master
Choose a base branch
from

Conversation

MathieuDuponchelle
Copy link
Contributor

Reference discussion: #100

I just rebased the branch on master and tested it, make passes with no new warnings all along the branch, valgrinding a table test shows no leaks.

I applied the initial review comments from @jgm and @nwellnhof , which were to not expose cmark_strbuf, which is thus included in the sources of the reference "core extension", and to not expose API using cmark's bool type, which is replaced by integers, rendering installation of config.h unnecessary.

The design implemented is the second one I proposed in the above discussion, here's a brief summary:

  • A new structure is exposed, cmark_syntax_extension, which exposes hooks that match the block and inline parsing strategies outlined in http://spec.commonmark.org/0.25/#appendix-a-parsing-strategy .

  • This proposal does not cover the implementation of new block types, instead syntax extensions need to create blocks with types defined by cmark itself, including CMARK_NODE_CUSTOM_BLOCK and CMARK_NODE_CUSTOM_INLINE, this already covers a wide range of use cases.

    That is because in my opinion, we want cmark to guarantee that the output AST will always be correctly structured (with the exception of custom inlines and blocks), with no "impossible" or "maningless" block nesting situations.

    An interesting enhancement would be to refactor the custom inline and block types to let extensions provide virtual methods such as "can_contain" for them, but this will be somehow fragile as "inter-extension negotiation" will not be possible.

  • An important part of this proposal is that cmark implement and expose block types for which it does not yet have a formal syntax specification, such as tables or strikethroughs. I believe this is desirable because this will allow multiple syntax extensions to compete for a given desired output, without tying cmark's specification to a specific syntax, while still making sure the produced output is correct. It makes sense when looking at it as a testing ground if you will.

It is important to state that this design doesn't prevent us from implementing negotiation in the future if we so desire.

I think this branch is now approaching what I'd consider mergeable quality, with the exception of the plugin code, which I think is mostly correct on Linux, but will not work on Windows. If @nwellnhof is interested in tackling the abstraction for Windows, then all the best, otherwise the code can be merged without plugin loading / discovery for now, as it is already useful for people using cmark as a library (such as I).

Last thing, this last rebase round has been pretty involved, and I'd very much like (if we all agree that extensions are desirable) that merging this branch be made a priority, I will of course address reviews in a timely manner.

@MathieuDuponchelle
Copy link
Contributor Author

yay MSVC \o/ I'll have a look at the issues tomorrow, hoping it does not just uncover a new layer :)

src/cmark.h Outdated
cmark_node*cmark_parser_add_child(cmark_parser *parser,
cmark_node *parent,
cmark_node_type block_type,
int start_column);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't extensions simply use cmark_node_append_child?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I suggest you look at the implementation of add_child in src/blocks.c , this function finalizes open parent blocks, and creates the new block as open. This is really a parsing-specific method, and it cannot be replaced by cmark_node_append_child

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this to a separate public header

@nwellnhof
Copy link
Contributor

Some general notes:

This proposal does not cover the implementation of new block types, instead syntax extensions need to create blocks with types defined by cmark itself, including CMARK_NODE_CUSTOM_BLOCK and CMARK_NODE_CUSTOM_INLINE, this already covers a wide range of use cases.

But that's not what your table and strikethrough extensions do. They define their own block types in the cmark core. If your extension nodes used the custom node types, it would be more convincing that this works for other extensions, too.

Regarding dynamic loading and plugin discovery: I think we should leave this out for now, and only have a function to register an extension with a parser.

It might also be a good idea to move everything related to extensions to a separate public header file. There are some functions like cmark_set_node_type or cmark_set_string_content that are unsafe in their current form and should only be used by extension authors. If we put such functions into cmark.h, people will start using them and complain if things start to break. The extension header file should come with a warning that this is a semi-public API only for extension authors.

@MathieuDuponchelle
Copy link
Contributor Author

MathieuDuponchelle commented Apr 26, 2016

Thanks for the review, answering the general comment here:

But that's not what your table and strikethrough extensions do. They define their own block types in the cmark core. If your extension nodes used the custom node types, it would be more convincing that this works for other extensions, too.

That is what my extensions do, they don't define their own block types, cmark does, it so happens that the commits for adding some of them live in the same branch. For reference here's the code for a bunch of other extensions:

https://github.com/hotdoc/hotdoc/blob/master/hotdoc/parsers/cmark_gtkdoc_extension.c
https://github.com/hotdoc/hotdoc/blob/master/hotdoc/parsers/cmark_include_scanner.c

If you look at https://github.com/hotdoc/hotdoc/blob/master/hotdoc/parsers/cmark_gtkdoc_extension.c#L256 for example, you will see that this extension instantiates a regular CMARK_NODE_CODE_BLOCK, it's a totally legit use case for the syntax extensions design I propose to create standard block types according to their own syntax rules, this makes them robust (block precedence is enforced by cmark's core) and concise (rendering for all possible output formats is handled by cmark's core)

The key element in this proposed design is that extensions do not define their own block types, as I believe there is no satisfying way to ensure inter-extension block precedence / can_contain rules. That is also why my extensions do not use the "CMARK_NODE_CUSTOM_*" types, as I am not interested in a fragile solution, or in implementing format-specific rendering code. There is also no actual correlation between the "convicingness" (is that a word?) of my proposal and the fact that my extensions (do not) use these types.

Note also that the two new block types I implemented in cmark are two extensions for which there is (or at least seems to be) a clear support intention for upstream in the future. I think it can't really hurt to implement the rendering code for them ahead of definitive syntax rules, quite the contrary actually.

Regarding dynamic loading and plugin discovery: I think we should leave this out for now, and only have a function to register an extension with a parser.

ack, I'll split that out

It might also be a good idea to move everything related to extensions to a separate public header file. There are some functions like cmark_set_node_type or cmark_set_string_content that are unsafe in their current form and should only be used by extension authors. If we put such functions into cmark.h, people will start using them and complain if things start to break. The extension header file should come with a warning that this is a semi-public API only for extension authors.

That could make sense, along with an "unstable" warning, and maybe a preprocessor directive to enable it (CMARK_ENABLE_EXTENSION_API or something like that)

I would be interested in @jgm's opinion on that last point

Last thing, @nwellnhof I noticed you made a few comments on outdated diffs, and intentionally didn't address these, can you make these comments again on the current diff?

I've had a lot of work today, and what little time I had to work on cmark I spent addressing this (very appreciated) review, I will make the agreed-upon changes tomorrow (I hope :)

Thanks!

@MathieuDuponchelle
Copy link
Contributor Author

MathieuDuponchelle commented Apr 27, 2016

I apologize for the noise here (though github should apologize really), I stick to a rebase workflow because I prefer having a clean history to work with, instead of piling up commits on top of my branch.

EDIT: seems it sort of fixed itself, at the time I wrote that github was displaying the original commits and their multiple new versions

@MathieuDuponchelle
Copy link
Contributor Author

I have addressed all the review comments, either correcting the issues that were pointed out or answering the comments.

This branch now installs a new public header "cmark_extension_api.h", separate from cmark.h, to be included and used by syntax extension writers.

Additionally, I've also separated the plugin loading and discovery code to its own commit, however after thinking back on it I'd really like @nwellnhof to come true on the windows port promise: this is a pretty important part of the functionality that you're asking me to take away from the branch, and you have pretty vehemently protested my proposal of using the GLib, which would have offered us a portable and well-tested plugin interface.

It would be a bit ugly to oppose this proposal, state that you can port the code to Windows, then just ask me to drop it when comes the time to actually do what you promised.

If you do not have time to work on that, I propose we merge the linux implementation to start with.

@jgm, do you have time for reviewing this branch these days?

@markhalliwell
Copy link

There is now a PHP PECL extension of cmark (https://github.com/krakjoe/cmark), but in order to utilize it within https://github.com/thephpleague/commonmark (which is highly extensible), the underlying libcmark library needs to support this.

It would be nice to get some traction on this again.

@MathieuDuponchelle
Copy link
Contributor Author

@MarkCarver I completely agree that this would be lovely, but we'd need to get @jgm back in here, and validate the design decision made by this PR, which was to implement support for output node types (eg tables) without an official agreed upon input syntax. I am still pretty convinced this is a safe enough decision, but well :)

@andrewshadura
Copy link

Any updates on this?

@MathieuDuponchelle
Copy link
Contributor Author

ping @jgm

@jgm
Copy link
Member

jgm commented Dec 4, 2019

Sorry, I just haven't had the bandwidth to devote to this. This extension mechanism has been implemented in github's fork of cmark. Those who need extensions can get them there. Since this library purports to be the reference implementation, and it needs to be updated every time the spec is changed, it makes sense to me to keep it simple and not worry about extensions for now at least....

@MathieuDuponchelle
Copy link
Contributor Author

MathieuDuponchelle commented Dec 4, 2019

Well let's just close this then.

Edit: it's a bit annoying that a fork was the solution to this

@markhalliwell
Copy link

Please don't close this issue. I still believe it's needed as a primary feature in cmark, regardless of how long it may take.

Offering https://github.com/github/cmark-gfm as a "solution" is really much of a solution for https://github.com/krakjoe/cmark or https://github.com/thephpleague/commonmark

Having the ability to extend cmark should be a global ability, not some GFM implementation.

@jgm
Copy link
Member

jgm commented Dec 4, 2019

I'd also favor keeping this open. I'm not saying the fork was the solution. I'm just saying that, for now, I don't have the time to deal with it, and keeping the profile of this project slimmer makes it easier for me to keep up with it.

@jgm
Copy link
Member

jgm commented Dec 4, 2019

I'm open to merging this, really. But it's a huge patch and surveying it is not simple. And I'm reluctant to ask @MathieuDuponchelle to rebase it again so it can be reviewed, given my past footdragging on this. If someone else wants to try to get it in shape, I can take a look, but no guarantees.

@markhalliwell
Copy link

I think it would make more sense to split this PR up, i.e. the extension API(s) as a single PR and then the actual subsequent extension implementations into their own individual PRs?

@MathieuDuponchelle
Copy link
Contributor Author

reopened, I had ideas on how to make this more palatable and if there's interest I can give it a go again at some point, I don't think after 3 years there is much of an urgency anyway :)

@MathieuDuponchelle
Copy link
Contributor Author

I think it would make more sense to split this PR up, i.e. the extension API(s) as a single PR and then the actual subsequent extension implementations into their own individual PRs?

I think the extension API is easier to review alongside extension examples tbh, however I can rework this to disallow mixing / matching multiple extensions from different origins, thus allowing extensions to define their own node types in a "safe" manner, that's basically what github ended up doing iirc

@jgm
Copy link
Member

jgm commented Dec 4, 2019

Is linking in dl going to cause problems for people who are trying to compile the library statically (e.g. on musl) and use it elsewhere?

@jgm
Copy link
Member

jgm commented Dec 4, 2019

Did GitHub keep the dynamic linking aspect of this, I don't recall?

@MathieuDuponchelle
Copy link
Contributor Author

Is linking in dl going to cause problems for people who are trying to compile the library statically (e.g. on musl) and use it elsewhere?

My plan would get rid of the dl linking entirely

@MathieuDuponchelle
Copy link
Contributor Author

MathieuDuponchelle commented Dec 4, 2019

Did GitHub keep the dynamic linking aspect of this, I don't recall?

I think they didn't, but that was a long time ago :) The problem without dynamic linking is that you end up with two solutions for people who want extensions:

  • Maintain their own cmark fork, compile in their extensions (meh)

  • Maintain extensions in cmark itself (more maintaining burden on the cmark side, but you're at least guaranteed you won't break extensions and it makes users life easier)

@jgm
Copy link
Member

jgm commented Dec 4, 2019

Can you clarify this? You want to avoid linking in dl, but still have dynamic plugins? How is it done? (I'm sorry, I've forgotten a lot of the context.)

@MathieuDuponchelle
Copy link
Contributor Author

Can you clarify this? You want to avoid linking in dl, but still have dynamic plugins? How is it done? (I'm sorry, I've forgotten a lot of the context.)

No, those two potential solutions were in the event that cmark wouldn't allow for third-party / dynamic plugins, hence not use dl / equivalent for other platforms.

I'm on the fence as to whether I want dynamic extensions or not, as in my specific case I have extensions for which it would make no sense to maintain them in cmark itself (eg, inline links to symbol names with #foo or foo() for function names, this is syntax from gtk-doc which I need to support). This means that without dynamic plugins, my only option will be to keep maintaining a fork, although with limited chances for conflicts when rebasing, which is already an improvement compared to my current situation :)

@jgm
Copy link
Member

jgm commented Dec 5, 2019

inline links to symbol names with #foo or foo() for function names

I'm pretty sure you don't need anything as heavyweight as an extension to support these. These could be handled with a transformation of the AST after parsing, no?

@MathieuDuponchelle
Copy link
Contributor Author

I'm pretty sure you don't need anything as heavyweight as an extension to support these. These could be handled with a transformation of the AST after parsing, no?

nope, not if I care about having proper parsing at least :) I don't remember the exact details, but a syntax extension is the correct approach here. Besides, these are just examples really, and syntax extensions are not heavyweight, they're actually more performant than whatever a posteriori, fragile AST parsing I could come up with.

CyberShadow pushed a commit to CyberShadow/cmark that referenced this pull request Apr 1, 2021
* default to safe

* fix setter test
talum pushed a commit to github/cmark-gfm that referenced this pull request Sep 14, 2021
This is too strict, as it prevents the use of dynamically
loaded extensions: see
commonmark#123 (comment).

Documented in man page and public header that one should use the same
memory allocator for every node in a tree.
talum pushed a commit to github/cmark-gfm that referenced this pull request Sep 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants