Skip to content

Commit

Permalink
Fix fragment quirks mode, and make it easier to specify fragment cont…
Browse files Browse the repository at this point in the history
…ext (#3246)

**What problem is this PR intended to solve?**

Coming from the discussion at #3203, I wanted to improve the fragment
parsing API

- as discussed in #2646, parse in no-quirks mode if a context element
name is provided (not a context `Node`, just the name)
- allow passing `:context` kwarg to `DocumentFragment.new` and `.parse`
- deprecate the positional options hash to `.parse` per notes at #3200


**Have you included adequate test coverage?**

Included additional coverage for the API changes


**Does this change affect the behavior of either the C or the Java
implementations?**

HTML5 is only in CRuby.
  • Loading branch information
flavorjones authored Jun 22, 2024
2 parents 8e03cc8 + eca68e7 commit cc78398
Show file tree
Hide file tree
Showing 7 changed files with 114 additions and 19 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA
* Documentation has been improved for `CSS.xpath_for`. [#3224] @flavorjones
* [CRuby] When compiling packaged libraries from source, allow users' `AR` and `LD` environment variables to set the archiver and linker commands, respectively. This augments the existing `CC` environment variable to set the compiler command. [#3165] @ziggythehamster
* [CRuby] The HTML5 parse methods accept a `:parse_noscript_content_as_text` keyword argument which will emulate the parsing behavior of a browser which has scripting enabled. [#3178, #3231] @stevecheckoway
* [CRuby] `HTML5::DocumentFragment.parse` and `.new` accept a `:context` keyword argument that is the parse context node or element name. Previously this could only be passed in as a positional argument to `.new` and not at all to `.parse`. @flavorjones


### Fixed
Expand All @@ -49,6 +50,7 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA

* The undocumented and unused method `Nokogiri::CSS.parse` is now deprecated and will generate a warning. The AST returned by this method is private and subject to change and removal in future versions of Nokogiri. This method will be removed in a future version of Nokogiri.
* Passing an options hash to `CSS.xpath_for` is now deprecated and will generate a warning. Use keyword arguments instead. This will become an error in a future version of Nokogiri.
* Passing an options hash to `HTML5::DocumentFragment.parse` is now deprecated and will generate a warning. Use keyword arguments instead. This will become an error in a future version of Nokogiri.


## v1.16.6 / 2024-06-13
Expand Down
2 changes: 1 addition & 1 deletion ext/nokogiri/gumbo.c
Original file line number Diff line number Diff line change
Expand Up @@ -538,7 +538,7 @@ noko_gumbo_s_fragment(int argc, VALUE *argv, VALUE _self)
VALUE doc = rb_funcall(doc_fragment, rb_intern_const("document"), 0);
VALUE dtd = rb_funcall(doc, internal_subset, 0);
VALUE doc_quirks_mode = rb_iv_get(doc, "@quirks_mode");
if (NIL_P(ctx) || NIL_P(doc_quirks_mode)) {
if (NIL_P(ctx) || (TYPE(ctx) == T_STRING) || NIL_P(doc_quirks_mode)) {
quirks_mode = GUMBO_DOCTYPE_NO_QUIRKS;
} else if (NIL_P(dtd)) {
quirks_mode = GUMBO_DOCTYPE_QUIRKS;
Expand Down
2 changes: 1 addition & 1 deletion lib/nokogiri/css.rb
Original file line number Diff line number Diff line change
Expand Up @@ -88,7 +88,7 @@ def xpath_for(
cache: true
)
unless options.nil?
warn("Passing options as an explicit hash is deprecated. Use keyword arguments instead. This will become an error in a future release.", uplevel: 1, category: :deprecated)
warn("Nokogiri::CSS.xpath_for: Passing options as an explicit hash is deprecated. Use keyword arguments instead. This will become an error in a future release.", uplevel: 1, category: :deprecated)
end

raise(TypeError, "no implicit conversion of #{selector.inspect} to String") unless selector.respond_to?(:to_str)
Expand Down
12 changes: 7 additions & 5 deletions lib/nokogiri/html5.rb
Original file line number Diff line number Diff line change
Expand Up @@ -249,11 +249,13 @@ def self.HTML5(input, url = nil, encoding = nil, **options, &block)
#
# == Notes
#
# * The Nokogiri::HTML5.fragment function takes a string and parses it as a HTML5 document. The
# +html+, +head+, and +body+ elements are removed from this document, and any children of these
# elements that remain are returned as a Nokogiri::HTML5::DocumentFragment.
# * The Nokogiri::HTML5.fragment function takes a String or IO and parses it as a HTML5 document
# in a +body+ context. As a result, the +html+, +head+, and +body+ elements are removed from
# this document, and any children of these elements that remain are returned as a
# Nokogiri::HTML5::DocumentFragment; but you can pass in a different context (e.g., "html" to
# get +head+ and +body+ tags in the result).
#
# * The Nokogiri::HTML5.parse function takes a string and passes it to the
# * The Nokogiri::HTML5.parse function takes a String or IO and passes it to the
# <code>gumbo_parse_with_options</code> method, using the default options. The resulting Gumbo
# parse tree is then walked.
#
Expand All @@ -273,7 +275,7 @@ def parse(string, url = nil, encoding = nil, **options, &block)
# Parse a fragment from +string+. Convenience method for
# Nokogiri::HTML5::DocumentFragment.parse.
def fragment(string, encoding = nil, **options)
DocumentFragment.parse(string, encoding, options)
DocumentFragment.parse(string, encoding, **options)
end

# :nodoc:
Expand Down
59 changes: 47 additions & 12 deletions lib/nokogiri/html5/document_fragment.rb
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,47 @@ module HTML5
#
# 💡 HTML5 functionality is not available when running JRuby.
class DocumentFragment < Nokogiri::HTML4::DocumentFragment
class << self
# :call-seq:
# parse(tags, **options)
# parse(tags, encoding = nil, **options)
#
# Parse an HTML5 document fragment from +tags+, returning a Nodeset.
#
# [Parameters]
# - +tags+ [String, IO] The HTML5 document fragment to parse.
# - +encoding+ [String] The name of the encoding to use when parsing the document fragment. (default +nil+)
#
# Also see Nokogiri::HTML5 for a longer explanation of how encoding is handled by the parser.
#
# [Options]
# - +:context+ [String, Nokogiri::XML::Node] The context in which to parse the document fragment. (default +"body"+)
# - +:max_errors+ [Integer] The maximum number of parse errors to record. (default +Nokogiri::Gumbo::DEFAULT_MAX_ERRORS+ which is currently 0)
# - +:max_tree_depth+ [Integer] The maximum depth of the parse tree. (default +Nokogiri::Gumbo::DEFAULT_MAX_TREE_DEPTH+)
# - +:max_attributes+ [Integer] The maximum number of attributes allowed on an element. (default +Nokogiri::Gumbo::DEFAULT_MAX_ATTRIBUTES+)
# - +:parse_noscript_content_as_text+ [Boolean] Whether to parse the content of +noscript+ elements as text. (default +false+)
#
# Also see Nokogiri::HTML5 for a longer explanation of the options.
#
# [Returns]
# - [Nokogiri::XML::NodeSet] A node set containing the root nodes of the parsed fragment.
#
def parse(tags, encoding = nil, positional_options_hash = nil, **options)
unless positional_options_hash.nil?
warn("Nokogiri::HTML5::DocumentFragment.parse: Passing options as an explicit hash is deprecated. Use keyword arguments instead. This will become an error in a future release.", uplevel: 1, category: :deprecated)
options.merge!(positional_options_hash)
end

context = options.delete(:context)

document = HTML5::Document.new
document.encoding = "UTF-8"
tags = HTML5.read_and_encode(tags, encoding)

new(document, tags, context, options)
end
end

attr_accessor :document
attr_accessor :errors

Expand All @@ -36,18 +77,20 @@ class DocumentFragment < Nokogiri::HTML4::DocumentFragment
attr_reader :quirks_mode

# Create a document fragment.
def initialize(doc, tags = nil, ctx = nil, options = {}) # rubocop:disable Lint/MissingSuper
self.document = doc
self.errors = []
def initialize(doc, tags = nil, context = nil, options = {}) # rubocop:disable Lint/MissingSuper
@document = doc
@errors = []
return self unless tags

tags = Nokogiri::HTML5.read_and_encode(tags, nil)

context = options.delete(:context) if options.key?(:context)

options[:max_attributes] ||= Nokogiri::Gumbo::DEFAULT_MAX_ATTRIBUTES
options[:max_errors] ||= options.delete(:max_parse_errors) || Nokogiri::Gumbo::DEFAULT_MAX_ERRORS
options[:max_tree_depth] ||= Nokogiri::Gumbo::DEFAULT_MAX_TREE_DEPTH

Nokogiri::Gumbo.fragment(self, tags, ctx, **options)
Nokogiri::Gumbo.fragment(self, tags, context, **options)
end

def serialize(options = {}, &block) # :nodoc:
Expand All @@ -56,14 +99,6 @@ def serialize(options = {}, &block) # :nodoc:
XML::Node.instance_method(:serialize).bind_call(self, options, &block)
end

# Parse a document fragment from +tags+, returning a Nodeset.
def self.parse(tags, encoding = nil, options = {})
doc = HTML5::Document.new
tags = HTML5.read_and_encode(tags, encoding)
doc.encoding = "UTF-8"
new(doc, tags, nil, options)
end

def extract_params(params) # :nodoc:
handler = params.find do |param|
![Hash, String, Symbol].include?(param.class)
Expand Down
38 changes: 38 additions & 0 deletions test/html5/test_api.rb
Original file line number Diff line number Diff line change
Expand Up @@ -398,6 +398,44 @@ def initialize(*args)
end

describe Nokogiri::HTML5::DocumentFragment do
describe "passing in context node" do
it "to DocumentFragment.new" do
fragment = Nokogiri::HTML5::DocumentFragment.new(
Nokogiri::HTML5::Document.new,
"<body><div>foo</div></body>",
"html",
)
assert_match(/<body>/, fragment.to_s)
assert_match(/<head>/, fragment.to_s)
end

describe "to DocumentFragment.parse" do
it "as an options hash" do
assert_output(nil, /Passing options as an explicit hash is deprecated/) do
fragment = Nokogiri::HTML5::DocumentFragment.parse(
"<body><div>foo</div></body>",
nil,
{ context: "html" },
)
assert_match(/<body>/, fragment.to_s)
assert_match(/<head>/, fragment.to_s)
end
end

it "as keyword argument" do
fragment = Nokogiri::HTML5::DocumentFragment.parse("<body><div>foo</div></body>", context: "html")
assert_match(/<body>/, fragment.to_s)
assert_match(/<head>/, fragment.to_s)
end
end

it "to HTML5.fragment" do
fragment = Nokogiri::HTML5.fragment("<body><div>foo</div></body>", context: "html")
assert_match(/<body>/, fragment.to_s)
assert_match(/<head>/, fragment.to_s)
end
end

describe "subclassing" do
let(:klass) do
Class.new(Nokogiri::HTML5::DocumentFragment) do
Expand Down
18 changes: 18 additions & 0 deletions test/html5/test_quirks_mode.rb
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,15 @@
assert_equal(Nokogiri::HTML5::QuirksMode::NO_QUIRKS, fragment.quirks_mode)
assert_equal(no_quirks_output, fragment.to_html)
end

describe "context node is just a tag name and not a real node" do
it "parses the fragment in no-quirks mode" do
fragment = Nokogiri::HTML5::DocumentFragment.new(document, input, "body")

assert_equal(Nokogiri::HTML5::QuirksMode::NO_QUIRKS, fragment.quirks_mode)
assert_equal(no_quirks_output, fragment.to_html)
end
end
end

describe "document does not have a doctype" do
Expand All @@ -78,6 +87,15 @@
assert_equal(Nokogiri::HTML5::QuirksMode::QUIRKS, fragment.quirks_mode)
assert_equal(quirks_output, fragment.to_html)
end

describe "context node is just a tag name and not a real node" do
it "parses the fragment in no-quirks mode" do
fragment = Nokogiri::HTML5::DocumentFragment.new(document, input, "body")

assert_equal(Nokogiri::HTML5::QuirksMode::NO_QUIRKS, fragment.quirks_mode)
assert_equal(no_quirks_output, fragment.to_html)
end
end
end
end

Expand Down

0 comments on commit cc78398

Please sign in to comment.