To determine the logical path for pages backed by a file, Hugo starts with the file path, relative to the content directory, and then:
1. Strips the file extension
2. Strips the language identifier
3. Converts the result to lower case
4. Replaces spaces with hyphens
The value returned by the Path method on a Page object is independent of content format, language, and URL modifiers such as the slug and url front matter fields.
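The four documented steps can be sketched as follows. This is a minimal illustration, not Hugo's actual implementation: `logicalPath` and the `langs` set of language identifiers are hypothetical names introduced here. The point is that a function doing only what the documentation describes would leave characters such as `!` and `$` untouched.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// logicalPath is a hypothetical helper, not Hugo's code. It applies only the
// four documented steps to a file path relative to the content directory.
// langs is an assumed set of configured language identifiers.
func logicalPath(rel string, langs map[string]bool) string {
	// Step 1: strip the file extension.
	p := strings.TrimSuffix(rel, path.Ext(rel))
	// Step 2: strip a trailing language identifier such as ".en".
	if i := strings.LastIndex(p, "."); i >= 0 && langs[p[i+1:]] {
		p = p[:i]
	}
	// Step 3: convert the result to lower case.
	p = strings.ToLower(p)
	// Step 4: replace spaces with hyphens.
	p = strings.ReplaceAll(p, " ", "-")
	return "/" + p
}

func main() {
	langs := map[string]bool{"en": true, "de": true}
	for _, f := range []string{"Foo Bar.en.md", "foo!bar.md", "posts/foo$bar.md"} {
		fmt.Printf("%s => %s\n", f, logicalPath(f, langs))
	}
}
```

Under these assumptions, nothing in the four steps touches punctuation other than the space character.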
The correct title of this article is $#*! My Dad Says. The substitution of the # is due to technical restrictions.
This means that the only special characters that should be percent-encoded are the gen-delims, aside from : and @. Everything else is fair game.
encode-these = "/" / "?" / "#" / "[" / "]"
The first 3 are self-explanatory (delimiters for path segments, query component, and fragment component). The last 2 (square brackets) are specifically disallowed outside the authority component (where they are intended to be used with IPv6 literals), but still get used in the wild sometimes.
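As a rough sketch of that rule, the helper below percent-encodes only the five characters in `encode-these` within a single path segment and passes everything else through. `escapeSegment` is a hypothetical name used for illustration. The sketch deliberately ignores spaces, a literal `%`, and non-ASCII bytes, all of which a real implementation would also need to handle.

```go
package main

import (
	"fmt"
	"strings"
)

// encodeThese mirrors the five characters in the rule above.
const encodeThese = "/?#[]"

// escapeSegment is a hypothetical helper: it percent-encodes only the
// characters in encodeThese and leaves every other byte as-is.
func escapeSegment(seg string) string {
	var b strings.Builder
	for i := 0; i < len(seg); i++ {
		c := seg[i]
		if strings.IndexByte(encodeThese, c) >= 0 {
			// Write the byte as %XX with upper-case hex digits.
			fmt.Fprintf(&b, "%%%02X", c)
		} else {
			b.WriteByte(c)
		}
	}
	return b.String()
}

func main() {
	for _, s := range []string{"yu-gi-oh!", "foo$bar", "foo[bar]", "a/b?c#d"} {
		fmt.Printf("%q => %q\n", s, escapeSegment(s))
	}
}
```

With this approach, `yu-gi-oh!` and `foo$bar` come out unchanged, while the square brackets and the three delimiters are encoded.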
What version of Hugo are you using (hugo version)?
Does this issue reproduce with the latest release?
yes
Steps to reproduce
path includes an exclamation mark (e.g. yu-gi-oh!)
Expected behavior
The page is created with the url including the exclamation mark (e.g. /tags/yu-gi-oh!)
Actual behavior
The exclamation mark seems to be sanitized out (e.g. /tags/yu-gi-oh, creating a path conflict with an already existing page)
Additional information
https://gohugo.io/methods/page/path/ describes the following:
Nowhere in these 4 steps does it say anything about removing URL-safe characters entirely. This issue seems to occur for some URL-safe characters, but not all. These are the ones that work:
content/foo-bar.md => /foo-bar/ (correct).
content/foo_bar.md => /foo_bar/ (correct).
content/foo.bar.md => /foo.bar/ (correct).
content/foo+bar.md => /foo+bar/ (correct).
content/foo@bar.md => /foo@bar/ (correct).
content/foo~bar.md => /foo~bar/ (correct).
And these are the ones that don't work:
content/foo$bar.md => /foobar/ (incorrect). /foo$bar/ leads to a Page Not Found.
content/foo!bar.md => /foobar/ (incorrect). /foo!bar/ leads to a Page Not Found.
content/foo*bar.md => /foobar/ (incorrect). /foo*bar/ leads to a Page Not Found.
content/foo'bar.md => /foobar/ (incorrect). /foo'bar/ leads to a Page Not Found.
content/foo(bar.md => /foobar/ (incorrect). /foo(bar/ leads to a Page Not Found.
content/foo)bar.md => /foobar/ (incorrect). /foo)bar/ leads to a Page Not Found.
content/foo;bar.md => /foobar/ (incorrect). /foo;bar/ leads to a Page Not Found.
content/foo=bar.md => /foobar/ (incorrect). /foo=bar/ leads to a Page Not Found.
content/foo:bar.md => /foobar/ (incorrect). /foo:bar/ leads to a Page Not Found.
content/foo[bar.md => /foobar/ (incorrect). /foo[bar/ leads to a Page Not Found.
content/foo]bar.md => /foobar/ (incorrect). /foo]bar/ leads to a Page Not Found.
content/foo&bar.md => /foobar/ (incorrect). /foo&bar/ leads to a Page Not Found.
content/foo,bar.md => /foobar/ (incorrect). /foo,bar/ leads to a Page Not Found.
Again, I would expect that URL-safe characters are preserved, because there is nothing to suggest that they should be removed. Out of the "reserved" characters per the URI RFC (: / ? # [ ] @ ! $ & ' ( ) * + , ; =), we can probably eliminate characters that have special semantics in HTTP(S), like / for path segment separation, ? for query components, and # for fragment components. But most everything else should be fine to use in the path component. It feels weird to allow + or @, but not ! or $.
Example URLs with special characters in them:
Proposed resolution
Stop removing URL-safe characters from the Page.path, or at least make it clear (or better yet, configurable) which characters will be removed and which ones won't.
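To make the "configurable" option concrete, here is one possible shape for it. Everything here is an assumption for illustration: `sanitize` and its `keep` parameter are hypothetical and are not part of Hugo. The `current` set below is simply the punctuation this report observed being preserved, not a list taken from Hugo's source.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// sanitize is a hypothetical helper: it keeps letters, digits, the path
// separator, and any rune listed in keep, and drops everything else.
func sanitize(p, keep string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsLetter(r) || unicode.IsDigit(r) || r == '/' || strings.ContainsRune(keep, r) {
			return r
		}
		return -1 // strings.Map drops runes mapped to a negative value
	}, p)
}

func main() {
	// Punctuation observed to survive in this report (an assumption, not Hugo's list).
	current := "-_.+@~"
	// The same set extended with the characters this report found being removed.
	proposed := current + "!$&'()*,;=:"

	fmt.Println(sanitize("/tags/yu-gi-oh!", current))
	fmt.Println(sanitize("/tags/yu-gi-oh!", proposed))
}
```

With the narrower set the exclamation mark is dropped, reproducing the path conflict described above. With the extended set the two paths stay distinct.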