ConvertFrom-Html parses special characters as question marks #7

Open
dominikduennebacke opened this issue Oct 3, 2021 · 2 comments · May be fixed by #12
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@dominikduennebacke

Hi there. Really appreciate this module using PowerShell Core. Thank you for your work!

While scraping some European websites, I came across an issue with special characters such as ü, ä, ö, é, and ß.
ConvertFrom-Html does not handle these characters and parses them as question marks. It seems to be related to the encoding, which cannot be specified through any parameter.

Any ideas how to solve this?

Example

The Invoke-WebRequest content shows the "ü" character correctly

$Result = Invoke-WebRequest -Uri "https://www.compart.com/en/unicode/U+00FC"
$Result.Content -split "<" | Where-Object {$_ -like '*span class="box">*'}

>> span class="box">ü

ConvertFrom-Html parses that into "??"

$Html = ConvertFrom-Html -Content $Result
$Html.SelectNodes('//span[@class="box"]')

>> NodeType Name AttributeCount ChildNodeCount ContentLength InnerText
>> -------- ---- -------------- -------------- ------------- ---------
>> Element  span 1              1              2             ??

The response headers show the correct Content-Type with charset utf-8

$Result.Headers

>> Key             Value
>> ---             -----
>> Server          {nginx}
>> Date            {Sun, 03 Oct 2021 10:22:56 GMT}
>> Connection      {keep-alive}
>> X-Powered-By    {Express}
>> Accept-Ranges   {bytes}
>> Cache-Control   {public, max-age=0}
>> ETag            {W/"aabd-17a2d88a25f"}
>> X-Response-Time {0}
>> Vary            {Accept-Encoding}
>> Content-Type    {text/html; charset=utf-8}
>> Content-Length  {43709}
>> Last-Modified   {Mon, 21 Jun 2021 07:46:07 GMT}

Version info

$PSVersionTable

>> Name                           Value
>> ----                           -----
>> PSVersion                      7.1.3
>> PSEdition                      Core
>> GitCommitId                    7.1.3
>> OS                             Darwin 20.2.0 Darwin Kernel Version 20.2.0: Wed Dec  2 20:40:21 PST 2020; root:xnu-7195.60.75~1/RELEASE_ARM64_T8101
>> Platform                       Unix
>> PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
>> PSRemotingProtocolVersion      2.3
>> SerializationVersion           1.1.0.1
>> WSManStackVersion              3.0


Get-Module PowerHTML

>> ModuleType Version    PreRelease Name                                ExportedCommands
>> ---------- -------    ---------- ----                                ----------------
>> Script     0.1.7                 PowerHTML                           ConvertFrom-Html
JustinGrote added the "enhancement" and "good first issue" labels Oct 5, 2021
@JustinGrote
Owner

This appears to require the encoding to be modified:
https://html-agility-pack.net/knowledge-base/14793650/wrong-encoding-with-html-agility-pack

I am not going to have time in the near future to patch this in but I've flagged it as a hacktoberfest issue, maybe someone will be interested :)

@jantari

jantari commented Oct 12, 2021

I looked into this as a Hacktoberfest opportunity, but unfortunately for me the problem is not in HTMLAgilityPack or PowerHTML.

$Result = Invoke-WebRequest -Uri "https://www.compart.com/en/unicode/U+00FC"

In this example, $Result is an object of type Microsoft.PowerShell.Commands.WebResponseObject:

$Result -is [Microsoft.PowerShell.Commands.WebResponseObject]

When that object is passed, somewhat questionably, to the -Content parameter, it gets cast to [string[]]. The cast is allowed and works, but for some reason this conversion is what mangles the encoding. Doing the cast manually results in the same ?? problem:

[string]$Result
# Same result with ToString, also wrong encoding:
$Result.ToString()
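
One plausible explanation for seeing exactly two question marks is that the string conversion decodes the raw UTF-8 response bytes as ASCII. That is an assumption about what the conversion does internally, not something confirmed in this thread. The sketch below only demonstrates the general effect: "ü" is two bytes in UTF-8, and decoding those bytes as ASCII yields one "?" per byte. The variable names are illustrative.

```powershell
# Illustration only: decode the UTF-8 bytes of "ü" with two different encodings.
$Text  = [string][char]0x00FC                          # "ü" (U+00FC)
$Bytes = [System.Text.Encoding]::UTF8.GetBytes($Text)  # two bytes: 0xC3 0xBC

# ASCII cannot represent bytes above 0x7F, so each byte is replaced by "?".
$AsAscii = [System.Text.Encoding]::ASCII.GetString($Bytes)

# Decoding with the encoding the bytes were written in round-trips correctly.
$AsUtf8  = [System.Text.Encoding]::UTF8.GetString($Bytes)

$Bytes.Length   # 2
$AsAscii        # ??
$AsUtf8         # ü
```

This would also be consistent with the ContentLength of 2 reported for the span in the original report.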

The way to get the properly encoded string from the WebResponseObject is to access its Content property:

$Html = ConvertFrom-Html -Content $Result.Content
$Html.SelectNodes('//span[@class="box"]')

The automatic lossy conversion PowerShell is doing here is unfortunate, but in the end an easy fix for the user.
I could make ConvertFrom-Html accept Microsoft.PowerShell.Commands.WebResponseObject objects directly for input maybe?
Then there would be no lossy casting to string happening.
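
As a rough sketch of that idea from the caller's side, a small helper could pick the Content property when it is present and fall back to a plain string cast otherwise. Resolve-HtmlContent is a hypothetical name and is not part of PowerHTML. It checks for a string Content property rather than the exact WebResponseObject type, which is a simplification made here so the sketch does not depend on a live web request.

```powershell
# Hypothetical helper, not part of PowerHTML: return the HTML text of an input
# without going through the lossy object-to-string cast.
function Resolve-HtmlContent {
    [CmdletBinding()]
    [OutputType([string])]
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [object]$InputObject
    )
    process {
        # Assumption: anything exposing a string Content property (such as the
        # object returned by Invoke-WebRequest for an HTML page) should
        # contribute that property instead of its string representation.
        $ContentProperty = $InputObject.PSObject.Properties['Content']
        if ($null -ne $ContentProperty -and $ContentProperty.Value -is [string]) {
            $ContentProperty.Value
        }
        else {
            # Plain strings and everything else are passed through as text.
            [string]$InputObject
        }
    }
}
```

With the helper defined, the example from the report could be written as ConvertFrom-Html -Content ($Result | Resolve-HtmlContent). Whether the linked pull request takes this approach is not stated in this thread.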

pi-alexander-popel added a commit to postindustria-tech/device-detection-nginx that referenced this issue Jul 1, 2024
justadreamer pushed a commit to postindustria-tech/device-detection-nginx that referenced this issue Jul 1, 2024
trackd added a commit to trackd/PowerHTML that referenced this issue Jul 5, 2024
@trackd trackd linked a pull request Jul 5, 2024 that will close this issue
3 participants