-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ConvertFrom-Html parses special characters as question marks #7
Comments
This appears to require the encoding to be modified: I am not going to have time in the near future to patch this in but I've flagged it as a hacktoberfest issue, maybe someone will be interested :) |
I looked into this as a Hacktoberfest opportunity, but unfortunately for me the problem is not in HTMLAgilityPack or PowerHTML. $Result = Invoke-WebRequest -Uri "https://www.compart.com/en/unicode/U+00FC" In this example, $Result -is [Microsoft.PowerShell.Commands.WebResponseObject] and when that's somewhat questionably passed to the [string]$Result
# Same result with ToString, also wrong encoding:
$Result.ToString() The way to get the proper encoded string from the WebResponseObject is to access the $Html = ConvertFrom-Html -Content $Result.Content
$Html.SelectNodes('//span[@class="box"]') The automatic lossy conversion PowerShell is doing here is unfortunate, but in the end an easy fix for the user. |
Hi there. Really appreciate this module using PowerShell Core. Thank you for your work!
Scraping some European websites I came across an issue in regards to special characters, like ü, ä, ö, é, ß, etc.
Somehow ConvertFrom-Html cannot handle these characters and parses them as question marks. It seems to be related to the encoding which cannot be specified by any parameter.
Any ideas how to solve this?
Example
Invoke-WebRequest content show the "ü" character correctly
ConvertFrom-Html parses that into "??"
Return headers show correct content-type utf-8
Version info
The text was updated successfully, but these errors were encountered: