Consider switching from lxml's clean_html for enhanced security (and possibly performance) #558

frenzymadness · 2023-08-30T08:31:25Z

I'd like to bring to your attention that we are discussing the possibility of removing lxml's clean_html functionality from the lxml library. Over the past years, several concerning security vulnerabilities have been discovered in that functionality: CVE-2021-43818, CVE-2021-28957, CVE-2020-27783, CVE-2018-19787 and CVE-2014-3146.

The main problem is in the design. Because lxml's clean_html functionality is based on a blocklist, it's hard to keep it up to date with every new feature in HTML and JS.
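To make the design difference concrete, here is a toy sketch using only the standard library. It is not a sanitizer and is not how lxml, bleach, or nh3 are implemented; the tag sets and function names are made up for illustration. It only shows which tags each policy would let through.

```python
from html.parser import HTMLParser

# Hypothetical policies, chosen only for this illustration.
BLOCKED_TAGS = {"script", "iframe", "object", "embed"}
ALLOWED_TAGS = {"p", "b", "i", "em", "strong"}


class TagCollector(HTMLParser):
    """Records the name of every start tag seen in the input."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_in(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def surviving_blocklist(html):
    # Blocklist: everything passes unless someone remembered to list it.
    return [t for t in tags_in(html) if t not in BLOCKED_TAGS]


def surviving_allowlist(html):
    # Allow-list: nothing passes unless it was explicitly approved.
    return [t for t in tags_in(html) if t in ALLOWED_TAGS]


sample = '<p>hi</p><script>x()</script><svg onload="x()"></svg>'
print(surviving_blocklist(sample))  # the svg tag slips through
print(surviving_allowlist(sample))  # only p is kept
```

The blocklist has to be extended every time a new risky element or attribute appears, while the allow-list stays safe by default.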

Two viable alternatives worth considering are bleach and nh3. Here's why:

bleach:

Bleach is a widely adopted Python library specifically designed for sanitizing and cleaning HTML input.
It has a strong track record in terms of security – it's allowed-list-based.
It was deprecated in January 2023, but it will still receive security updates, support for new Python versions, and bugfixes (see the upstream issue).

nh3:

nh3 is a Python binding for the ammonia library. Ammonia is written in Rust and is also allowed-list-based.
Thanks to the Rust backend, nh3 is also significantly faster than bleach.
The Rust backend is nothing to be afraid of. nh3 uses the latest PyO3, which is compatible with Python 3.12, and provides wheels built on top of a compatible ABI for different architectures and platforms.
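For a rough idea of what a replacement could look like, here is a minimal sketch using nh3's `clean` function with an explicit allow-list. The `sanitize` wrapper and the chosen tag set are assumptions for illustration, not something requests-html uses today, and the import is guarded so the snippet still loads where nh3 is not installed.

```python
try:
    import nh3  # third-party package: pip install nh3
except ImportError:
    nh3 = None

# Hypothetical allow-list; a real project would pick its own.
ALLOWED_TAGS = {"p", "b", "i"}


def sanitize(html: str) -> str:
    """Return html with everything outside ALLOWED_TAGS stripped."""
    if nh3 is None:
        raise RuntimeError("nh3 is not installed")
    return nh3.clean(html, tags=ALLOWED_TAGS)


if nh3 is not None:
    print(sanitize("<p>Hello <b>world</b></p><script>alert(1)</script>"))
```

bleach offers a comparable `bleach.clean` entry point that also takes an allow-list of tags, so the same wrapper shape should work for either library.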

We'll probably move the cleaning part of lxml to a separate project first, so it will still be possible to use it, but it is better to find a suitable alternative sooner rather than later.

Let me know if we can help you with this transition in any way, and have a nice day.


frenzymadness · 2024-04-02T08:53:16Z

Just an update on this. The latest version of lxml (5.2.0) no longer contains the HTML cleaner. Its code is now available as a dedicated project on GitHub and PyPI.

If you want to continue using it, you can depend either on lxml[html_clean] or on lxml_html_clean directly. lxml contains a backward-compatible import, so the dependency is the only thing you need to change.
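If a project has to run in environments where the new package may or may not be present, a defensive import along these lines is one option. This is only a sketch: the helper name `load_cleaner_class` is made up, and it assumes the standalone package exposes `Cleaner` under the module name `lxml_html_clean`, as the old `lxml.html.clean` module did.

```python
import importlib


def load_cleaner_class():
    """Return the Cleaner class from whichever module is importable, else None."""
    # Try the standalone project first, then the legacy location inside lxml.
    for module_name in ("lxml_html_clean", "lxml.html.clean"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        return module.Cleaner
    return None


Cleaner = load_cleaner_class()
if Cleaner is None:
    print("No HTML cleaner available; install lxml[html_clean] or lxml_html_clean")
```

With the dependency declared as described above, the plain `from lxml.html.clean import Cleaner` import should keep working and this fallback is not needed.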


* Fixes excessive line wrapping.
* Adds changelog update.
* Reruns with current `black`.
* Adds lxml_html_clean dependency. For context, see: psf/requests-html#558 (comment) psf/requests-html#569 (comment)
* [nan] Removes specialized selector. It seems that Min Nan has been totally removed from English Wiktionary.
* updates changelog
* Adds new dependency to pyproject.toml too.

lxml-html-clean dependency since it currently breaks tests due to being migrated to a separate module. We're not currently using it, so omitting the import is fine. Should its functionality be needed again, the conda package "lxml-html-clean" will need to be installed. See: https://bugs.launchpad.net/lxml/+bug/1958539 psf/requests-html#558

psobolewskiPhD mentioned this issue Apr 1, 2024

ImportError: lxml.html.clean module is now a separate project lxml_html_clean over test suite (Ubuntu 20.04 and benchmarks) napari/napari#6798

Closed

hugovk mentioned this issue Apr 2, 2024

LXML 5.2.0 breaks import #569

Open

kylebgorman added a commit to kylebgorman/wikipron that referenced this issue Apr 8, 2024

Adds lxml_html_clean dependency.

21054c3

For context, see: psf/requests-html#558 (comment) psf/requests-html#569 (comment)
