Enable the `huge_tree` option for the `lxml` parser #3365

base: main
Conversation
@KwisatzHaderach Are you able to somehow test this change and let me know if it resolves your problem? Processing huge files with LXML can take a lot of resources (especially CPU). We can always add an option/condition to bypass the LXML processing completely and use just Jinja2. LXML is used only to check the XML schema for non-custom JUnit flavors and to prettify the XML output.
@seberm fine with me; the XML is not something a user can easily inject, I assume. If someone wants to DoS us, they have plenty of other options directly from the tests anyway.
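For reference, a minimal sketch of what enabling the option looks like (the JUnit snippet here is illustrative, not tmt's actual code):

```python
from lxml import etree

# huge_tree=True lifts libxml2's hardening limits (max element depth 256,
# roughly 10 MB per text node); only safe for input we trust, as discussed above.
parser = etree.XMLParser(huge_tree=True)

xml = b"<testsuite><testcase name='smoke'/></testsuite>"  # illustrative snippet
tree = etree.fromstring(xml, parser)
print(tree.tag)  # testsuite
```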
Didn't get a chance to test it yet, but approving since it looks good and we need this sooner rather than later.
So I gave it another round of tests. Let's take the following test as an example.

As you can see from the results below:
1) LXML completely bypassed (uninstalled), using just Jinja2 to generate the JUnit output: takes around 43s on my machine to generate, but it works.
2) LXML enabled (schema validation + pretty print)
Force-pushed from 2a7c422 to 5317bc7: `huge_tree` option for the lxml parser
If I understand it correctly, we want large depth (above 256) to require huge_tree.
We played with it locally with:
```shell
DEPTH=257
OUTPUT_FILE="deep.xml"

# Start the XML file with a root tag
echo "<root>" > "$OUTPUT_FILE"

# Generate nested opening tags
for i in $(seq 1 "$DEPTH"); do
    printf '%0.s ' $(seq 1 "$i") >> "$OUTPUT_FILE"  # Indent for readability (optional)
    echo "<tag$i>" >> "$OUTPUT_FILE"
done

# Close the nested tags in reverse order
for i in $(seq "$DEPTH" -1 1); do
    printf '%0.s ' $(seq 1 "$i") >> "$OUTPUT_FILE"  # Indent for readability (optional)
    echo "</tag$i>" >> "$OUTPUT_FILE"
done

# Close the root tag
echo "</root>" >> "$OUTPUT_FILE"
```
And then I thought we could catch the exception and only use huge_tree when it's necessary?
```python
from lxml import etree

def parse_with_huge_tree(file_path):
    try:
        tree = etree.parse(file_path)
        print("XML parsed successfully (default settings).")
        return tree
    except etree.LxmlError as e:
        print(f"LxmlError encountered: {e}")
        print("Retrying with huge_tree=True...")
        try:
            parser = etree.XMLParser(huge_tree=True)
            tree = etree.parse(file_path, parser)
            print("XML parsed successfully with huge_tree=True.")
            return tree
        except etree.LxmlError as retry_error:
            print(f"Retry failed with huge_tree=True: {retry_error}")
        except Exception as retry_exception:
            print(f"Unexpected error during retry: {retry_exception}")
    except Exception as e:
        print(f"Unexpected error: {e}")

# Usage
parse_with_huge_tree('./deep.xml')
```
Make sense?
(and maybe print a warning that the input is quite large and we are being forced to use huge_tree: security implications, et cetera)
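That fallback-with-warning idea could look something like this sketch (function name and warning text are just placeholders):

```python
import warnings

from lxml import etree

def parse_xml(file_path):
    """Parse with default (hardened) settings; fall back to huge_tree with a warning."""
    try:
        return etree.parse(file_path)
    except etree.XMLSyntaxError as error:
        warnings.warn(
            f"Parsing '{file_path}' failed with default limits ({error}); "
            "retrying with huge_tree=True. This disables libxml2 security "
            "restrictions, so make sure the input is trusted.",
            RuntimeWarning)
        return etree.parse(file_path, parser=etree.XMLParser(huge_tree=True))
```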
This PR should resolve the problem with the LXML limit on the size of text nodes it can handle.
For more info about the huge_tree option, please see the lxml documentation, especially the lxml FAQ section about security concerns.
TODOs:
Pull Request Checklist: