feat: report file name of file that chardet fails to read #3524

Open
wants to merge 8 commits into
base: main

corneliusroemer
Contributor

resolves codespell-project#3519

Tested and it works now, reporting the file name:

```
codespell --write-changes -i3 -C 5 -H -f -e --count -s --builtin clear,rare,names
Failed to decode file ./pep_sphinx_extensions/tests/pep_lint/test_pep_number.py using detected encoding Windows-1254.
Traceback (most recent call last):
  File "/Users/corneliusromer/micromamba/envs/codespell/bin/codespell", line 8, in <module>
    sys.exit(_script_main())
             ^^^^^^^^^^^^^^
  File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 1103, in _script_main
    return main(*sys.argv[1:])
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 1300, in main
    bad_count += parse_file(
                 ^^^^^^^^^^^
  File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 945, in parse_file
    lines, encoding = file_opener.open(filename)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 232, in open
    return self.open_with_chardet(filename)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 246, in open_with_chardet
    lines = self.get_lines(f)
            ^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/code/codespell/codespell_lib/_codespell.py", line 303, in get_lines
    lines = f.readlines()
            ^^^^^^^^^^^^^
  File "/Users/corneliusromer/micromamba/envs/codespell/lib/python3.12/encodings/cp1254.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1349: character maps to <undefined>
```
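The failure in the traceback above can be reproduced in isolation. This is a minimal sketch that assumes only what the traceback shows (byte 0x90 has no mapping in Python's cp1254 codec); the temporary file and its contents are made up for illustration.

```python
import os
import tempfile

# Hypothetical file content: plain ASCII plus the byte 0x90, which the
# traceback above reports as undefined in cp1254 (Windows-1254).
payload = b"pep = 1\n\x90\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    with open(path, encoding="cp1254", newline="") as f:
        f.readlines()
except UnicodeDecodeError as exc:
    # The exception text names the codec, the byte and its offset, but not
    # the file, which is why the file name has to be printed separately.
    message = str(exc)
    has_filename = path in message
finally:
    os.remove(path)
```

Because the exception text alone does not identify the file, printing the name before re-raising is what makes the report actionable when many files are scanned.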
codespell_lib/_codespell.py

            raise
        else:
            lines = self.get_lines(f)
            f.close()
Collaborator

To minimize changes, I would suggest:

        try:
            f = open(filename, encoding=encoding, newline="")
        except LookupError:
            print(
                f"ERROR: Don't know how to handle encoding {encoding}: {filename}",
                file=sys.stderr,
            )
            raise
        else:
            try:
                lines = f.readlines()
            except UnicodeDecodeError:
                print(f"ERROR: Could not detect encoding: {filename}", file=sys.stderr)
                raise
            finally:
                f.close()

Contributor Author

Why do we need to minimize changes? I'm happy to go with whatever you prefer, but let me try to convince you (with the Zen of Python):

  • "There should be one-- and preferably only one --obvious way to do it": context managers are the obvious way to open files. Closing the file in finally is messier.
  • "Flat is better than nested": your suggestion nests try/else/try/finally, which makes it hard to grok.

Contributor Author

Do you really think this is not much more readable?

        try:
            with open(filename, encoding=encoding, newline="") as f:
                lines = self.get_lines(f)
        except LookupError:  # Raised by open() if encoding is unknown
            error_msg = f"ERROR: Chardet returned unknown encoding for: {filename}."
            print(error_msg, file=sys.stderr)
            raise
        except UnicodeDecodeError:  # Raised by self.get_lines() if decoding fails
            error_msg = f"ERROR: Failed decoding file: {filename}"
            print(error_msg, file=sys.stderr)
            raise

Also note that you introduced a bug by replacing self.get_lines(f) with f.readlines() in your suggestion 🙈
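For readers who want to check the two comments in the snippet above, here is a self-contained sketch of the same flat structure. `read_lines` is a hypothetical stand-in that takes the encoding as a parameter and calls `f.readlines()` directly where the PR calls `self.get_lines(f)`; it is not the code in the PR, and the file contents are invented.

```python
import os
import sys
import tempfile
from typing import List


def read_lines(filename: str, encoding: str) -> List[str]:
    """Hypothetical stand-in: flat try/except around a context manager."""
    try:
        with open(filename, encoding=encoding, newline="") as f:
            return f.readlines()
    except LookupError:  # raised by open() when the codec name is unknown
        print(f"ERROR: Unknown encoding {encoding}: {filename}", file=sys.stderr)
        raise
    except UnicodeDecodeError:  # raised while reading, not while opening
        print(f"ERROR: Failed decoding file: {filename}", file=sys.stderr)
        raise


# Made-up file: 0x90 is undefined in cp1254, so reading it must fail.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ok\n\x90\n")
    path = tmp.name

raised = []
for enc in ("no-such-encoding", "cp1254"):
    try:
        read_lines(path, enc)
    except (LookupError, UnicodeDecodeError) as exc:
        raised.append(type(exc).__name__)

os.remove(path)
```

The first encoding fails inside `open()` and the second only once the file is read, so the two `except` clauses never compete even though they share one `try`.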

Collaborator

@DimitriPapadopoulos Aug 20, 2024

  1. I agree that flat is generally better, but I would also like to limit the code under try to the strict minimum, as a way to document which exceptions each piece of code is expected to raise. Some linters enforce that.
    • of the exceptions handled here, open raises only LookupError
    • readlines raises only UnicodeDecodeError
  2. I'd like to keep the codebase consistent. See "Fix uncaught exception on empty files" #2195.

Collaborator

I do agree with using context managers to open files. I just didn't know how to make it compatible with try.
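One possible way to combine a context manager with narrowly scoped try blocks, offered as a sketch rather than as the code merged in this PR: a file object is itself a context manager, so `open()` can sit alone in its own `try` and the returned object can then be entered with `with f:`. The name `read_lines_narrow` and the sample file are hypothetical.

```python
import os
import sys
import tempfile
from typing import List


def read_lines_narrow(filename: str, encoding: str) -> List[str]:
    """Hypothetical helper: each try wraps only the call it guards."""
    try:
        f = open(filename, encoding=encoding, newline="")
    except LookupError:
        print(
            f"ERROR: Don't know how to handle encoding {encoding}: {filename}",
            file=sys.stderr,
        )
        raise
    # A file object is its own context manager, so `with f:` closes it on
    # every exit path and no explicit finally/close() is needed.
    with f:
        try:
            return f.readlines()
        except UnicodeDecodeError:
            print(f"ERROR: Failed decoding file: {filename}", file=sys.stderr)
            raise


# Made-up sample file with mixed line endings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first\r\nsecond\n")
    path = tmp.name

lines = read_lines_narrow(path, "utf-8")

try:
    read_lines_narrow(path, "no-such-encoding")
    unknown_encoding_raised = False
except LookupError:
    unknown_encoding_raised = True

os.remove(path)
```

This keeps one level of nesting (`with` around `try`), so it is a compromise between the two positions in this thread rather than a fully flat version.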

Contributor Author

I agree that it's nice to keep try blocks local, but that is a tradeoff against nesting. I added comments instead to make clear which call raises which exception.

Contributor Author

Can you be more specific regarding your "I would like to keep codebase consistent"? I don't know what exactly you would like me to change. Is anything inconsistent?

Collaborator

open_with_chardet and open_with_internal (already fixed in #2195) should be kept as similar as possible, and should even share code where possible.

Contributor Author

Thanks for the clarification!

…nges.

We require the type info, otherwise mypy fails
@corneliusroemer
Contributor Author

I've added tests because codecov failed otherwise. Not sure these are super important but what's done is done! Learned something about testing on the way - and mocking!
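As an illustration of how an error branch like this can be covered with a mock, here is a minimal sketch. It does not reproduce the tests added in this PR: `read_with_detection`, the `detect` callable, and the check function are all hypothetical names, and the mock simply stands in for whatever encoding detection would return.

```python
import contextlib
import io
import os
import sys
import tempfile
from typing import Callable, List
from unittest import mock


def read_with_detection(filename: str, detect: Callable[[str], str]) -> List[str]:
    """Hypothetical function under test; `detect` returns an encoding name."""
    encoding = detect(filename)
    try:
        with open(filename, encoding=encoding, newline="") as f:
            return f.readlines()
    except LookupError:
        print(f"ERROR: Unknown encoding {encoding}: {filename}", file=sys.stderr)
        raise


def check_filename_reported_for_unknown_encoding() -> str:
    """Force the LookupError branch and return what was written to stderr."""
    # The mock replaces real detection so the check controls the encoding name.
    detect = mock.Mock(return_value="no-such-encoding")
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        path = tmp.name
    captured = io.StringIO()
    try:
        with contextlib.redirect_stderr(captured):
            try:
                read_with_detection(path, detect)
            except LookupError:
                pass
            else:
                raise AssertionError("expected LookupError")
    finally:
        os.remove(path)
    detect.assert_called_once_with(path)
    assert path in captured.getvalue()
    return captured.getvalue()


report = check_filename_reported_for_unknown_encoding()
```

Under pytest, the `capsys` fixture could take the place of `contextlib.redirect_stderr`; the standard-library version is used here only to keep the sketch self-contained.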

@DimitriPapadopoulos
Collaborator

I am happy with codecov failing on exceptions, but I don't have rights to merge when CI tests fail. With that said, adding tests looks like the best option.

@corneliusroemer
Contributor Author

corneliusroemer commented Aug 20, 2024 via email
