I tried to parse an XML file with BeautifulSoup:

    content = open(filename, encoding='utf-8').read()
    return BeautifulSoup(content)

I then checked the source file's encoding with chardetect, which reported ASCII:

$ chardetect ../complete_data/sample.xml
../complete_data/sample.xml: ascii with confidence 1.0

However, my program still fails with the exception below.

How can I fix this, and how can I determine the correct encoding in the future? The exception message from Python is not very helpful.

Exception

Traceback (most recent call last):
  File "parser_factory.py", line 97, in <module>
    test_shareholder_meetings()
  File "parser_factory.py", line 81, in test_shareholder_meetings
    _import_source_files(collection_name="shareholder_meetings", dataset_name="WSH_BoD_Shareholder")
  File "parser_factory.py", line 78, in _import_source_files
    parser(f, collection_name).import_data()
  File "/workspace/balala-wsh/worker/parser_base.py", line 21, in __init__
    self.soup = self.read_file_in_bs(filename)
  File "/workspace/balala-wsh/worker/parser_base.py", line 30, in read_file_in_bs
    content = open(filename, encoding='utf-8').read()
  File "/Users/sample_user/.pyenv/versions/3.4.3/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 180145: invalid continuation byte

2 Answers

You can try 'cp1252' to decode the text.
I believe the text you are reading is not UTF-8.
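
A minimal sketch of that idea, assuming the file is either UTF-8 or cp1252. The helper name `read_text_with_fallback` is made up for illustration and is not part of BeautifulSoup or the standard library. It reads the raw bytes once and returns the first decoding that succeeds, together with the encoding that worked:

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first encoding that decodes cleanly.

    The default candidate list is an assumption; put the strictest
    encoding first, because permissive single-byte encodings rarely fail.
    """
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of %r could decode %s" % (encodings, path))
```

The returned text can then be passed to BeautifulSoup in place of `content`. Note that a successful decode only means the bytes are valid in that encoding, not that the characters are the ones the author intended, so it is worth checking the result by eye.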

chardet does not examine the entire file. If it contains a lone 0xE7, it's certainly not ASCII, and apparently not UTF-8, either.

Perhaps https://tripleee.github.io/8bit#e7 can help you determine what it really is.
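
To look at the offending bytes yourself, something along these lines may help. It is only a sketch: the function names and the list of candidate encodings are my own assumptions, not anything chardet or Python prescribes. The first function lists the offsets of the first few bytes outside the ASCII range with some surrounding context, and the second shows how a snippet decodes under each candidate:

```python
def find_non_ascii(raw, context=10, limit=5):
    """Return (offset, byte value, surrounding bytes) for the first bytes >= 0x80."""
    hits = []
    for offset, value in enumerate(raw):  # iterating bytes yields ints in Python 3
        if value >= 0x80:
            start = max(0, offset - context)
            hits.append((offset, value, raw[start:offset + context + 1]))
            if len(hits) >= limit:
                break
    return hits


def decode_candidates(snippet, encodings=("cp1252", "latin-1", "cp1250", "mac_roman")):
    """Map each candidate encoding to its decoding of snippet (None if it fails)."""
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = snippet.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results
```

Read the file in binary mode (`open(filename, 'rb').read()`) and pass the bytes to `find_non_ascii`. For the file in the question, one of the reported offsets should match position 180145 from the traceback. Feeding the surrounding bytes to `decode_candidates` then shows which encoding produces a readable word.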
