Here is the full script:

import requests
import bs4


res = requests.get('https://example.com')
soup = bs4.BeautifulSoup(res.text, 'lxml')
page_HTML_code = soup.prettify()

multiline_code = """{}""".format(page_HTML_code)

f = open("testfile.txt","w+")
f.write(multiline_code)
f.close()

I'm trying to write the entire downloaded HTML to a file while keeping it neatly formatted.

I understand that the write fails because certain characters in the text can't be encoded, but I'm not sure how to encode the text correctly.

Can anyone help?

This is the error message I get:

"C:\Location", line 16, in <module>
    f.write(multiline_code)
  File "C:\\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0421' in position 209: character maps to <undefined>
  • Try open('testfile.txt', 'wb') for writing as a binary file. To read the file you will then need to open it with open('testfile.txt', 'rb'). Commented May 9, 2018 at 16:56
  • Also, use with open('testfile.txt', 'wb') as a_file: followed by an indented a_file.write(...) instead of using explicit open and close statements. Context managers (the with ... as ...: syntax) are less likely to go wrong. Commented May 9, 2018 at 16:57
  • You could try encoding with .encode('utf-8'), although I think you might have the same problem. You can also choose to ignore errors with .encode('utf-8', errors='ignore') or one of several other options listed here. Commented May 9, 2018 at 17:16
  • For instance, I think .encode('utf-8', errors='backslashreplace') may replace the unknown character with the literal string '\u0421', so you wouldn't lose that information, but you may have to do something funky to decode it when you read it back. Commented May 9, 2018 at 17:18
  • @Engineero thanks for your help. :) I just posted an answer to my own question that did the trick. Commented May 9, 2018 at 17:18
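
The encoding options mentioned in these comments can be compared in a short sketch. The sample string is an assumption chosen for illustration (it contains '\u0421', the character named in the traceback), and cp1252 is used because that is the codec the traceback shows. Nothing is fetched from the network.

```python
# Sample text containing U+0421, the character named in the traceback.
# This string is a hypothetical stand-in for the downloaded page.
sample = "<p>\u0421</p>"

# cp1252 has no mapping for U+0421, so strict encoding fails
# in the same way as the write in the question.
try:
    sample.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors='ignore' silently drops the character, while
# errors='backslashreplace' keeps it as an ASCII escape sequence.
ignored = sample.encode("cp1252", errors="ignore")
escaped = sample.encode("cp1252", errors="backslashreplace")

# UTF-8 can encode U+0421 directly, so no error handler is needed.
utf8_bytes = sample.encode("utf-8")
```

With UTF-8 the character survives a round trip unchanged, whereas 'ignore' loses it and 'backslashreplace' leaves an escape that would have to be undone when reading the file back.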

1 Answer

I did some digging around and this worked:

import requests
import bs4

res = requests.get('https://example.com')
soup = bs4.BeautifulSoup(res.text, 'lxml')
page_HTML_code = soup.prettify()

# Passing encoding='utf-8' when opening the file did the trick. Without it,
# open() uses the platform default (cp1252 here), which cannot encode '\u0421'.
with open('testfile.html', 'w', encoding='utf-8') as fb:
    fb.write(page_HTML_code)
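
As a quick check that does not depend on fetching a page, here is a minimal sketch that writes a string containing '\u0421' and reads it back. The sample HTML and the temporary file paths are assumptions for illustration only. The 'wb' variant follows the suggestion in the comments on the question, with the string encoded to bytes before writing.

```python
import os
import tempfile

# Hypothetical stand-in for the output of soup.prettify().
html = "<html><body>\u0421</body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Text mode with an explicit encoding, as in the answer above.
    text_path = os.path.join(tmp_dir, "testfile.html")
    with open(text_path, "w", encoding="utf-8") as fb:
        fb.write(html)
    with open(text_path, "r", encoding="utf-8") as fb:
        text_round_trip = fb.read()

    # Binary mode: the string has to be encoded to bytes by hand,
    # and decoded again after reading.
    binary_path = os.path.join(tmp_dir, "testfile.bin")
    with open(binary_path, "wb") as fb:
        fb.write(html.encode("utf-8"))
    with open(binary_path, "rb") as fb:
        binary_round_trip = fb.read().decode("utf-8")
```

Both variants should return the original string. The text-mode version is shorter because open() handles the encoding and decoding.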