Here is the full script:
import requests
import bs4
res = requests.get('https://example.com')
soup = bs4.BeautifulSoup(res.text, 'lxml')
page_HTML_code = soup.prettify()
multiline_code = """{}""".format(page_HTML_code)
f = open("testfile.txt","w+")
f.write(multiline_code)
f.close()
So I'm trying to write the entire downloaded HTML to a file while keeping it neat and clean.
I do understand that it has problems with the text and can't save certain characters, but I'm not sure how to encode the text correctly.
Can anyone help?
This is the error message I get:
"C:\Location", line 16, in <module>
f.write(multiline_code)
File "C:\\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0421' in position 209: character maps to <undefined>
Open the file with open('testfile.txt', 'wb') to write it as a binary file. To read the file you will then need to open it with open('testfile.txt', 'rb').
Use with open('testfile.txt', 'wb') as a_file: followed by an indented a_file.write(...) instead of explicit open and close statements. Context managers (the with ... as ...: syntax) are less likely to go wrong.
Encode the text with .encode('utf-8') before writing it. A file opened in binary mode expects bytes, and UTF-8 can encode characters such as '\u0421' that cp1252 cannot. You can also choose to ignore errors with .encode('utf-8', errors='ignore') or one of the several other error handlers listed in the Python codecs documentation.
With an encoding that cannot represent the character, such as cp1252, .encode(..., errors='backslashreplace') replaces the unknown character with the literal string '\u0421', so you wouldn't lose that information, but you may have to do something funky to decode it when you read it back.
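The suggestions above can be sketched roughly as follows. This is a minimal illustration, not the only way to do it: the function names (save_html_binary, load_html_binary, save_html_text, encode_lossy) are hypothetical, the sample string stands in for the output of soup.prettify(), and the text-mode variant with an explicit encoding argument is an alternative that the answer does not mention.

```python
import os
import tempfile


def save_html_binary(path, html_text):
    """Encode the text as UTF-8 and write the resulting bytes in binary mode."""
    with open(path, "wb") as a_file:
        a_file.write(html_text.encode("utf-8"))


def load_html_binary(path):
    """Read the bytes back and decode them with the same encoding."""
    with open(path, "rb") as a_file:
        return a_file.read().decode("utf-8")


def save_html_text(path, html_text):
    """Alternative (an assumption, not from the answer): text mode with an explicit encoding."""
    with open(path, "w", encoding="utf-8") as a_file:
        a_file.write(html_text)


def encode_lossy(text, errors):
    """Encode to cp1252, which cannot represent '\u0421', using the given error handler."""
    return text.encode("cp1252", errors=errors)


if __name__ == "__main__":
    # Stand-in for soup.prettify(); contains the character from the traceback.
    sample = "<p>\u0421</p>"
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "testfile.txt")
        save_html_binary(path, sample)
        print(load_html_binary(path) == sample)
    # Error handlers only matter when the codec cannot represent a character.
    print(encode_lossy(sample, "ignore"))
    print(encode_lossy(sample, "backslashreplace"))
```

With UTF-8 the character survives the round trip unchanged, so the error handlers are only needed if the file really has to stay in cp1252.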