I tried to parse an HTML table into CSV using Python with the following script:

from bs4 import BeautifulSoup
import requests
import csv


csvFile = open('log.csv', 'w', newline='')
writer = csv.writer(csvFile)
def parse():
    html = requests.get('https://en.wikipedia.org/wiki/Comparison_of_text_editors')
    bs = BeautifulSoup(html.text, 'lxml')
    table = bs.select_one('table.wikitable')
    rows = table.select('tr')
    for row in rows:
        csvRow = []
        for cell in row.findAll(['th', 'td']):
            csvRow.append(cell.getText())
        writer.writerow(csvRow)
        print(csvRow)


parse()
csvFile.close()

This code produced a cleanly formatted CSV file with no encoding issues, and everything was fine until it reached the row for Enrico Tröger's Geany. My script was unable to write ö into the CSV file, so I tried this:

csvRow.append(cell.text.encode('ascii', 'replace'))

instead of this:

csvRow.append(cell.getText())

That ran without errors, except that each table cell was wrapped in b''.

So, how can I get a cleanly formatted CSV file without encoding issues (as in the first attempt), with all non-ASCII characters replaced or ignored (as in the second attempt), using my script?

  • Can you add the full error traceback with the UnicodeDecodeError to the question? Commented Jul 13, 2018 at 15:27

2 Answers


Change this one:

csvFile = open('log.csv', 'w', newline='')

To this one:

csvFile = open('log.csv', 'w', newline='', encoding='utf8')
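Here is a minimal sketch of that change in context. The sample rows and the temporary file path are assumptions made for illustration (they stand in for the scraped table and for log.csv), and the with block is just one way to make sure the file gets closed:

```python
import csv
import os
import tempfile

# Hypothetical sample rows standing in for the scraped table cells.
rows = [
    ["Name", "Creator"],
    ["Geany", "Enrico Tröger"],
]

# A temporary path is used only so the sketch does not overwrite a real log.csv.
path = os.path.join(tempfile.mkdtemp(), "log.csv")

# encoding='utf-8' makes the output independent of the system default encoding.
with open(path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerows(rows)

# Read the file back with the same encoding to confirm the round trip.
with open(path, newline="", encoding="utf-8") as csv_file:
    read_back = list(csv.reader(csv_file))

print(read_back)
```

Whatever reads the file later has to use the same encoding, otherwise the ö will come back garbled.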

From the csv module documentation:

Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:

import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
         print(row)

The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.

I suppose your system default encoding is not utf8. You can check it like this:

import locale
# print() is needed when this runs as a script rather than in the interactive shell
print(locale.getpreferredencoding())
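To see whether that default can actually represent the character that broke the script, a small check like the following may help. It is only a sketch: can_encode is a hypothetical helper written for this example, and the sample string is taken from the question.

```python
import locale

text = "Enrico Tröger"


def can_encode(value, encoding):
    """Return True if value can be encoded with the given encoding."""
    try:
        value.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        # UnicodeEncodeError: the encoding has no representation for a character.
        # LookupError: Python does not know the encoding name at all.
        return False


default_encoding = locale.getpreferredencoding()
print("system default:", default_encoding)
print("default can encode:", can_encode(text, default_encoding))
print("ascii can encode:", can_encode(text, "ascii"))
print("utf-8 can encode:", can_encode(text, "utf-8"))
```

If the line for the system default prints False, that would explain why writing the Geany row failed without an explicit encoding argument.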

Hope it helps!


1 Comment

That worked, but I needed to replace csvRow.append(cell.text.encode('ascii', 'replace')) with csvRow.append(cell.getText())

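Looks like the csv module expects strings, not bytes. It converts non-string values with str(), which is why each cell ended up wrapped in b''. So you could decode your bytes back to a string before passing them: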

cell.text.encode('ascii', 'replace').decode('ascii')
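A short sketch of what that does, using the name from the question as sample data. It also shows the 'ignore' error handler and, as an alternative I am assuming fits the goal of ASCII-only output, passing errors='replace' to open() so the replacement happens at write time. The temporary path is only there for the example.

```python
import csv
import os
import tempfile

name = "Enrico Tröger"

# Encoding alone gives bytes; str() of bytes is where the b'' wrapper comes from.
as_bytes = name.encode("ascii", "replace")
print(str(as_bytes))  # b'Enrico Tr?ger'

# Decoding again gives a plain string the csv module can write as-is.
replaced = as_bytes.decode("ascii")
ignored = name.encode("ascii", "ignore").decode("ascii")
print(replaced)  # Enrico Tr?ger
print(ignored)   # Enrico Trger

# Alternative: leave the cell text untouched and let the file object
# replace characters that ASCII cannot represent.
path = os.path.join(tempfile.mkdtemp(), "log.csv")
with open(path, "w", newline="", encoding="ascii", errors="replace") as csv_file:
    csv.writer(csv_file).writerow([name])

with open(path, newline="", encoding="ascii") as csv_file:
    file_row = next(csv.reader(csv_file))
print(file_row)  # ['Enrico Tr?ger']
```

With 'replace' each unsupported character becomes a question mark, while 'ignore' drops it silently, so which one is preferable depends on whether the loss should stay visible in the output.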
