I have data in the database and I want to export it to a CSV file. The data is in Japanese, and for further use it needs to be encoded in UTF-8.

This is my script to fetch the data and write it to the CSV file:

import mysql.connector
from mysql.connector import errorcode

import sys
import csv

query = 'SELECT * FROM `images-data`'

try:
    cnx = mysql.connector.connect(user='root', password='1234',
                                  host='127.0.0.1',
                                  database='sotsuken-test-db')
    cur = cnx.cursor()
    cur.execute(query)
    result = cur.fetchall()

    c = csv.writer(open("db-data.csv", "w"))
    for row in result:
        c.writerow(row)

except mysql.connector.Error as err:
    if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
        print("Something is wrong with your user name or password")
    elif err.errno == errorcode.ER_BAD_DB_ERROR:
        print("Database does not exist")
    else:
        print(err)
else:
    cnx.close()

I can create the CSV file, and the data is exported in UTF-8, but the rows in my CSV file look like this:

1,b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88\xe3\x81\xa7\xe3\x81\x94\xe3\x81\x8a\xe3\x81\x96\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99'

From my research, it seems the data is being written as bytes objects (maybe I'm wrong here). I need the data written as UTF-8 text, without the b''.

I know that I can use decode("UTF-8") to turn it into a string, but I don't see how to apply that to csv.writerow.

Can you give me any suggestions?

  • Why not use csv.DictWriter? You can decode('UTF-8') your individual cell values, and with DictWriter you can write to specific columns in your CSV file. Commented Aug 2, 2018 at 5:48
  • Thank you for a nice, concise, well-written question following all the guidelines for new questions/on topic/mvce Commented Aug 2, 2018 at 5:52
  • 1
    @SuryaTej How would DictWriter solve this any better than writer? Sure, it's nicer for other reasons, if you know what your columns are, but it doesn't change the fact that some of those columns are bytes. Commented Aug 2, 2018 at 5:55
  • As a side note, you don't actually need to do fetchall here. You can just iterate for row in cur: and the cursor should give you one row at a time, buffering up as many rows in memory as it thinks would be most efficient, instead of reading them all into memory no matter what. Also, it's usually a bad idea to not close files you opened for writing, either with a close() call or a with statement. Commented Aug 2, 2018 at 5:59

1 Answer

csv.writer.writerow just takes a list of whatever you've got,¹ calls str on each element, and puts them together into a line of CSV.

So, if what you've got is bytes objects, then it's going to call str on them and get strings like b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88\xe3\x81\xa7\xe3\x81\x94\xe3\x81\x8a\xe3\x81\x96\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99'.
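You can see this happen without a database at all; writing to an in-memory buffer shows exactly what csv.writer does to a bytes value:

```python
import csv
import io

# str() of a bytes object is its repr, b'...', and that is exactly
# what csv.writer writes into the output.
buf = io.StringIO()
csv.writer(buf).writerow([1, "テスト".encode("utf-8")])
print(buf.getvalue())  # 1,b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'
```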

The solution is to give it strings.


The simplest way to do it is by decoding those bytes:

I know that I can use decode("UTF-8") to make it into the string, but I cannot apply that to csv.writerow

Sure you can. I don't know what your columns are, so I'll make up my own example:

for image_id, image_name in result:
    c.writerow([image_id, image_name.decode('UTF-8')])

That's really all there is to it.


What if your code has absolutely no idea what the columns in the database are? Well, in that case, you have to get a little clever. As long as you know that all of the bytes columns are really Unicode text disguised as UTF-8, you can just switch on the type:

for row in result:
    textrow = [col.decode('UTF-8') if isinstance(col, bytes) else col for col in row]
    c.writerow(textrow)

That's a bit ugly, but reading * from a database where you have no idea what the columns are is inherently an ugly problem.
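The same type switch can be checked end to end with made-up rows (the values here are invented for the demo, and None shows that non-bytes columns pass through untouched):

```python
import csv
import io

rows = [(1, "テスト".encode("utf-8")), (2, "plain"), (3, None)]
buf = io.StringIO()
c = csv.writer(buf)
for row in rows:
    # Decode only the bytes columns; leave everything else alone.
    textrow = [col.decode("utf-8") if isinstance(col, bytes) else col
               for col in row]
    c.writerow(textrow)
print(buf.getvalue())
```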


However, there's probably a much better solution.

MySQL Connector/Python turns all CHAR, TEXT, and similar columns into Unicode str values by default.² But BINARY, BLOB, and similar columns are always returned as bytes.

If this column is meant to represent Unicode text, set it to a text type instead of a binary one in the database. Then you won't have a problem in the first place—in this script, or in any other tools.
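For example, if the Japanese text were stored in a BLOB column, a one-time migration along these lines would fix it at the source. The table name comes from the question, but the column name `caption` and the target type are assumptions for illustration:

```python
# Hypothetical one-time migration: converting the column from a binary
# type to a text type makes the connector return str instead of bytes.
# The column name `caption` is an assumption, not from the question.
alter_sql = (
    "ALTER TABLE `images-data` "
    "MODIFY `caption` TEXT CHARACTER SET utf8mb4"
)
# cur.execute(alter_sql)  # run once against the live database
```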


1. Actually, it appears to be illegal to call it with a list of anything but strings and numbers, according to the docs. But practically, it takes anything at all; it just doesn't do anything very useful with any types besides strings and numbers…

2. That is, unless you explicitly pass use_unicode=False in the connect call or set it similarly elsewhere.
