
I'm currently trying to gather text data from a CSV file and convert it into readable XML according to a pre-defined schema. My issue seems to stem from reading and writing Norwegian special characters (ø, æ, å) and from not understanding how to use unicode properly.

import csv

with open(inputfile, 'rb') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    idflag = False
    try:
        for row in reader:
            if idflag:
                #print row[0]
                toEBU(row, id_row)
                #idflag = False  # for testing, limits iterations
            if row[0].lower() == 'id':
                idflag = True
                id_row = row
    except csv.Error as e:
        print 'csv error on line %d: %s' % (reader.line_num, e)

This is the code for reading the .csv file. The toEBU function handles the XML conversion:

import xml.etree.ElementTree as ET

def toEBU(row, id_row):
    file_id = unicode(row[0], "utf-8")
    file_source = unicode(row[2], "utf-8")
    file_type = unicode(row[3], "utf-8")
    file_name = unicode(row[4], "utf-8")
    file_desc = unicode(row[5], "utf-8")
    file_keys = unicode(row[9], "utf-8")
    file_rights = unicode(row[10], "utf-8")
    keywords = file_keys.split(',')
    #print row[0], row[4]
    #Remember to use .strip() to remove spaces before or after a string

    if file_name == '' or row[1] == 'Nei':
        print 'Name Error'
        return

    tree = ET.parse('EBUBase.xml')
    EBUMain = tree.getroot()
    EBUMain.tag = 'ebucore:ebuCoreMain'
    coreMetaData = ET.Element('ebucore:coreMetaData')
    EBUMain.append(coreMetaData)

    indent(EBUMain)

    tree = ET.ElementTree(EBUMain)
    xmlfile = 'xml\\' + file_id.strip() + '.xml'

    #xmlfile = xmlfile.encode('utf-8')
    print xmlfile
    try:
        tree.write(xmlfile, xml_declaration=True, encoding='utf-8', method="xml")
    except IOError:
        print 'Invalid Filename'
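Separately from the encoding question: creating tags like 'ebucore:coreMetaData' as literal strings embeds a colon in the tag name without any namespace declaration, so the output won't validate against a namespaced schema. ElementTree can emit a proper prefix if the namespace is registered first — a minimal sketch, where the URI is a placeholder and the real one should come from the EBU Core schema:

```python
import xml.etree.ElementTree as ET

# Placeholder URI for illustration; substitute the namespace declared
# in the actual EBU Core schema.
EBUCORE_NS = 'urn:ebu:metadata-schema:ebucore'
ET.register_namespace('ebucore', EBUCORE_NS)

# Build elements with fully-qualified {uri}localname tags; the registered
# prefix and an xmlns:ebucore declaration appear at serialization time.
root = ET.Element('{%s}ebuCoreMain' % EBUCORE_NS)
ET.SubElement(root, '{%s}coreMetadata' % EBUCORE_NS)
xml_bytes = ET.tostring(root, encoding='utf-8')
```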

The error that I get is the following:

Traceback (most recent call last):
  File "extractor.py", line 121, in <module>
    main(sys.argv[1:])
  File "extractor.py", line 106, in main
    toEBU(row,id_row)
  File "extractor.py", line 26, in toEBU
    file_name=unicode(row[4],"utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 9: invalid continuation byte

And the string in row[4] is "Bryllup på Terningen".

I've tried reading the data with a unicode CSV parser, but that also produced errors, so now I'm trying to convert the characters to unicode before writing the XML. Previously I had issues when writing the same strings, and the code would fail at the tree.write(...) calls.
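Before guessing at encodings, it can help to inspect the raw bytes of a suspect field. In iso8859-1, 'å' is the single byte 0xe5; in utf-8 it is the two-byte sequence 0xc3 0xa5 — and a lone 0xe5 is exactly the "invalid continuation byte" the utf-8 codec complains about. A small diagnostic sketch (the byte string below stands in for a field read from the file):

```python
# Diagnostic sketch: look at the raw bytes rather than assuming an encoding.
raw = b'Bryllup p\xe5 Terningen'   # bytes as they might sit in the file
print(repr(raw))                   # shows the problematic 0xe5 byte
print(raw.decode('iso8859-1'))     # decodes cleanly to u'Bryllup på Terningen'
```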

Edit: Added a sample from the csv file:

18.6.,,Leveranse,Ferdig redigert 30 min AV presentason,Visning,Formidling,Digital historie,Ingen planer,,,
,,Kontaktperson,Tittel,E-post,Telefon,,,,,
,,XXXX XXXXX XXXXX,XXXXXXXXXXX,[email protected],XXXXXXXX,,,,,
,,,,,,,,,,
Id,Arkiv,Kilde,Modalitet,"Parametre, Filnavn","Beskrivelse, fri tekst",Script,Dreiebok,Opptaksplan,Nøkkelord,Rettigheter
D5.1,Nei,E,Tekst,,Manus til videoforelesning (inneholder deler og bilder  som beskrives under),Historisk oversikt over fyr og fyrliv i Frøya og Hitra,,Etter avtale med MMS,"Fyr, fyrstasjon",
D5.2,Ja,E,Video,25 minutter??,Film fyrvokter,Inspeksjonstur på Slettringen,,Opptak gjort av «Frøya Film og bilde» v Petter Vågsvær 2011,Fyrvokter slettringen,??
D5.3,Ja,E,Tekst,Fyr i krig,Digital fortelling,"Krigshistorie på fyr, med fokus på fyr i Trlag",,,"Krig, luftangrep, terningen",

The first lines are ignored, and only the lines beginning with "D5.X" are sent to "toEBU".

  • Are you sure the file you are reading is encoded using utf-8? Commented Nov 30, 2013 at 14:52
  • No, as mentioned, I really have no idea what I'm doing. It's a csv document. I sort of assumed that trying to convert everything to utf-8 would magically solve everything, as I thought utf-8 would support 'æøå'. Commented Nov 30, 2013 at 15:07
  • Could you paste an example of a couple of words containing non-ASCII characters exactly as they appear in the original CSV? For example, by adding the output of grep 'Bryllup' input.csv | hexdump -C. Commented Nov 30, 2013 at 15:21
  • Could you give me a little sample of the csv file? Commented Nov 30, 2013 at 15:22
  • Added a sample, re-edited to remove contact information. Commented Nov 30, 2013 at 16:13

1 Answer


To boil it down, your file is likely encoded in 'iso8859-1'. I can create a (smaller) version of your file with:

from codecs import EncodedFile
with EncodedFile(open('n.txt', 'wb'), 'utf-8', 'iso8859-1') as f:
    f.write('Bryllup på Terningen')

The parameters to EncodedFile say that the data handled in Python is 'utf-8' and that the file on disk is encoded as 'iso8859-1'. Now, if I read the file using 'iso8859-1' everything is fine, but 'utf-8' reproduces your error:

>>> unicode(open('n.txt','rb').read(),'iso8859-1')
u'Bryllup p\xe5 Terningen'

>>> unicode(open('n.txt','rb').read(),'utf-8')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-8-2649b633aa10> in <module>()
----> 1 unicode(open('n.txt','rb').read(),'utf-8')

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 9: invalid continuation byte
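Given that, a sketch of the fix: decode each raw field with iso8859-1 on the way in, keep unicode text internally, and encode to utf-8 only when the XML is written (the byte string below mirrors the field from the question, not a tested pipeline):

```python
# Sketch: iso8859-1 bytes in, unicode in the middle, utf-8 bytes out.
raw = b'Bryllup p\xe5 Terningen'    # field as read from the iso8859-1 CSV
text = raw.decode('iso8859-1')      # unicode: u'Bryllup på Terningen'
utf8 = text.encode('utf-8')         # the single byte 0xe5 becomes 0xc3 0xa5
print(repr(utf8))
```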

2 Comments

Thank you! Is there any form of accepted "best practice" for how to encode the strings in XML files? I'm wondering if I should bother converting to unicode, and just keep everything in the iso8859 format, or if I should write using utf-8.
I know this is a couple years old, but I just ran into a similar issue and wanted to note that I ended up converting everything to unicode rather than writing in utf-8. Not sure if this will help future users or if it's even best practice, but it solved it for me.
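On the best-practice question in the comments: the usual advice is to decode bytes to unicode at the input boundary and encode once, as utf-8, at the output boundary. If the input encoding can vary, a fallback decoder is a common (if crude) pattern — a sketch, with utf-8 tried first because it fails loudly on non-utf-8 bytes, while iso8859-1 accepts any byte:

```python
def to_text(raw):
    # utf-8 raises UnicodeDecodeError on bytes it cannot interpret,
    # so it is safe to try first. iso8859-1 maps every possible byte
    # and therefore never fails (though it may be wrong for files in
    # some other 8-bit encoding).
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('iso8859-1')
```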
