
I'm currently trying to gather text data from a CSV file and convert it into readable XML according to a pre-defined schema. My issue seems to stem from reading and writing Norwegian special characters (ø, æ, å) and from not understanding how to use unicode properly.

import csv

with open(inputfile, 'rb') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    idflag = False
    try:
        for row in reader:
            if idflag:
                #print row[0]
                toEBU(row, id_row)
                #idflag = False  # for testing, limits iterations
            if row[0].lower() == 'id':
                idflag = True
                id_row = row
    except csv.Error as e:
        print 'csv error on line %d: %s' % (reader.line_num, e)

This is the code for reading the .csv file. The toEBU function handles the XML conversion:

import xml.etree.ElementTree as ET

def toEBU(row, id_row):
    file_id = unicode(row[0], "utf-8")
    file_source = unicode(row[2], "utf-8")
    file_type = unicode(row[3], "utf-8")
    file_name = unicode(row[4], "utf-8")
    file_desc = unicode(row[5], "utf-8")
    file_keys = unicode(row[9], "utf-8")
    file_rights = unicode(row[10], "utf-8")
    keywords = file_keys.split(',')
    #print row[0], row[4]
    #Remember to use .strip() to remove spaces before or after a string

    if file_name == '' or row[1] == 'Nei':
        print 'Name Error'
        return

    tree = ET.parse('EBUBase.xml')
    EBUMain = tree.getroot()
    EBUMain.tag = 'ebucore:ebuCoreMain'
    coreMetaData = ET.Element('ebucore:coreMetaData')
    EBUMain.append(coreMetaData)

    indent(EBUMain)

    tree = ET.ElementTree(EBUMain)
    xmlfile = 'xml\\' + file_id.strip() + '.xml'

    #xmlfile = xmlfile.encode('utf-8')
    print xmlfile
    try:
        tree.write(xmlfile, xml_declaration=True, encoding='utf-8', method="xml")
    except IOError:
        print 'Invalid Filename'
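Separately from the encoding question: creating tags like 'ebucore:coreMetaData' as literal strings embeds a colon in the tag name without any namespace declaration, so the output won't validate against a namespaced schema. ElementTree can emit a proper prefix if the namespace is registered first — a minimal sketch, where the URI is a placeholder and the real one should come from the EBU Core schema:

```python
import xml.etree.ElementTree as ET

# Placeholder URI for illustration; substitute the namespace declared
# in the actual EBU Core schema.
EBUCORE_NS = 'urn:ebu:metadata-schema:ebucore'
ET.register_namespace('ebucore', EBUCORE_NS)

# Build elements with fully-qualified {uri}localname tags; the registered
# prefix and an xmlns:ebucore declaration appear at serialization time.
root = ET.Element('{%s}ebuCoreMain' % EBUCORE_NS)
ET.SubElement(root, '{%s}coreMetadata' % EBUCORE_NS)
xml_bytes = ET.tostring(root, encoding='utf-8')
```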

The error that I get is the following:

Traceback (most recent call last):
  File "extractor.py", line 121, in <module>
    main(sys.argv[1:])
  File "extractor.py", line 106, in main
    toEBU(row,id_row)
  File "extractor.py", line 26, in toEBU
    file_name=unicode(row[4],"utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 9: invalid continuation byte

And the string in row[4] is "Bryllup på Terningen".

I've tried reading the data with a unicode CSV parser, but that also produced errors, so now I'm trying to convert the characters to unicode before writing the XML. Previously I had issues when writing the same strings, and the code would fail at the tree.write(...) calls.
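Before guessing at encodings, it can help to inspect the raw bytes of a suspect field. In iso8859-1, 'å' is the single byte 0xe5; in utf-8 it is the two-byte sequence 0xc3 0xa5 — and a lone 0xe5 is exactly the "invalid continuation byte" the utf-8 codec complains about. A small diagnostic sketch (the byte string below stands in for a field read from the file):

```python
# Diagnostic sketch: look at the raw bytes rather than assuming an encoding.
raw = b'Bryllup p\xe5 Terningen'   # bytes as they might sit in the file
print(repr(raw))                   # shows the problematic 0xe5 byte
print(raw.decode('iso8859-1'))     # decodes cleanly to u'Bryllup på Terningen'
```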

Edit: Added a sample from the csv file:

18.6.,,Leveranse,Ferdig redigert 30 min AV presentason,Visning,Formidling,Digital historie,Ingen planer,,,
,,Kontaktperson,Tittel,E-post,Telefon,,,,,
,,XXXX XXXXX XXXXX,XXXXXXXXXXX,[email protected],XXXXXXXX,,,,,
,,,,,,,,,,
Id,Arkiv,Kilde,Modalitet,"Parametre, Filnavn","Beskrivelse, fri tekst",Script,Dreiebok,Opptaksplan,Nøkkelord,Rettigheter
D5.1,Nei,E,Tekst,,Manus til videoforelesning (inneholder deler og bilder  som beskrives under),Historisk oversikt over fyr og fyrliv i Frøya og Hitra,,Etter avtale med MMS,"Fyr, fyrstasjon",
D5.2,Ja,E,Video,25 minutter??,Film fyrvokter,Inspeksjonstur på Slettringen,,Opptak gjort av «Frøya Film og bilde» v Petter Vågsvær 2011,Fyrvokter slettringen,??
D5.3,Ja,E,Tekst,Fyr i krig,Digital fortelling,"Krigshistorie på fyr, med fokus på fyr i Trlag",,,"Krig, luftangrep, terningen",

The first lines are ignored, and only the lines beginning with "D5.X" are sent to "toEBU".

  • Are you sure the file you are reading is encoded using utf-8? Commented Nov 30, 2013 at 14:52
  • No, as mentioned, I really have no idea what I'm doing. It's a csv document. I sort of assumed that trying to convert everything to utf-8 would magically solve everything, as I thought utf-8 would support 'æøå'. Commented Nov 30, 2013 at 15:07
  • Could you paste an example of a couple of words containing non-ASCII characters exactly as they appear in the original CSV? For example, by adding the output of grep 'Bryllup' input.csv | hexdump -C. Commented Nov 30, 2013 at 15:21
  • Could you give me a little sample of the csv file? Commented Nov 30, 2013 at 15:22
  • Added a sample, re-edited to remove contact information. Commented Nov 30, 2013 at 16:13

1 Answer


To boil it down, your file is likely encoded in 'iso8859-1'. I can create a (smaller) version of your file with:

from codecs import EncodedFile
with EncodedFile(open('n.txt', 'wb'), 'utf-8', 'iso8859-1') as f:
    f.write('Bryllup på Terningen')

The parameters to EncodedFile say that the data handled in Python is 'utf-8' and that the file on disk is encoded as 'iso8859-1'. Now, if I read the file using 'iso8859-1' everything is fine, but 'utf-8' reproduces your error:

>>> unicode(open('n.txt','rb').read(),'iso8859-1')
u'Bryllup p\xe5 Terningen'

>>> unicode(open('n.txt','rb').read(),'utf-8')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-8-2649b633aa10> in <module>()
----> 1 unicode(open('n.txt','rb').read(),'utf-8')

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 9: invalid continuation byte
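Given that, a sketch of the fix: decode each raw field with iso8859-1 on the way in, keep unicode text internally, and encode to utf-8 only when the XML is written (the byte string below mirrors the field from the question, not a tested pipeline):

```python
# Sketch: iso8859-1 bytes in, unicode in the middle, utf-8 bytes out.
raw = b'Bryllup p\xe5 Terningen'    # field as read from the iso8859-1 CSV
text = raw.decode('iso8859-1')      # unicode: u'Bryllup på Terningen'
utf8 = text.encode('utf-8')         # the single byte 0xe5 becomes 0xc3 0xa5
print(repr(utf8))
```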

2 Comments

Thank you! Is there any form of accepted "best practice" for how to encode the strings in XML files? I'm wondering if I should bother converting to unicode, and just keep everything in the iso8859 format, or if I should write using utf-8.
I know this is a couple years old, but I just ran into a similar issue and wanted to note that I ended up converting everything to unicode rather than writing in utf-8. Not sure if this will help future users or if it's even best practice, but it solved it for me.
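On the best-practice question in the comments: the usual advice is to decode bytes to unicode at the input boundary and encode once, as utf-8, at the output boundary. If the input encoding can vary, a fallback decoder is a common (if crude) pattern — a sketch, with utf-8 tried first because it fails loudly on non-utf-8 bytes, while iso8859-1 accepts any byte:

```python
def to_text(raw):
    # utf-8 raises UnicodeDecodeError on bytes it cannot interpret,
    # so it is safe to try first. iso8859-1 maps every possible byte
    # and therefore never fails (though it may be wrong for files in
    # some other 8-bit encoding).
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('iso8859-1')
```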
