I'm trying to read text data from a CSV file and convert it into readable XML according to a predefined schema. My issue seems to stem from reading and writing Norwegian special characters (ø, æ, å), and from not understanding how to use Unicode properly.
with open(inputfile, 'rb') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    idflag=False
    try:
        for row in reader:
            if idflag:
                #print row[0]
                toEBU(row,id_row)
                #idflag=False #for testing, limits iterations
            if row[0].lower()=='id':
                idflag=True
                id_row=row
This is the code for reading the .csv file. The toEBU function handles the XML conversion:
def toEBU(row,id_row):
    file_id=unicode(row[0],"utf-8")
    file_source=unicode(row[2],"utf-8")
    file_type=unicode(row[3],"utf-8")
    file_name=unicode(row[4],"utf-8")
    file_desc=unicode(row[5],"utf-8")
    file_keys=unicode(row[9],"utf-8")
    file_rights=unicode(row[10],"utf-8")
    keywords = file_keys.split(',')
    #print row[0],row[4]
    #Remember to use .strip() to remove spaces before or after string
    if file_name=='' or row[1]=='Nei':
        print 'Name Error'
        return
    tree = ET.parse('EBUBase.xml')
    EBUMain = tree.getroot()
    EBUMain.tag= 'ebucore:ebuCoreMain'
    coreMetaData = ET.Element('ebucore:coreMetaData')
    EBUMain.append(coreMetaData)
    indent(EBUMain)
    tree = ET.ElementTree(EBUMain)
    xmlfile='xml\\' +file_id.strip()+'.xml'
    #xmlfile=xmlfile.encode('utf-8')
    print xmlfile
    try:
        tree.write(xmlfile, xml_declaration=True, encoding='utf-8', method="xml")
    except IOError:
        print 'Invalid Filename'
The error that I get is the following:
Traceback (most recent call last):
  File "extractor.py", line 121, in <module>
    main(sys.argv[1:])
  File "extractor.py", line 106, in main
    toEBU(row,id_row)
  File "extractor.py", line 26, in toEBU
    file_name=unicode(row[4],"utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 9: invalid continuation byte
And the string in row[4] is "Bryllup på Terningen".
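One way to read this traceback: byte 0xe5 is how Latin-1 (ISO-8859-1) and Windows-1252 encode "å", and offset 9 is exactly where the "å" sits in "Bryllup på Terningen". That suggests, though it does not prove, that the file is not UTF-8. A minimal sketch reproducing the failure, assuming the raw bytes look like the hypothetical raw value below (written to run on both Python 2 and 3):

```python
# Hypothetical raw cell, assuming the file is Latin-1/Windows-1252 encoded.
raw = b'Bryllup p\xe5 Terningen'

try:
    raw.decode('utf-8')
    utf8_error_offset = None
except UnicodeDecodeError as exc:
    # 0xe5 announces a 3-byte UTF-8 sequence, but the next byte is a space,
    # which is not a valid continuation byte.
    utf8_error_offset = exc.start

# Decoding with the assumed real encoding succeeds.
text = raw.decode('latin-1')
```

If this matches the real data, replacing "utf-8" with the file's actual encoding in the unicode(...) calls would be the first thing to try.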
I've tried reading the data with a Unicode-aware CSV parser, but that also produces errors. So I'm trying to convert the characters to Unicode before writing the XML. Previously, I had issues when writing the same strings, and the code would fail at the tree.write(XX) call.
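A sketch of the usual pattern of decoding once at the input boundary and encoding once at the output boundary, assuming the CSV is Latin-1 encoded (an assumption; the real encoding has to be confirmed). The helper decode_row and the element name title are hypothetical and not part of the EBU schema; the code is written to run on Python 2 and 3:

```python
import xml.etree.ElementTree as ET

def decode_row(row, encoding='latin-1'):
    """Turn every byte-string cell of a csv row into text; leave text cells alone."""
    return [cell.decode(encoding) if isinstance(cell, bytes) else cell
            for cell in row]

# Byte strings, as Python 2's csv.reader would return them for a Latin-1 file.
row = [b'D5.2', b'Inspeksjonstur p\xe5 Slettringen']
file_id, title = decode_row(row)

# Element text holds text (unicode); the output encoding is only
# chosen at the moment the tree is serialized.
elem = ET.Element('title')  # hypothetical element name
elem.text = title
xml_bytes = ET.tostring(elem, encoding='utf-8')
```

tree.write(xmlfile, xml_declaration=True, encoding='utf-8') performs the same encoding step when writing to a file, so the strings handed to ElementTree should already be decoded text rather than raw bytes.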
Edit: Added a sample from the CSV file:
18.6.,,Leveranse,Ferdig redigert 30 min AV presentason,Visning,Formidling,Digital historie,Ingen planer,,,
,,Kontaktperson,Tittel,E-post,Telefon,,,,,
,,XXXX XXXXX XXXXX,XXXXXXXXXXX,[email protected],XXXXXXXX,,,,,
,,,,,,,,,,
Id,Arkiv,Kilde,Modalitet,"Parametre, Filnavn","Beskrivelse, fri tekst",Script,Dreiebok,Opptaksplan,Nøkkelord,Rettigheter
D5.1,Nei,E,Tekst,,Manus til videoforelesning (inneholder deler og bilder som beskrives under),Historisk oversikt over fyr og fyrliv i Frøya og Hitra,,Etter avtale med MMS,"Fyr, fyrstasjon",
D5.2,Ja,E,Video,25 minutter??,Film fyrvokter,Inspeksjonstur på Slettringen,,Opptak gjort av «Frøya Film og bilde» v Petter Vågsvær 2011,Fyrvokter slettringen,??
D5.3,Ja,E,Tekst,Fyr i krig,Digital fortelling,"Krigshistorie på fyr, med fokus på fyr i Trlag",,,"Krig, luftangrep, terningen",
The first lines are ignored, and only the lines beginning with "D5.X" are sent to "toEBU".
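The skip-until-header logic described above can be sketched as a small generator. rows_after_header is a hypothetical name, the rows are abbreviated from the sample, and the extra check for empty rows is an addition that the original loop does not have:

```python
def rows_after_header(rows, header_key='id'):
    """Yield (header, row) pairs for every row that follows the header row."""
    header = None
    for row in rows:
        if header is not None:
            yield header, row
        elif row and row[0].strip().lower() == header_key:
            # Everything before and including this row is skipped.
            header = row

# Abbreviated rows from the sample; the real rows have eleven columns.
sample = [
    ['18.6.', '', 'Leveranse'],
    ['', '', 'Kontaktperson'],
    ['Id', 'Arkiv', 'Kilde'],
    ['D5.1', 'Nei', 'E'],
    ['D5.2', 'Ja', 'E'],
]
data_ids = [row[0] for _, row in rows_after_header(sample)]
```

Passing the header along with each row mirrors the id_row argument of toEBU.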
What does grep 'Bryllup' input.csv | hexdump -C show?
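Where grep and hexdump are not available (the xml\\ path suggests Windows), a rough Python equivalent is to read the file in binary mode and print the hex of the matching lines. hex_lines is a hypothetical helper, and it is demonstrated here on made-up in-memory bytes rather than the real input.csv:

```python
import binascii

def hex_lines(data, needle):
    """Return (line, hex) pairs for every line of raw bytes containing needle."""
    return [(line, binascii.hexlify(line).decode('ascii'))
            for line in data.splitlines() if needle in line]

# Stand-in for open('input.csv', 'rb').read(); made-up, assumed Latin-1 bytes.
data = b'D5.1,Nei\nD5.4,Ja,E,Video,Bryllup p\xe5 Terningen\n'
matches = hex_lines(data, b'Bryllup')
for line, hexed in matches:
    print(hexed)
```

A lone e5 byte for "å" points to Latin-1 or Windows-1252; a UTF-8 file would show c3 a5 instead.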