Python throws ascii codec can't encode when parsing xml

Question

I was running the following code in Python:

import xml.etree.ElementTree as ET
tree = ET.parse('dplp_11.xml')
root = tree.getroot()
f = open('workfile', 'w')
for country in root.findall('article'):
    rank = country.find('year').text
    name = country.find('title').text

    if(int(rank)>2009):
        f.write(name)
        auth = country.findall('author')
        for a in auth:
            #print str(a)
            f.write(a.text)
            f.write(',')
        f.write('\n')

I got an error:

Traceback (most recent call last):
  File "parser.py", line 14, in <module>
    f.write(a.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 4: ordinal not in range(128)

I was trying to parse the dblp data which looks like this:

<?xml version="1.0"?>
<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>http://dx.doi.org/10.1007/BF03036466</ee>
</article>
<article mdate="2015-07-14" key="journals/acta/BozapalidisFR12">
<author>Symeon Bozapalidis</author>
<author>Zoltán Fülöp 0001</author>
<author>George Rahonis</author>
<title>Equational weighted tree transformations.</title>
<pages>29-52</pages>
<year>2012</year>
<volume>49</volume>
<journal>Acta Inf.</journal>
<number>1</number>
<ee>http://dx.doi.org/10.1007/s00236-011-0148-5</ee>
<url>db/journals/acta/acta49.html#BozapalidisFR12</url>
</article>
</dblp>

How can I resolve it?

Note that it is the f.write() line that throws the exception. The XML parsing is not the issue here; writing to the text file is what causes the problem. f.write(u'Zolt\xe1n') would give you the exact same error.
– Martijn Pieters, Commented Jul 3, 2016 at 9:47

Martijn Pieters · Accepted Answer · 2016-07-03 09:44:37Z


a.text is a Unicode object, but you are trying to write it to a plain Python 2 file object:

f.write(a.text)

On a plain Python 2 file object, the f.write() method works with byte strings (type str). Passing it a Unicode object triggers an implicit encode with the ASCII codec, and that encode raises your exception whenever the text can't be represented in ASCII.
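The implicit step can be reproduced without a file at all. The following is a minimal sketch, assuming the author name from the sample data; the helper name try_encode is hypothetical and exists only for this illustration:

```python
# Sketch: roughly what a plain Python 2 file object does when handed
# a unicode object, compared with an explicit UTF-8 encode.
name = u'Zolt\xe1n F\xfcl\xf6p'  # u'\xe1' is the character named in the traceback


def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


ascii_result = try_encode(name, 'ascii')  # fails: the name has non-ASCII characters
utf8_result = try_encode(name, 'utf8')    # succeeds: UTF-8 can encode any Unicode text

print(ascii_result is None)
print(utf8_result is not None)
```

The ASCII attempt fails at the same character the traceback reports, while the UTF-8 attempt returns a byte string that can be written safely.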

You'll either need to explicitly encode it with a codec that can encode your data, or use an io.open() file object that does the encoding for you.

Encoding explicitly to UTF-8 would work, for example:

f.write(a.text.encode('utf8'))

or use io.open() with an explicit encoding:

import io

# ...

f = io.open('workfile', 'w', encoding='utf8')

after which everything you pass to f.write() must be a Unicode object; prefix any literal strings with u:

for a in auth:
    f.write(a.text)
    f.write(u',')
f.write(u'\n')
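Putting the pieces together, here is a hedged sketch of the question's loop rewritten around io.open(). It parses a small inline sample rather than the real file, and the names SAMPLE, to_text, write_recent and the temporary output path are assumptions added for illustration. The to_text helper is there because ElementTree on Python 2 can return a plain str for ASCII-only text, which is consistent with the "must be unicode, not str" error reported for f.write(name) in the comments.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Inline sample standing in for the dblp file from the question.
SAMPLE = u"""<dblp>
<article>
<author>Symeon Bozapalidis</author>
<author>Zolt\xe1n F\xfcl\xf6p 0001</author>
<title>Equational weighted tree transformations.</title>
<year>2012</year>
</article>
<article>
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<year>1996</year>
</article>
</dblp>""".encode('utf8')


def to_text(value):
    """Return value as a Unicode string (hypothetical helper).

    Python 2's ElementTree may hand back a byte string for ASCII-only
    text; decode it so the io.open() file object accepts it.
    """
    if isinstance(value, bytes):
        return value.decode('ascii')
    return value


def write_recent(xml_bytes, path, after_year=2009):
    """Write title and authors of articles published after after_year."""
    root = ET.fromstring(xml_bytes)
    # io.open() encodes to UTF-8 on write, so only Unicode goes in.
    with io.open(path, 'w', encoding='utf8') as f:
        for article in root.findall('article'):
            if int(article.find('year').text) > after_year:
                f.write(to_text(article.find('title').text))
                for author in article.findall('author'):
                    f.write(to_text(author.text))
                    f.write(u',')
                f.write(u'\n')


out_path = os.path.join(tempfile.mkdtemp(), 'workfile')
write_recent(SAMPLE, out_path)

# Read the file back with the same encoding to check the result.
with io.open(out_path, encoding='utf8') as f:
    result = f.read()
```

The with block also closes the file, which the original script never did. The output format is left as in the question, so the title runs straight into the first author name.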

edited Jul 3, 2016 at 9:44

answered Jul 3, 2016 at 9:00

Martijn Pieters


12 Comments

SAMAHA Over a year ago

When I did this I got another error saying "for country in findall(article) Syntax error:invalid syntax"

Martijn Pieters Over a year ago

@SAMAHA: you didn't close the ) parentheses in a preceding line then.

SAMAHA Over a year ago

That error is solved. But I got another error saying "f.write(name) write() argument 1 must be unicode ,not str"

SAMAHA Over a year ago

My python version is 2.7

Martijn Pieters Over a year ago

@SAMAHA: right, all other writes must be unicode then too; use u',' and u'\n'. I'll update.
