
I was running the following code in Python:

import xml.etree.ElementTree as ET
tree = ET.parse('dplp_11.xml')
root = tree.getroot()
f = open('workfile', 'w')
for country in root.findall('article'):
    rank = country.find('year').text
    name = country.find('title').text

    if(int(rank)>2009):
        f.write(name)
        auth = country.findall('author')
        for a in auth:
            #print str(a)
            f.write(a.text)
            f.write(',')
        f.write('\n')

I got an error:

Traceback (most recent call last):
  File "parser.py", line 14, in <module>
    f.write(a.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 4: ordinal not in range(128)

I was trying to parse the dblp data which looks like this:

<?xml version="1.0"?>
<dblp>
<article mdate="2011-01-11" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>http://dx.doi.org/10.1007/BF03036466</ee>
</article>
<article mdate="2015-07-14" key="journals/acta/BozapalidisFR12">
<author>Symeon Bozapalidis</author>
<author>Zoltán Fülöp 0001</author>
<author>George Rahonis</author>
<title>Equational weighted tree transformations.</title>
<pages>29-52</pages>
<year>2012</year>
<volume>49</volume>
<journal>Acta Inf.</journal>
<number>1</number>
<ee>http://dx.doi.org/10.1007/s00236-011-0148-5</ee>
<url>db/journals/acta/acta49.html#BozapalidisFR12</url>
</article>
</dblp>

How can I resolve it?

  • Note that it is the f.write() line that throws the exception. The XML parsing is not the issue here; writing to the text file is what causes the problem. f.write(u'Zolt\xe1n') would give you the exact same error. Commented Jul 3, 2016 at 9:47

1 Answer

a.text is a Unicode object, but you are trying to write it to a plain Python 2 file object:

f.write(a.text)

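On Python 2, the f.write() method of a plain file object accepts only byte strings (type str). Passing a Unicode object triggers an implicit encode with the ASCII codec, and that encode raises your exception whenever the text contains a character outside the ASCII range, such as the á in Zoltán.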
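As a minimal sketch of what that implicit step does, the snippet below performs the ASCII encode explicitly and catches the failure, then encodes the same text as UTF-8. The helper name try_ascii is made up for illustration, and the sample string is the one from the comment on the question.

```python
# -*- coding: utf-8 -*-
# Illustration only: reproduce the implicit ASCII encode by hand.
name = u'Zolt\xe1n'  # "Zoltán", with a non-ASCII character at index 4


def try_ascii(text):
    """Return the ASCII-encoded bytes, or the exception if encoding fails."""
    try:
        return text.encode('ascii')
    except UnicodeEncodeError as exc:
        return exc


result = try_ascii(name)          # fails: U+00E1 is outside the ASCII range
utf8_bytes = name.encode('utf8')  # succeeds: UTF-8 can represent every code point
print(repr(result))
print(repr(utf8_bytes))
```

The failing position reported by the exception is the index of the first character that ASCII cannot represent, which is why the traceback in the question mentions position 4.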

You'll need to either encode the text explicitly with a codec that can represent your data, or use an io.open() file object that does the encoding for you.

Encoding explicitly to UTF-8 would work, for example:

f.write(a.text.encode('utf8'))

or use io.open() with an explicit encoding:

import io

# ...

f = io.open('workfile', 'w', encoding='utf8')

after which every argument passed to f.write() must be a Unicode object; prefix any string literals with u:

for a in auth:
    f.write(a.text)
    f.write(u',')
f.write(u'\n')
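Putting the io.open() variant together, here is a self-contained sketch rather than a drop-in replacement for your script. The names SAMPLE, to_text and write_recent_articles are hypothetical, the XML is a trimmed copy of the excerpt in the question, and the output goes to a temporary directory instead of your real workfile. The to_text helper is there on the assumption that ElementTree on Python 2 can return a plain str for ASCII-only text (which would explain the "must be unicode, not str" error on f.write(name) mentioned in the comments); on Python 3 it is a no-op.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Trimmed sample based on the dblp excerpt in the question, as UTF-8 bytes.
SAMPLE = u"""<dblp>
<article><author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<year>1996</year></article>
<article><author>Symeon Bozapalidis</author>
<author>Zolt\xe1n F\xfcl\xf6p 0001</author>
<author>George Rahonis</author>
<title>Equational weighted tree transformations.</title>
<year>2012</year></article>
</dblp>""".encode('utf8')


def to_text(value):
    """Hypothetical helper: make sure the value is a Unicode string."""
    if isinstance(value, bytes):  # only true for ASCII-only str on Python 2
        return value.decode('ascii')
    return value


def write_recent_articles(xml_bytes, out_path, after_year=2009):
    """Write title and authors of articles published after after_year."""
    root = ET.fromstring(xml_bytes)
    # io.open() encodes every write to UTF-8; the with block closes the file.
    with io.open(out_path, 'w', encoding='utf8') as f:
        for article in root.findall('article'):
            if int(article.find('year').text) > after_year:
                f.write(to_text(article.find('title').text))
                for author in article.findall('author'):
                    f.write(to_text(author.text))
                    f.write(u',')
                f.write(u'\n')


out_path = os.path.join(tempfile.mkdtemp(), 'workfile')
write_recent_articles(SAMPLE, out_path)

# Read the file back with the same encoding to check what was written.
with io.open(out_path, encoding='utf8') as f:
    written = f.read()
print(repr(written))
```

The output format deliberately mirrors the loop in the question, including the trailing comma after the last author. With ET.parse('dplp_11.xml') in place of ET.fromstring() the same loop should apply to the real file.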

12 Comments

When I did this, I got another error: "for country in findall(article) Syntax error: invalid syntax".
@SAMAHA: then you didn't close a ) parenthesis on a preceding line.
That error is solved, but now I get another one: "f.write(name) write() argument 1 must be unicode, not str".
My Python version is 2.7.
@SAMAHA: right, all other writes must be unicode then too; use u',' and u'\n'. I'll update.
