Getting unicode error while parsing xml file

Question

I have a directory of XML files, where each file has the form:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1">
            <word>Brand</word>
            <lemma>brand</lemma>
            <CharacterOffsetBegin>0</CharacterOffsetBegin>
            <CharacterOffsetEnd>5</CharacterOffsetEnd>
            <POS>NN</POS>
            <NER>O</NER>
          </token>
          <token id="2">
            <word>Blogs</word>
            <lemma>blog</lemma>
            <CharacterOffsetBegin>6</CharacterOffsetBegin>
            <CharacterOffsetEnd>11</CharacterOffsetEnd>
            <POS>NNS</POS>
            <NER>O</NER>
          </token>
          <token id="3">
            <word>Capture</word>
            <lemma>capture</lemma>
            <CharacterOffsetBegin>12</CharacterOffsetBegin>
            <CharacterOffsetEnd>19</CharacterOffsetEnd>
            <POS>VBP</POS>
            <NER>O</NER>
          </token>

I am parsing each XML file, storing the text of every <word> element, and then finding the 100 most frequent words.

This is my code:

import os
from collections import Counter

from bs4 import BeautifulSoup

def find_top_words(xml_directory):
    file_list = []
    temp_list = []
    file_list2 = []
    for dir_file in os.listdir(xml_directory):
        dir_file_path = os.path.join(xml_directory, dir_file)
        if os.path.isfile(dir_file_path):
            # the with block closes the file on exit
            with open(dir_file_path) as f:
                page = f.read()
                soup = BeautifulSoup(page, "xml")
                for word in soup.find_all('word'):
                    file_list.append(str(word.string.strip()))
    # lowercase every word before counting
    for element in file_list:
        s = element.lower()
        file_list2.append(s)
    counts = Counter(file_list2)
    # words sorted by descending frequency
    for w in sorted(counts, key=counts.get, reverse=True):
        temp_list.append(w)
    return temp_list[:100]

But, I'm getting this error:

File "prac31.py", line 898, in main
    v = find_top_words('/home/xyz/xml_dir')
  File "prac31.py", line 43, in find_top_words
    file_list.append(str(word.string.strip()))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 2: ordinal not in range(128)

What does this mean, and how do I fix it?

Ludovic Viaud · Accepted Answer · 2014-11-16 01:30:25Z

1

Don't use BeautifulSoup, it's totally deprecated. Why not use the standard library? If you want something more powerful for XML handling, there is lxml (but I am pretty sure that you don't need it).

It will solve your problem easily.
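A minimal sketch of the standard-library route, assuming `xml.etree.ElementTree` is the module meant here. The sample document and the helper name `words_from_xml` are made up for illustration, and the sketch is written to run on Python 3. It only covers the parsing step and keeps the words as text rather than passing them through str():

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample that mirrors the layout in the question (not the asker's data).
SAMPLE_XML = u"""<?xml version="1.0" encoding="UTF-8"?>
<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>Brand</word></token>
<token id="2"><word>Na\u00efve</word></token>
<token id="3"><word>Blogs</word></token>
</tokens></sentence></sentences></document></root>
""".encode("utf-8")


def words_from_xml(source):
    """Return the lowercased text of every <word> element in source.

    source may be a file path or a binary file object.
    """
    tree = ET.parse(source)
    return [el.text.strip().lower() for el in tree.iter("word") if el.text]


# BytesIO stands in for an open XML file from the directory.
words = words_from_xml(io.BytesIO(SAMPLE_XML))
```

Passing a path from os.listdir() instead of the BytesIO object should work the same way, since ET.parse accepts either.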

Edit: forget the previous answer, it was bad -_- Your problem is str(my_string). In Python 2, if my_string is a unicode string containing non-ASCII characters, str() implicitly tries to encode it as ASCII and fails. Use the method encode('utf-8') instead.
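To make the failure concrete, here is a small sketch. The token `word` is a hypothetical example that happens to contain u'\xef', the character named in the traceback. It calls encode('ascii') explicitly, which is what Python 2's str() does implicitly with the default (ASCII) encoding, so the same snippet also runs on Python 3:

```python
word = u"na\u00efve"  # hypothetical token containing u'\xef'

# Encoding to ASCII fails because u'\xef' is outside the range 0-127.
try:
    word.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly to UTF-8 succeeds and yields a byte string.
utf8_bytes = word.encode("utf-8")
```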

edited Nov 16, 2014 at 1:30

answered Nov 16, 2014 at 1:01

Ludovic Viaud


7 Comments

Martha Pears Over a year ago

Could you tell me how to use lxml for that?

Chris Over a year ago

What do you mean that BeautifulSoup is "totally deprecated"? It hasn't had a release in a while, but that's not the same thing. There are good reasons for using something like lxml for XML, but I'm not sure that deprecation is one of them.

Ludovic Viaud Over a year ago

But again, I recommend using the standard library for this. If your XML is well-formed, it's the simplest solution out there.

Ludovic Viaud Over a year ago

@Chris: you're right, it's more of an opinion-based statement, and it was probably too extreme.

Martha Pears Over a year ago

@LudovicViaud: I don't really understand that. :( In my code, if I remove the str from str(word.string.strip()), I get output like [u'learning', u'charged', u'h.i.v.', u'maintain', u'unusual'...]. This is in unicode form. Is there any way to make this work and get plain words rather than unicode strings?

Nima Soroush · Accepted Answer · 2014-11-16 02:02:44Z

0

str() encodes with the ASCII codec, and because word.string.strip() returns a non-ASCII character somewhere in your XML file, you get this error. The solution is to use:

file_list.append(word.string.strip().encode('utf-8'))

and to get the text back from the stored value you need to do something like:

for item in file_list:
    print item.decode('utf-8')
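The encode/decode pair above can be sketched as a round trip. The word list is hypothetical, and the snippet avoids print so that it runs on both Python 2 and Python 3. It also shows why the two forms are not interchangeable: the encoded form is measured in bytes, the decoded form in characters.

```python
words = [u"brand", u"na\u00efve"]  # hypothetical unicode tokens

# Encode each word to UTF-8 bytes for storage, as in the answer above.
stored = [w.encode("utf-8") for w in words]

# Decode back to text before displaying or comparing as characters.
restored = [b.decode("utf-8") for b in stored]

# The non-ASCII character takes two bytes in UTF-8 but is one character.
byte_len = len(stored[1])
char_len = len(restored[1])
```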

Hope it helps.

answered Nov 16, 2014 at 2:02

Nima Soroush

2 Comments

Martha Pears Over a year ago

I was wondering why decode is used. Using only encode already gives me the words. :o

Nima Soroush Over a year ago

decode "decodes obj using the codec registered for encoding". Take a look here (docs.python.org/2/library/codecs.html#codecs.Codec.decode)

Ned Batchelder · Accepted Answer · 2014-11-16 02:11:20Z

0

In this line of code:

file_list.append(str(word.string.strip()))

Why are you using str? The data is Unicode, and you can append unicode strings to a list. If you need a bytestring, then you can use word.string.strip().encode('utf8') instead.
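As a rough illustration of keeping the data as Unicode all the way through, the counting step from the question works on unicode strings without any str() call. The token list here is hypothetical, and Counter.most_common replaces the manual sort in the question's code:

```python
from collections import Counter

# Hypothetical tokens standing in for the stripped <word> strings.
tokens = [u"Brand", u"Na\u00efve", u"brand", u"na\u00efve", u"brand", u"Blogs"]

# Lowercase and count the unicode strings directly.
counts = Counter(t.lower() for t in tokens)

# most_common(n) returns (word, count) pairs, highest count first.
top = [w for w, _ in counts.most_common(2)]
```

In the question's function, most_common(100) would take the place of the sorted() loop and the final slice.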

answered Nov 16, 2014 at 2:11

Ned Batchelder
