I have a directory of XML files, where each file is of the form:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1">
            <word>Brand</word>
            <lemma>brand</lemma>
            <CharacterOffsetBegin>0</CharacterOffsetBegin>
            <CharacterOffsetEnd>5</CharacterOffsetEnd>
            <POS>NN</POS>
            <NER>O</NER>
          </token>
          <token id="2">
            <word>Blogs</word>
            <lemma>blog</lemma>
            <CharacterOffsetBegin>6</CharacterOffsetBegin>
            <CharacterOffsetEnd>11</CharacterOffsetEnd>
            <POS>NNS</POS>
            <NER>O</NER>
          </token>
          <token id="3">
            <word>Capture</word>
            <lemma>capture</lemma>
            <CharacterOffsetBegin>12</CharacterOffsetBegin>
            <CharacterOffsetEnd>19</CharacterOffsetEnd>
            <POS>VBP</POS>
            <NER>O</NER>
          </token>

I am parsing each XML file, storing the text between the <word> tags, and then finding the 100 most frequent words.

This is what I am doing:

def find_top_words(xml_directory):
    file_list = []
    temp_list=[]
    file_list2=[]
    for dir_file in os.listdir(xml_directory):
        dir_file_path = os.path.join(xml_directory, dir_file)
        if os.path.isfile(dir_file_path):
            with open(dir_file_path) as f:
                page = f.read()
                soup = BeautifulSoup(page,"xml")
                for word in soup.find_all('word'):
                    file_list.append(str(word.string.strip()))
            f.close()
    for element in file_list:
        s = element.lower()
        file_list2.append(s)
    counts = Counter(file_list2)
    for w in sorted(counts, key=counts.get, reverse=True):
        temp_list.append(w)
    return temp_list[:100]

But I'm getting this error:

File "prac31.py", line 898, in main
    v = find_top_words('/home/xyz/xml_dir')
  File "prac31.py", line 43, in find_top_words
    file_list.append(str(word.string.strip()))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 2: ordinal not in range(128)

What does this mean, and how do I fix it?

3 Answers

1

Don't use BeautifulSoup; it's totally deprecated. Why not use the standard library? If you want something more complex for XML handling, there is lxml (but I am pretty sure that you don't).

It will solve your problem easily.
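
For reference, here is a minimal sketch of what the standard-library route could look like, using xml.etree.ElementTree. The helper name collect_words and the inline sample document are made up for illustration; the sketch assumes well-formed XML with <word> elements laid out as in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the structure shown in the question.
SAMPLE = u"""<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>Brand</word></token>
<token id="2"><word>Blogs</word></token>
<token id="3"><word>na\u00efve</word></token>
</tokens></sentence></sentences></document></root>"""


def collect_words(xml_bytes):
    """Return the lowercased text of every <word> element (illustrative helper)."""
    root = ET.fromstring(xml_bytes)
    # iter('word') walks the whole tree and yields each <word> element.
    return [el.text.strip().lower() for el in root.iter('word') if el.text]


words = collect_words(SAMPLE.encode('utf-8'))
```

For files on disk, ET.parse(path).getroot() gives the same kind of root element, so the list comprehension would stay unchanged.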

Edit: forget the previous answer, it was bad -_- Your problem is str(my_string) in Python 2 when my_string contains non-ASCII characters: calling str() on a unicode string in Python 2 implicitly tries to encode it as ASCII. Use the method encode('utf-8') instead.
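
To make that concrete, here is a small sketch of the failure. In Python 2, str(u) on a unicode string behaves like u.encode('ascii'); the explicit call below raises the same UnicodeEncodeError on any Python version. The sample word is made up; only the character u'\xef' at position 2 comes from the traceback in the question.

```python
word = u'na\xefve'  # hypothetical word; u'\xef' sits at position 2

try:
    # This is what Python 2's str(word) attempts implicitly.
    word.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

# Encoding explicitly as UTF-8 succeeds and yields a byte string.
encoded = word.encode('utf-8')
```

Here error holds the exception that the ASCII codec raised, and encoded holds the UTF-8 bytes of the same word.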

7 Comments

Could you tell me how to use lxml for that?
What do you mean that BeautifulSoup is "totally deprecated"? It hasn't had a release in a while, but that's not the same thing. There are good reasons for using something like lxml for XML, but I'm not sure that deprecation is one of them.
But again, I recommend you use the standard library for this; if your XML is well-formed, it's the simplest solution out there.
@Chris: you're right, it's more a statement based on opinion, and it was probably too extreme.
@LudovicViaud: I don't quite understand that. :( In my code, if I remove the str from str(word.string.strip()), I get output like [u'learning', u'charged', u'h.i.v.', u'maintain', u'unusual'...]. This is in unicode form. Is there any way to make this work and get plain words rather than unicode strings?
0

In Python 2, str() encodes a unicode string with the ASCII codec. Somewhere in your XML files, word.string.strip() returns a non-ASCII character, which is why you get this error. The solution is to use:

file_list.append(word.string.strip().encode('utf-8'))

and to turn the stored byte strings back into unicode text, you need to do something like:

for item in file_list:
    print item.decode('utf-8')
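
The round trip can be sketched as follows: encode('utf-8') turns unicode text into bytes, and decode('utf-8') turns those bytes back into the original text. The sample words are invented for illustration.

```python
words = [u'caf\xe9', u'blogs']  # hypothetical sample

# Text -> bytes, e.g. before writing to a file or a socket.
encoded = [w.encode('utf-8') for w in words]

# Bytes -> text, recovering exactly what we started with.
decoded = [b.decode('utf-8') for b in encoded]
```

Decoding is only needed if the rest of the program should work with text again; the encoded list on its own already holds valid UTF-8 bytes.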

Hope it helps.

2 Comments

I was wondering why decode is used? Using only encode already gives me the words. :o
decode() decodes the object using the codec registered for the encoding. Take a look here (docs.python.org/2/library/codecs.html#codecs.Codec.decode)
0

In this line of code:

file_list.append(str(word.string.strip()))

Why are you using str? The data is Unicode, and you can append unicode strings to a list. If you need a bytestring, then you can use word.string.strip().encode('utf8') instead.
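
A sketch of that approach applied to the counting step: the words stay unicode throughout, and Counter.most_common stands in for the manual sort. The helper name top_words and the sample list are assumptions for illustration, not part of the original code.

```python
from collections import Counter


def top_words(words, n=100):
    """Return the n most frequent words, case-insensitively (illustrative helper)."""
    counts = Counter(w.strip().lower() for w in words)
    # most_common(n) returns (word, count) pairs, highest count first.
    return [word for word, _ in counts.most_common(n)]


sample = [u'Blogs', u'na\xefve', u'blogs', u'Na\xefve', u'blogs', u'Brand']
result = top_words(sample, n=2)
```

Encoding to UTF-8 would then only be needed at the point where the words are written out as bytes.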
