
I've searched the site and haven't found an answer that works for me. I'm trying to write XML to a file, and when I run the script from the terminal I get:

Traceback (most recent call last):
File "fetchWiki.py", line 145, in <module>
pageDictionary = qSQL(users_database)
File "fetchWiki.py", line 107, in qSQL
writeXML(listNS)
File "fetchWiki.py", line 139, in writeXML
f1.write(doc.toprettyxml(indent="\t", encoding="utf-8"))       
File "/usr/lib/python2.7/xml/dom/minidom.py", line 57, in toprettyxml
self.writexml(writer, "", indent, newl, encoding)
File "/usr/lib/python2.7/xml/dom/minidom.py", line 1751, in writexml
node.writexml(writer, indent, addindent, newl)
----//---- more lines in here ----//----
self.childNodes[0].writexml(writer, '', '', '')
File "/usr/lib/python2.7/xml/dom/minidom.py", line 1040, in writexml
_write_data(writer, "%s%s%s" % (indent, self.data, newl))
File "/usr/lib/python2.7/xml/dom/minidom.py", line 297, in _write_data
writer.write(data)
File "/usr/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1176: ordinal not
in range(128)

This is from the following code:

doc = Document()

base = doc.createElement('Wiki')
doc.appendChild(base)

for ns_dict in listNamespaces: 
    namespace = doc.createElement('Namespace')
    base.appendChild(namespace)
    namespace.setAttribute('NS', ns_dict)

    for title in listNamespaces[ns_dict]:

        page = doc.createElement('Page')
        try:
            title.encode('utf8')
            page.setAttribute('Title', title)
        except:
            newTitle = title.decode('latin1', 'ignore')
            newTitle.encode('utf8', 'ignore')
            page.setAttribute('Title', newTitle)

        namespace.appendChild(page)
        text = doc.createElement('Content')
        text_content = doc.createTextNode(listNamespaces[ns_dict][title])
        text.appendChild(text_content)
        page.appendChild(text)

    f1  = open('pageText.xml', 'w')
    f1.write(doc.toprettyxml(indent="\t", encoding="utf-8"))       
    f1.close()

The error occurs with or without the 'ignore' parameter to encode / decode. Adding

# -*- coding: utf-8 -*- 

does not help.

I created the Python script in Eclipse with PyDev, where it runs without problems, but when I run it from the terminal it fails with the error above.

Any help is much appreciated including links to answers I did not find.

Thanks.

1 Answer


You should not encode the strings you use for attributes. The minidom library handles the encoding for you when writing.

Your error is caused by mixing byte strings with unicode data: Python 2 implicitly tries to decode the byte strings as ASCII, and your encoded byte strings are not decodable as ASCII.
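
To see the failure in isolation: in Python 2, combining a byte string with a unicode object makes the interpreter decode the bytes as ASCII behind the scenes. The sketch below performs that decode explicitly, so it behaves the same on Python 2 and 3. The sample bytes are made up for illustration and are not taken from the question's data.

```python
# Hypothetical sample: UTF-8 bytes for a non-breaking space followed by ASCII text
raw = b'\xc2\xa0title'

def decode_as_ascii(data):
    """Do explicitly what Python 2 does implicitly when bytes meet unicode."""
    try:
        return data.decode('ascii')
    except UnicodeDecodeError as exc:
        # Same error class and codec name as in the question's traceback
        return 'failed: %s' % exc

result = decode_as_ascii(raw)
print(result)
```

Any byte above 0x7f triggers this, which is why only some titles break the script.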

Ideally, avoid ending up with some data encoded and some in unicode in the first place. If you cannot avoid handling mixed data, do this instead:

page = doc.createElement('Page')
if not isinstance(title, unicode):
    title = title.decode('latin1', 'ignore')
page.setAttribute('Title', title)
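
The same check can be wrapped in a small helper so that every value passes through one place before it reaches minidom. This is only a sketch: `to_text` is a hypothetical name, and `latin1` is carried over from the snippet above as an assumption about the source encoding. Because Latin-1 maps every byte to a character, the decode cannot fail, so the `'ignore'` handler is not needed. The check uses `bytes` rather than `unicode` so the sketch also runs on Python 3.

```python
from xml.dom.minidom import Document

def to_text(value, encoding='latin1'):
    """Return value as text, decoding byte strings with the assumed encoding."""
    if isinstance(value, bytes):   # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value

doc = Document()
page = doc.createElement('Page')
# b'M\xfcller' is a made-up Latin-1 encoded title for illustration
page.setAttribute('Title', to_text(b'M\xfcller'))
doc.appendChild(page)
title = page.getAttribute('Title')
```

The text content passed to `doc.createTextNode()` would need the same treatment if it can also arrive as byte strings.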

Note that you don't need to use doc.toprettyxml(); you can instruct doc.writexml() to indent your XML for you as well:

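import codecs

# addindent is the per-level indentation; indent only sets the starting indentation
with codecs.open('pageText.xml', 'w', encoding='utf8') as f1:
    doc.writexml(f1, addindent='\t', newl='\n', encoding='utf-8')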
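
As an end-to-end sketch of this approach, the following builds a tiny document from unicode values and writes it through an encoding-aware file object. The element names mirror the question; the sample title and text, and the temporary directory, are placeholders for illustration. The per-level indentation is passed as `addindent`, which is how `toprettyxml()` forwards its own `indent` argument in the question's traceback.

```python
import codecs
import os
import tempfile
from xml.dom.minidom import Document

doc = Document()
base = doc.createElement('Wiki')
doc.appendChild(base)

page = doc.createElement('Page')
page.setAttribute('Title', u'M\xfcller')   # unicode, not encoded bytes
base.appendChild(page)

content = doc.createElement('Content')
content.appendChild(doc.createTextNode(u'Gr\xfc\xdfe'))
page.appendChild(content)

path = os.path.join(tempfile.mkdtemp(), 'pageText.xml')
# The file object does the encoding; minidom only ever sees text
with codecs.open(path, 'w', encoding='utf-8') as f1:
    doc.writexml(f1, addindent='\t', newl='\n', encoding='utf-8')

# Read the file back to confirm the non-ASCII characters survived
with codecs.open(path, 'r', encoding='utf-8') as f1:
    written = f1.read()
```

The `encoding` argument to `writexml()` only sets the encoding named in the XML declaration; the actual byte encoding is done by the file object.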

4 Comments

I'll give it a try, thanks. Edit: I tried updating the code with your suggestion, but the UnicodeDecodeError is still occurring on the same character.
Can you reduce your code to the simplest version that still triggers the error? What data triggers the error?
I know it's the title attribute that is causing this. There is a German name with an umlauted U. The other puzzling thing is that this code runs fine in Eclipse, just not from the terminal.
Eclipse changes the default encoding in the terminal; if this is caused by printing to the terminal, see wiki.python.org/moin/PrintFails.
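
One way to compare the two environments is to print the encodings Python reports in each. A small diagnostic sketch, using only standard-library calls:

```python
import locale
import sys

# Run this once from Eclipse and once from the terminal, then compare
encodings = {
    'default': sys.getdefaultencoding(),
    'stdout': getattr(sys.stdout, 'encoding', None),
    'filesystem': sys.getfilesystemencoding(),
    'locale': locale.getpreferredencoding(),
}
for name in sorted(encodings):
    print('%-10s %s' % (name, encodings[name]))
```

A difference in the `default` or `stdout` rows between the two runs would be consistent with the behaviour described in the question.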
