
I am trying to read the XML behind an SPSS file, and I would like to move from etree to objectify.

How can I convert this function below to return an objectify object? I would like to do this because objectify xml object would be easier for me (as a newbie) to work with as it is more pythonic.

def get_etree(path_file):

    from lxml import etree

    with open(path_file, 'r+') as f:
        xml_text = f.read()     
    recovering_parser = etree.XMLParser(recover=True)    
    xml = etree.parse(StringIO(xml_text), parser=recovering_parser)

    return xml

my failed attempt:

def get_etree(path_file):

    from lxml import etree, objectify

    with open(path_file, 'r+') as f:
        xml_text = objectify.fromstring(xml)   

    return xml

but I get this error:

lxml.etree.XMLSyntaxError: xmlns:mdm: 'http://www.spss.com/mr/dm/metadatamodel/Arc 3/2000-02-04' is not a valid URI
  • I guess it complains because http://www.spss.com/mr/dm/metadatamodel/Arc 3/2000-02-04 is not a valid URL because of the space inside. Commented Dec 3, 2014 at 15:59
  • See stackoverflow.com/questions/18692965/…. Commented Dec 3, 2014 at 16:00
  • Don't ever (!!!) use f.read() to read an XML file. You may easily break the XML that way. Pass the file to etree directly and let etree do the file handling, because etree observes the XML's encoding declaration, whereas f.read() does not. Commented Dec 3, 2014 at 16:05
  • Tomalak - cool, didn't know that. Commented Dec 3, 2014 at 16:07
  • Do I have to install any package to use "objectify"? Commented Dec 3, 2014 at 16:09

1 Answer

The first, biggest mistake is to read a file into a string and feed that string to an XML parser.

Python will decode the file using whatever your default file encoding is (unless you specify the encoding when you open the file), and that step will very likely break anything other than plain ASCII files.

XML files come in many encodings. You cannot predict them, and you really shouldn't make assumptions about them. XML files solve that problem with the XML declaration.

<?xml version="1.0" encoding="Windows-1252"?>

An XML parser will read that bit of information and configure itself correctly before reading the rest of the file. Make use of that facility. Never use open() and read() for XML files.
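To illustrate the point, here is a minimal sketch using the built-in ElementTree, which honors the encoding declaration in the same way. The sample document and the temporary file are made up for the demonstration; only the principle carries over to lxml.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document, stored as Windows-1252 bytes, matching
# what its own XML declaration announces.
xml_bytes = (
    u'<?xml version="1.0" encoding="Windows-1252"?>'
    u'<root>caf\u00e9</root>'
).encode("cp1252")

with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(xml_bytes)
    path = tmp.name

try:
    # Hand the path to the parser: it reads the declaration and decodes
    # the bytes itself.
    parsed_text = ET.parse(path).getroot().text

    # Decoding by hand with a guessed encoding (UTF-8 here) fails on the
    # very same bytes.
    try:
        xml_bytes.decode("utf-8")
        naive_decode_ok = True
    except UnicodeDecodeError:
        naive_decode_ok = False
finally:
    os.remove(path)

print(repr(parsed_text))
print(naive_decode_ok)
```

A wrong guess does not always fail this loudly. Some encodings decode any byte sequence without complaint and silently produce the wrong characters instead.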

Luckily lxml makes it very easy:

from lxml import etree, objectify

def get_etree(path_file):
    return etree.parse(path_file, parser=etree.XMLParser(recover=True))

def get_objectify(path_file):
    return objectify.parse(path_file)

and

path = r"/path/to/your.xml"
xml1 = get_etree(path)
xml2 = get_objectify(path)

print(xml1)   # -> <lxml.etree._ElementTree object at 0x02A7B918>
print(xml2)   # -> <lxml.etree._ElementTree object at 0x02A7B878>
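Both functions return an `_ElementTree`, so the objectify-style attribute access happens on the root element, not on the tree itself. A small sketch of that step follows, assuming a made-up document (the `survey`, `label` and `question` names are hypothetical and are not taken from an SPSS file), parsed from an in-memory byte stream so the example is self-contained:

```python
from io import BytesIO
from lxml import objectify

# Hypothetical document; the element names are invented for illustration.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<survey><label>Demo</label><question id="q1">Age?</question></survey>'
)

# objectify.parse() accepts a path or a file-like object and returns a tree.
tree = objectify.parse(BytesIO(xml_bytes))
root = tree.getroot()  # the objectified root element

# Child elements are reachable as plain Python attributes.
label_text = root.label.text
question_id = root.question.get("id")

print(label_text)
print(question_id)
```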

P.S.: Think hard if you really, positively must use a recovering parser. An XML file is a data structure. If it is broken (syntactically invalid, incomplete, wrongly decoded, you name it), would you really want to trust the (by definition undefined) result of an attempt to read it anyway or would you much rather reject it and display an error message?

I would do the latter. Using a recovering parser may cause nasty run-time errors later.
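As a rough sketch of the "reject it and display an error message" approach, the snippet below parses a deliberately broken, made-up document with lxml's default strict parser and reports the failure. The `parse_strict` helper is a hypothetical name, not part of lxml. The recovering parser is run on the same input only for contrast.

```python
from lxml import etree

# Hypothetical broken input: <item> is never closed.
broken = b"<root><item>value</root>"

def parse_strict(data):
    """Return (root, None) on success, or (None, message) if the XML is malformed."""
    try:
        return etree.fromstring(data), None
    except etree.XMLSyntaxError as exc:
        return None, str(exc)

root, error = parse_strict(broken)
if error is not None:
    print("Rejected: " + error)

# For contrast: the recovering parser returns a tree anyway, with the
# structure it guessed for the broken part.
recovered = etree.fromstring(broken, parser=etree.XMLParser(recover=True))
print(etree.tostring(recovered))
```

The strict path gives the caller one well-defined place to handle bad input, whereas the recovered tree has to be treated with suspicion everywhere it is used afterwards.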


5 Comments

Thanks for the detailed explanation. I will do what you said, learn more about xml and etree before moving to objectify.
@mah65 You don't have an XML file that starts with b'...'. You have used open() on an XML file. Don't do that. Use lxml.etree.parse('path/to/file'), lxml will take care of the encoding automatically (the same goes for the built-in ElementTree).
I understand that the file should not start with b'....'. But I have been given such a file. I can manually remove them from the file, but I don't want to do that. Anyway, my solution works for the moment.
@mah65 " But, I have been given such file." - When you open it with a text editor, does it start with b'...? If not, you don't have such a file.
Yes, of course. I have opened it and checked it. It has it. That's not a big problem. Solved.
