lxml parsing with python: how to with objectify

Question

I am trying to read the XML behind an SPSS file, and I would like to move from etree to objectify.

How can I convert the function below to return an objectify object? I would like to do this because an objectify XML object would be easier for me (as a newbie) to work with, as it is more Pythonic.

def get_etree(path_file):

    from lxml import etree

    with open(path_file, 'r+') as f:
        xml_text = f.read()     
    recovering_parser = etree.XMLParser(recover=True)    
    xml = etree.parse(StringIO(xml_text), parser=recovering_parser)

    return xml

My failed attempt:

def get_etree(path_file):

    from lxml import etree, objectify

    with open(path_file, 'r+') as f:
        xml_text = objectify.fromstring(xml)   

    return xml

But I get this error:

lxml.etree.XMLSyntaxError: xmlns:mdm: 'http://www.spss.com/mr/dm/metadatamodel/Arc 3/2000-02-04' is not a valid URI

I guess it complains because http://www.spss.com/mr/dm/metadatamodel/Arc 3/2000-02-04 is not a valid URI, because of the space inside.
– alecxe, Commented Dec 3, 2014 at 15:59
Don't ever (!!!) use f.read() to read an XML file. You may easily break the XML that way. Pass the file to etree directly and let etree do the file handling, because etree observes the XML's encoding declaration, whereas f.read() does not.
– Tomalak, Commented Dec 3, 2014 at 16:05

Tomalak · Accepted Answer · 2014-12-03 16:24:22Z


The first, biggest mistake is to read a file into a string and feed that string to an XML parser.

Python will decode the file using whatever your default file encoding is (unless you pass an explicit encoding to open()), and that step will very likely break anything other than plain ASCII files.

XML files come in many encodings. You cannot predict them, and you really shouldn't make assumptions about them. XML files solve that problem with the XML declaration:

<?xml version="1.0" encoding="Windows-1252"?>

An XML parser will read that bit of information and configure itself correctly before reading the rest of the file. Make use of that facility. Never use open() and read() for XML files.
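To make the point concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree so it runs without lxml installed; the sample document, the file name, and the choice of ISO-8859-1 are made up for the demonstration. Parsing by path lets the parser honour the declaration, while reading the same file as text with a guessed encoding (UTF-8 here) fails.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document, stored on disk as ISO-8859-1 bytes.
xml_text = '<?xml version="1.0" encoding="ISO-8859-1"?>\n<word>caf\u00e9</word>'
xml_bytes = xml_text.encode("iso-8859-1")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "wb") as f:
        f.write(xml_bytes)

    # Parsing by path: the parser reads the declaration and decodes correctly.
    parsed_text = ET.parse(path).getroot().text

    # Reading as text with a guessed encoding breaks on the first non-ASCII byte.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        read_failed = False
    except UnicodeDecodeError:
        read_failed = True

print(repr(parsed_text))
print("f.read() with a guessed encoding failed:", read_failed)
```

A wrong guess does not always raise an error. With some encoding pairs the text is silently garbled instead, which is harder to notice.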

Luckily lxml makes it very easy:

from lxml import etree, objectify

def get_etree(path_file):
    return etree.parse(path_file, parser=etree.XMLParser(recover=True))

def get_objectify(path_file):
    return objectify.parse(path_file)

and

path = r"/path/to/your.xml"
xml1 = get_etree(path)
xml2 = get_objectify(path)

print(xml1)   # -> <lxml.etree._ElementTree object at 0x02A7B918>
print(xml2)   # -> <lxml.etree._ElementTree object at 0x02A7B878>
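Both functions return an ElementTree wrapper, so with objectify you would typically call getroot() to reach the attribute-style API the question asks about. The following is only a sketch: the survey document, its tag names, and the file name are invented for illustration and have nothing to do with the real SPSS metadata format.

```python
import os
import tempfile
from lxml import objectify

# Hypothetical sample document; tag names are invented for this sketch.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<survey>'
    b'<title>Demo</title>'
    b'<question id="q1"><label>Age</label></question>'
    b'<question id="q2"><label>Region</label></question>'
    b'</survey>'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "survey.xml")
    with open(path, "wb") as f:
        f.write(xml_bytes)

    # Pass the path, not the file contents, so lxml handles the encoding.
    root = objectify.parse(path).getroot()

# Child elements are reachable as attributes of their parent.
title = root.title.text

# Iterating over root.question walks all sibling <question> elements.
labels = [q.label.text for q in root.question]
ids = [q.get("id") for q in root.question]

print(title, labels, ids)
```

Note that get_objectify above uses the default strict parser, unlike get_etree, so a file that is not well-formed will raise an error instead of being silently repaired.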

P.S.: Think hard about whether you really, positively must use a recovering parser. An XML file is a data structure. If it is broken (syntactically invalid, incomplete, wrongly decoded, you name it), would you really want to trust the (by definition undefined) result of an attempt to read it anyway, or would you much rather reject it and display an error message?

I would do the latter. Using a recovering parser may cause nasty run-time errors later.
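As a rough illustration of the difference, the sketch below feeds the same broken input to a strict parser and to a recovering one. The input and the helper name parse_strict are made up for this example, and the bytes literal only stands in for a file; with a real file you would pass the path to etree.parse as shown above.

```python
from lxml import etree

# Hypothetical broken input: the <b> element is never closed.
broken = b"<root><a>1</a><b>2</root>"


def parse_strict(data):
    """Return the root element, or None if the input is not well-formed."""
    try:
        return etree.fromstring(data)
    except etree.XMLSyntaxError as exc:
        # Reject the document and report why, instead of guessing.
        print("Rejected:", exc)
        return None


strict_root = parse_strict(broken)

# The recovering parser returns a tree anyway; what it contains is the
# parser's best guess, not something the XML specification defines.
recovering_parser = etree.XMLParser(recover=True)
recovered_root = etree.fromstring(broken, parser=recovering_parser)
print(etree.tostring(recovered_root))
```

The strict path fails loudly at the point where the file is read. The recovering path defers the problem to whatever code later walks the guessed tree.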



5 Comments

Boosted_d16 Over a year ago

Thanks for the detailed explanation. I will do what you said, learn more about xml and etree before moving to objectify.

Tomalak Over a year ago

@mah65 You don't have an XML file that starts with b'...'. You have used open() on an XML file. Don't do that. Use lxml.etree.parse('path/to/file'); lxml will take care of the encoding automatically (the same goes for the built-in ElementTree).

mah65 Over a year ago

I understand that the file should not start with b'....'. But, I have been given such file. I can manually remove them from the file, but I don't want to do that. Anyway, my solution works for the moment.

Tomalak Over a year ago

@mah65 " But, I have been given such file." - When you open it with a text editor, does it start with b'...? If not, you don't have such a file.

mah65 Over a year ago

Yes, of course. I have opened it and checked it. It has it. That's not a big problem. Solved.
