LXML issue parsing XML schema in Python 3

Question

I'm attempting to use the XRDTools library to convert Panalytical XRDML files into a more database-friendly format, such as a pandas dataframe.

The XRDTools library is described here: https://github.com/paruch-group/xrdtools. It imports the XRDML file into a Python dictionary. I'm totally new to LXML, so I apologize if this is a simple question.

I've used Anaconda to create Python 2.7 and 3.6 environments specifically to work with the XRDTools package. I'd like to run it in Python 3.6.

In Python 2.7, this code runs smoothly:

import xrdtools
xrd = xrdtools.read_xrdml('filename.xrdml')

Output is a dict:

{u'2Theta': array([63.        , 63.00334225, 63.00668449, ..., 67.99331551,
        67.99665775, 68.        ]),
 u'Lambda': 1.540598,
 u'Omega': array([31.        , 31.00200535, 31.0040107 , ..., 33.9959893 ,
        33.99799465, 34.        ]), ...

I can then use the dictionary like any other Python object.

In Python 3.6, that same code generates this error message:

Traceback (most recent call last):

  File "...\AppData\Local\Continuum\Anaconda2\envs\py36xrd\lib\site-packages\IPython\core\interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-3-b6f5409b8bf9>", line 1, in <module>
    xrd = xrdtools.read_xrdml('filename.xrdml')

  File "...\XRDTools\xrdtools\xrdtools\io.py", line 297, in read_xrdml
    valid = validate_xrdml_schema(filename)

  File "...\XRDTools\xrdtools\xrdtools\io.py", line 43, in validate_xrdml_schema
    xmlschema_doc = etree.parse(f)

  File "src\lxml\etree.pyx", line 3444, in lxml.etree.parse (src\lxml\etree.c:83171)

  File "src\lxml\parser.pxi", line 1855, in lxml.etree._parseDocument (src\lxml\etree.c:121011)

  File "src\lxml\parser.pxi", line 1875, in lxml.etree._parseFilelikeDocument (src\lxml\etree.c:121294)

  File "src\lxml\parser.pxi", line 1770, in lxml.etree._parseDocFromFilelike (src\lxml\etree.c:120078)

  File "src\lxml\parser.pxi", line 1185, in lxml.etree._BaseParser._parseDocFromFilelike (src\lxml\etree.c:114806)

  File "src\lxml\parser.pxi", line 598, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\etree.c:107724)

  File "src\lxml\parser.pxi", line 709, in lxml.etree._handleParseResult (src\lxml\etree.c:109433)

  File "src\lxml\parser.pxi", line 638, in lxml.etree._raiseParseError (src\lxml\etree.c:108287)

  File "...\XRDTools\xrdtools\xrdtools\data\schemas\XRDMeasurement15.xsd", line 1
    <?xml version="1.0" encoding="UTF-8"?>
                                          ^
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Digging into io.py, there is this function:

def validate_xrdml_schema(filename):
    """Validate the xml schema of a given file.

    Parameters
    ----------
    filename : str
        The Filename of the `.xrdml` file to test.

    Returns
    -------
    float or None
        Returns the version number as float or None if
        the file was not matching any provided xml schema.

    """
    schemas = [(1.5, 'data/schemas/XRDMeasurement15.xsd'),
               (1.4, 'data/schemas/XRDMeasurement14.xsd'),
               (1.3, 'data/schemas/XRDMeasurement13.xsd'),
               (1.2, 'data/schemas/XRDMeasurement12.xsd'),
               (1.1, 'data/schemas/XRDMeasurement11.xsd'),
               (1.0, 'data/schemas/XRDMeasurement10.xsd'),
               ]
    schemas = [(v, os.path.join(package_path, schema)) for v, schema in schemas]

    with open(filename, 'r') as f:
        data_xml = etree.parse(f)

    for version, schema in schemas:
        with open(schema, 'r') as f:
            xmlschema_doc = etree.parse(f)
            xmlschema = etree.XMLSchema(xmlschema_doc)

        valid = xmlschema.validate(data_xml)
        if valid:
            return version
    return None

From what I've read, xmlschema_doc = etree.parse(f) is causing the issues. If I change that line to etree.parse(filename), it'll run without an error, but I'm not sure if that matters at all. I also haven't been able to apply that fix to anything other than a small self-contained cell in a Jupyter notebook.

What causes the error? Is there a way to fix it for Python 3? What's the best way to implement that fix?

Would love to get this resolved. TIA!

The most closely related question I could find: Python 3.4 lxml.etree: Start tag expected, '<' not found, line 1, column 1

Brief check: are you sure you are sending the correct stream to lxml? Should it be bytes or str?
– mcepl, Commented Mar 6, 2018 at 13:03
Hi, I do not know; what is the "correct stream"? It works in Python 2, but not Python 3, using the same file(s). As I understand it, LXML doesn't read the file itself; one must open the file (as f) and pass that to the parser. Is that what you mean?
– Evan, Commented Mar 6, 2018 at 15:15

mcepl · Accepted Answer · 2018-03-06 18:12:54Z

Try:

import io

with io.open(filename, 'r', encoding='utf8') as f:
    data_xml = etree.parse(f)

(io.open is used because the same call works in both Python 2 and Python 3.)
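
To illustrate the idea behind this fix, here is a minimal sketch; it is not the XRDTools code. It assumes the underlying problem is that Python 3's open(..., 'r') decodes the file with a platform-dependent default encoding, which the thread does not confirm outright. The tiny schema, the helper names parse_text_mode and parse_binary_mode, and the use of the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (so the snippet runs without extra packages) are all assumptions made for illustration. It is written for Python 3.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# Hypothetical miniature schema; the "é" puts a non-ASCII byte sequence in the file.
SCHEMA_TEXT = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">\n'
    '  <xs:element name="caf\u00e9" type="xs:string"/>\n'
    '</xs:schema>\n'
)


def parse_text_mode(path):
    """Decode explicitly as UTF-8 instead of relying on the platform default."""
    with io.open(path, 'r', encoding='utf8') as f:
        return ET.parse(f)


def parse_binary_mode(path):
    """Hand the parser raw bytes so it can apply the XML declaration itself."""
    with io.open(path, 'rb') as f:
        return ET.parse(f)


# Write the sample schema to a temporary file as UTF-8 bytes.
fd, path = tempfile.mkstemp(suffix='.xsd')
with os.fdopen(fd, 'wb') as f:
    f.write(SCHEMA_TEXT.encode('utf8'))

text_root = parse_text_mode(path).getroot()
binary_root = parse_binary_mode(path).getroot()
os.remove(path)

# Both approaches should recover the same attribute value.
XS = '{http://www.w3.org/2001/XMLSchema}'
text_name = text_root.find(XS + 'element').get('name')
binary_name = binary_root.find(XS + 'element').get('name')
```

Since the traceback in the question points at the open() call for the schema file, the same change would presumably be needed there as well as for the data file. Opening in binary mode, or passing the path directly as the question reports working, leaves the decoding to the parser.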


4 Comments

Evan Over a year ago

This was really close! I just used open, not io.open. With encoding='utf8', it runs identically in Python 2 and Python 3. Can you edit your answer and I'll mark it accepted? Thanks!

mcepl Over a year ago

What? open in Python 2 doesn't have an encoding parameter. Otherwise, take it as a general principle: when something works in py2k and not in py3k, it is most often because str in py2k is a mess that accepts any garbage.

Evan Over a year ago

Sorry, I wasn't clear. Your solution is correct - io.open() works in both Python 2 and Python 3; the fix has been implemented in the dev branch of XRDTools: github.com/paruch-group/xrdtools/issues/3

henrycjc Over a year ago

Adding the encoding='utf-8' solved my issue with opening XSD files with lxml
