I'm attempting to use the XRDTools library to convert Panalytical XRDML files into a more database-friendly format, such as a pandas dataframe.

The XRDTools library is described here: https://github.com/paruch-group/xrdtools. It imports the XRDML file into a Python dictionary. I'm totally new to LXML, so I apologize if this is a simple question.

I've used Anaconda to create Python 2.7 and 3.6 environments specifically to work with the XRDTools package. I'd like to run it in Python 3.6.

In Python 2.7, this code runs smoothly:

import xrdtools
xrd = xrdtools.read_xrdml('filename.xrdml')

Output is a dict:

{u'2Theta': array([63.        , 63.00334225, 63.00668449, ..., 67.99331551,
        67.99665775, 68.        ]),
 u'Lambda': 1.540598,
 u'Omega': array([31.        , 31.00200535, 31.0040107 , ..., 33.9959893 ,
        33.99799465, 34.        ]), ...

I can then use the dictionary like any other Python object.

In Python 3.6, that same code generates this error message:

Traceback (most recent call last):

  File "...\AppData\Local\Continuum\Anaconda2\envs\py36xrd\lib\site-packages\IPython\core\interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-3-b6f5409b8bf9>", line 1, in <module>
    xrd = xrdtools.read_xrdml('filename.xrdml')

  File "...\XRDTools\xrdtools\xrdtools\io.py", line 297, in read_xrdml
    valid = validate_xrdml_schema(filename)

  File "...\XRDTools\xrdtools\xrdtools\io.py", line 43, in validate_xrdml_schema
    xmlschema_doc = etree.parse(f)

  File "src\lxml\etree.pyx", line 3444, in lxml.etree.parse (src\lxml\etree.c:83171)

  File "src\lxml\parser.pxi", line 1855, in lxml.etree._parseDocument (src\lxml\etree.c:121011)

  File "src\lxml\parser.pxi", line 1875, in lxml.etree._parseFilelikeDocument (src\lxml\etree.c:121294)

  File "src\lxml\parser.pxi", line 1770, in lxml.etree._parseDocFromFilelike (src\lxml\etree.c:120078)

  File "src\lxml\parser.pxi", line 1185, in lxml.etree._BaseParser._parseDocFromFilelike (src\lxml\etree.c:114806)

  File "src\lxml\parser.pxi", line 598, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\etree.c:107724)

  File "src\lxml\parser.pxi", line 709, in lxml.etree._handleParseResult (src\lxml\etree.c:109433)

  File "src\lxml\parser.pxi", line 638, in lxml.etree._raiseParseError (src\lxml\etree.c:108287)

  File "...\XRDTools\xrdtools\xrdtools\data\schemas\XRDMeasurement15.xsd", line 1
    <?xml version="1.0" encoding="UTF-8"?>
                                          ^
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Digging into io.py, there is this function:

def validate_xrdml_schema(filename):
    """Validate the xml schema of a given file.

    Parameters
    ----------
    filename : str
        The Filename of the `.xrdml` file to test.

    Returns
    -------
    float or None
        Returns the version number as float or None if
        the file was not matching any provided xml schema.

    """
    schemas = [(1.5, 'data/schemas/XRDMeasurement15.xsd'),
               (1.4, 'data/schemas/XRDMeasurement14.xsd'),
               (1.3, 'data/schemas/XRDMeasurement13.xsd'),
               (1.2, 'data/schemas/XRDMeasurement12.xsd'),
               (1.1, 'data/schemas/XRDMeasurement11.xsd'),
               (1.0, 'data/schemas/XRDMeasurement10.xsd'),
               ]
    schemas = [(v, os.path.join(package_path, schema)) for v, schema in schemas]

    with open(filename, 'r') as f:
        data_xml = etree.parse(f)

    for version, schema in schemas:
        with open(schema, 'r') as f:
            xmlschema_doc = etree.parse(f)
            xmlschema = etree.XMLSchema(xmlschema_doc)

        valid = xmlschema.validate(data_xml)
        if valid:
            return version
    return None

From what I've read, the line xmlschema_doc = etree.parse(f) seems to be causing the problem. If I change that line to etree.parse(filename), it runs without an error, but I'm not sure whether that change matters. I also haven't been able to apply that fix anywhere other than a small self-contained cell in a Jupyter notebook.

What causes the error? Is there a way to fix it for Python 3? What's the best way to implement that fix?

Would love to get this resolved. TIA!

The most closely related problem I could find: Python 3.4 lxml.etree: Start tag expected, '<' not found, line 1, column 1

  • Quick check: are you sure you are passing the correct stream to lxml? Should it be bytes or str? Commented Mar 6, 2018 at 13:03
  • Hi - I do not know; what is the "correct stream"? It works in Python 2, but not Python 3, using the same file(s). As I understand it, LXML doesn't read the file itself; one must open the file (as f) and pass that to the parser. Is that what you mean? Commented Mar 6, 2018 at 15:15

1 Answer

Try:

with io.open(filename, 'r', encoding='utf8') as f:
    data_xml = etree.parse(f)

(io.open is used because the same call works in both Python 2 and Python 3.)
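As a rough illustration of why the encoding can matter, here is a minimal sketch using only the standard library. It assumes the schema file starts with a UTF-8 byte order mark (BOM); the question does not confirm this, and the file written below is a hypothetical stand-in, not one of the XRDTools schemas. It only shows what a text-mode reader hands on when the same bytes are decoded in different ways, which is the kind of difference that could put something other than '<' at line 1, column 1.

```python
import codecs
import io
import os
import tempfile

# Hypothetical stand-in for a schema file: a UTF-8 BOM, then an XML declaration.
# Whether the real XSD files carry a BOM is an assumption, not a known fact.
payload = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?>\n<root/>\n'

with tempfile.NamedTemporaryFile(suffix=".xsd", delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # A Windows code page such as cp1252 turns the three BOM bytes into
    # three ordinary characters that precede the '<'.
    with io.open(path, "r", encoding="cp1252") as f:
        first_cp1252 = f.read(3)

    # Plain UTF-8 decodes the BOM into the single character U+FEFF.
    with io.open(path, "r", encoding="utf-8") as f:
        first_utf8 = f.read(1)

    # The 'utf-8-sig' codec strips the BOM, so the text starts with '<'.
    with io.open(path, "r", encoding="utf-8-sig") as f:
        first_sig = f.read(1)

    # Binary mode does no decoding at all and returns the raw bytes.
    with io.open(path, "rb") as f:
        first_bytes = f.read(3)
finally:
    os.remove(path)

print(repr(first_cp1252), repr(first_utf8), repr(first_sig), repr(first_bytes))
```

In Python 3, open without an encoding argument falls back to a platform-dependent default, which is why the same code can behave differently from Python 2 on the same files. Passing the path itself to etree.parse, or a file opened in binary mode, is another option, since the parser then works from the raw bytes.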

4 Comments

This was really close! I just used open, not io.open. With encoding='utf8', it runs identically in Python 2 and Python 3. Can you edit your answer and I'll mark it accepted? Thanks!
What? open in Python 2 doesn't have an encoding parameter. Otherwise, take it as a general principle: when something works in py2k and not in py3k, it is most often because str in py2k is a mess that accepts any garbage.
Sorry, I wasn't clear. Your solution is correct - io.open() works in both Python 2 and Python 3; the fix has been implemented in the dev branch of XRDTools: github.com/paruch-group/xrdtools/issues/3
Adding encoding='utf-8' solved my issue with opening XSD files with lxml.
