
I have a JSON file in which one of the columns is an XML string.

I tried extracting this field and writing it to a file in a first step, then reading that file in a second step. But each row carries its own XML header tag, so the resulting file is not valid XML.

How can I use the PySpark XML parser ('com.databricks.spark.xml') to read this string and parse out the values?

The following doesn't work:

tr = spark.read.json("my-file-path")
trans_xml = sqlContext.read.format('com.databricks.spark.xml') \
    .options(rowTag='book') \
    .load(tr.select("trans_xml"))

Thanks, Ram.


1 Answer


Try Hive XPath UDFs (LanguageManual XPathUDF):

>>> from pyspark.sql.functions import expr
>>> df.select(expr("xpath({0}, '{1}')".format(column_name, xpath_expression)))

or Python UDF:

>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> import xml.etree.ElementTree as ET
>>> schema = ... # Define schema
>>> def parse(s):
...     root = ET.fromstring(s)
...     result = ... # Select values
...     return result
>>> df.select(udf(parse, schema)(xml_column))

3 Comments

Thanks! I will try the UDF approach and update how it went. I don't think XPath will work for my case as the data is nested in multiple layers.
What would the schema look like for example?
Thank you. An example of the schema and value selection here would be immensely helpful.
