
I have a JSON file in which one of the columns is an XML string.

I tried extracting this field and writing it to a file in a first step, then reading that file in a second step. But each row carries its own XML header tag, so the resulting file is not valid XML.

How can I use the PySpark XML parser ('com.databricks.spark.xml') to read this string and parse out the values?

The following doesn't work:

tr = spark.read.json("my-file-path")
trans_xml = sqlContext.read.format('com.databricks.spark.xml') \
    .options(rowTag='book') \
    .load(tr.select("trans_xml"))

Thanks, Ram.


1 Answer


Try Hive XPath UDFs (LanguageManual XPathUDF):

>>> from pyspark.sql.functions import expr
>>> df.select(expr("xpath({0}, '{1}')".format(column_name, xpath_expression)))

or Python UDF:

>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> import xml.etree.ElementTree as ET
>>> schema = ... # Define schema
>>> def parse(s):
...     root = ET.fromstring(s)
...     result = ... # Select values
...     return result
>>> df.select(udf(parse, schema)(xml_column))

3 Comments

Thanks! I will try the UDF approach and update how it went. I don't think XPath will work for my case as the data is nested in multiple layers.
What would the schema look like for example?
Thank you. An example of the schema and value selection here would be immensely helpful.
