I have a JSON file in which one of the columns is an XML string.
I tried extracting this field and writing to a file in the first step and reading the file in the next step. But each row has an XML header tag. So the resulting file is not a valid XML file.
How can I use the PySpark XML parser ('com.databricks.spark.xml') to read this string and parse out the values?
The following doesn't work:
tr = spark.read.json( "my-file-path")
trans_xml = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='book').load(tr.select("trans_xml"))
Thanks, Ram.