I have a requirement where I need to process a table column containing XML, converting the XML column into multiple individual columns based on its tags. I am using the spark-xml library for this. I took reference from the question parsing XML columns from PySpark Dataframe using UDF, but there everything is done in PySpark; I need an equivalent in Scala.
I have got to the point where I can build the parsed column. Now I need to explode the data and turn the XML tags into column names. Specifically, I need a Scala equivalent of the following lines from that question:
```python
df2 = parsed.select(*parsed.columns[:-1], F.explode(F.col('parsed').getItem('visitor')))
new_col_names = [s.split(':')[0] for s in payloadSchema['visitor'].simpleString().split('<')[-1].strip('>>').split(',')]
```
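For reference, here is a rough sketch of the Scala translation I am after (not verified; it assumes `parsed` is the DataFrame with the parsed struct column and `payloadSchema` is the spark-xml schema, as in the linked question):

```scala
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Keep every column except the last, and explode the array of visitor structs
val df2 = parsed.select(
  parsed.columns.dropRight(1).map(col) :+ explode(col("parsed.visitor")): _*
)

// Read the field names straight off the schema, rather than string-munging
// the output of simpleString() as the Python version does
val newColNames = payloadSchema("visitor").dataType
  .asInstanceOf[ArrayType].elementType
  .asInstanceOf[StructType].fieldNames
```

Reading `fieldNames` from the `StructType` should be more robust than parsing `simpleString()`, if the schema shape is as I assume.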
Adding the XML:

```xml
<?xml version="1.0" encoding="utf-8"?>
<visitors>
  <visitor id="9615" age="68" sex="F" />
  <visitor id="1882" age="34" sex="M" />
  <visitor id="5987" age="23" sex="M" />
</visitors>
```
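For context, the parsed column is produced roughly like this (a sketch, assuming spark-xml 0.8+, which exposes `schema_of_xml` and `from_xml` for parsing an XML column; `xmlString` stands in for the XML above):

```scala
import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq((1, xmlString)).toDF("id", "visitors")

// Infer the spark-xml schema from the XML strings, then parse each row
// into a struct column named "parsed"
val payloadSchema = schema_of_xml(df.select("visitors").as[String])
val parsed = df.withColumn("parsed", from_xml(col("visitors"), payloadSchema))
```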
Desired output:

```
+---+--------------------+----+----+----+
| id|            visitors|_age| _id|_sex|
+---+--------------------+----+----+----+
|  1|<?xml version="1....|  68|9615|   F|
|  1|<?xml version="1....|  34|1882|   M|
|  1|<?xml version="1....|  23|5987|   M|
+---+--------------------+----+----+----+
```