How to fetch column names from an XML schema in Spark Scala?

Question

I have a requirement to process a table column that contains XML. I am trying to convert the XML column into multiple individual columns based on its tags, using the spark-xml library. I followed the question "Parsing XML columns from PySpark Dataframe using UDF", but everything there is done in PySpark; I need the equivalent in Scala.

I have got as far as creating the parsed column. I now need to explode the data and turn the XML tags into column names. I need the Scala equivalent of the following lines from that question:

df2 = parsed.select(*parsed.columns[:-1],F.explode(F.col('parsed').getItem('visitor')))    

new_col_names = [s.split(':')[0] for s in payloadSchema['visitor'].simpleString().split('<')[-1].strip('>>').split(',')]

Adding XML

<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>

Expected output:

+---+--------------------+----+----+----+
| id|            visitors|_age| _id|_sex|
+---+--------------------+----+----+----+
|  1|<?xml version="1....|  68|9615|   F|
|  1|<?xml version="1....|  34|1882|   M|
|  1|<?xml version="1....|  23|5987|   M|
+---+--------------------+----+----+----+

Check this post, it might help you: stackoverflow.com/questions/62379533/…
– s.polam, commented May 28, 2021 at 9:51

s.polam · Accepted Answer · 2021-05-28 10:38:37Z

Use org.json to convert the XML to JSON.

Sample XML Data

val xmlData = """<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>"""

UDF Function

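import org.apache.spark.sql.functions.udf
import org.json.XML

// Convert an XML string into its JSON string representation.
val parse = udf((value: String) => XML.toJSONObject(value).toString)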

Schema for the converted JSON data

import org.apache.spark.sql.types._

val schema = DataType.fromJson("""{"type":"struct","fields":[{"name":"visitors","type":{"type":"struct","fields":[{"name":"visitor","type":{"type":"array","elementType":{"type":"struct","fields":[{"name":"age","type":"long","nullable":true,"metadata":{}},{"name":"id","type":"long","nullable":true,"metadata":{}},{"name":"sex","type":"string","nullable":true,"metadata":{}}]},"containsNull":true},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}""").asInstanceOf[StructType]

scala> schema.printTreeString
root
 |-- visitors: struct (nullable = true)
 |    |-- visitor: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- age: long (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- sex: string (nullable = true)
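Since the question asks for column names, here is a minimal sketch of reading them from a StructType shaped like the one printed above, without any string parsing. The object name `VisitorColumns` and the method `visitorFieldNames` are hypothetical; the schema is rebuilt programmatically only to keep the example short, and the code assumes the `visitors` -> `visitor` -> array-of-struct nesting shown in the tree.

import org.apache.spark.sql.types._

object VisitorColumns {
  // Same shape as the schema printed above (assumed nesting: visitors -> visitor -> array of struct).
  val schema: StructType = new StructType()
    .add("visitors", new StructType()
      .add("visitor", ArrayType(new StructType()
        .add("age", LongType)
        .add("id", LongType)
        .add("sex", StringType))))

  // Walk down to the array's element struct and return its field names.
  // Returns Nil if the schema does not have the expected shape.
  def visitorFieldNames(schema: StructType): List[String] =
    schema("visitors").dataType match {
      case outer: StructType =>
        outer("visitor").dataType match {
          case ArrayType(element: StructType, _) => element.fieldNames.toList
          case _                                 => Nil
        }
      case _ => Nil
    }
}

Calling `VisitorColumns.visitorFieldNames(VisitorColumns.schema)` yields the names age, id and sex. Note that `schema("visitors")` throws if the field is absent, so a schema generated at runtime may need an extra existence check.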

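import org.apache.spark.sql.functions.{explode_outer, from_json}
// In spark-shell, spark.implicits._ is already imported, which enables the $"..." syntax.
// df is assumed to have the columns id and xml.

df
  .withColumn("parsed_xml", from_json(parse($"xml"), schema)) // XML -> JSON -> struct
  .select(
    $"id",
    $"xml",
    explode_outer($"parsed_xml.visitors.visitor").as("visitors") // one row per <visitor>
  )
  .select($"id", $"xml", $"visitors.*") // promote the struct fields to columns
  .show(false)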

Final Output

+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+----+---+
|id |xml                                                                                                                                                                               |age|id  |sex|
+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+----+---+
|1  |<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>|68 |9615|F  |
|1  |<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>|34 |1882|M  |
|1  |<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>|23 |5987|M  |
+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+----+---+
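The second Python line in the question derives the column names by splitting the schema's simpleString. A rough Scala sketch of the same string manipulation could look like the following. `SchemaNames` and `fieldNames` are hypothetical names, and the input string is an assumed example of what simpleString returns for an array of flat structs. Like the Python original, it only handles flat structs: nested structs, or types that contain commas such as decimal(10,2), would be split incorrectly.

object SchemaNames {
  // Extract the field names of the innermost struct from a simpleString
  // such as "array<struct<_age:bigint,_id:bigint,_sex:string>>".
  def fieldNames(simpleString: String): List[String] = {
    // Keep only the text after the last '<', e.g. "_age:bigint,_id:bigint,_sex:string>>".
    val inner = simpleString.split('<').last
    // Drop the closing '>' characters.
    val fields = inner.takeWhile(_ != '>')
    // Each entry is "name:type"; keep the part before the colon.
    fields.split(',').map(_.split(':').head.trim).toList
  }
}

// Assumed example input; in Spark this string would come from the field's dataType.simpleString.
val newColNames = SchemaNames.fieldNames("array<struct<_age:bigint,_id:bigint,_sex:string>>")
// newColNames == List("_age", "_id", "_sex")

Reading the names from the StructType directly is usually more robust than parsing the string, but this version stays closest to the PySpark code being ported.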

Thanks for this, but the schema of the XML is not fixed. Some tags may be missing in real-time data, which is why I need to generate the schema at runtime, as is done in the example I provided.

Get the XML schema in XSD format; then it is easy to generate the schema from it.
