I need to process a table column that contains XML. I am trying to split the XML column into multiple individual columns based on its tags, using the spark-xml library. I used the question "Parsing XML columns from PySpark Dataframe using UDF" as a reference, but everything there is done in PySpark, and I need the equivalent in Scala.

I have reached the point where I can create the parsed column. Now I need to explode the data and turn the XML tags into column names. I need the Scala equivalent of the following lines from that question:

df2 = parsed.select(*parsed.columns[:-1],F.explode(F.col('parsed').getItem('visitor')))    

new_col_names = [s.split(':')[0] for s in payloadSchema['visitor'].simpleString().split('<')[-1].strip('>>').split(',')]

Adding XML

<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>

Expected output:

+---+--------------------+----+----+----+
| id|            visitors|_age| _id|_sex|
+---+--------------------+----+----+----+
|  1|<?xml version="1....|  68|9615|   F|
|  1|<?xml version="1....|  34|1882|   M|
|  1|<?xml version="1....|  23|5987|   M|
+---+--------------------+----+----+----+
3 Comments
  • Can you add your sample XML and the expected output? Commented May 28, 2021 at 9:28
  • Check this post, it might help you: stackoverflow.com/questions/62379533/… Commented May 28, 2021 at 9:51
  • I have added the XML. Commented May 28, 2021 at 9:57

1 Answer

Use org.json to convert the XML to JSON, then parse the JSON string with from_json.

Sample XML Data

val xmlData = """<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>"""

UDF Function

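import org.apache.spark.sql.functions.udf
import org.json.XML

// Converts an XML string to a JSON string; a null input yields null.
val parse = udf((value: String) =>
  if (value == null) null else XML.toJSONObject(value).toString
)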

Schema for converted json data.

import org.apache.spark.sql.types._

val schema = DataType.fromJson("""{"type":"struct","fields":[{"name":"visitors","type":{"type":"struct","fields":[{"name":"visitor","type":{"type":"array","elementType":{"type":"struct","fields":[{"name":"age","type":"long","nullable":true,"metadata":{}},{"name":"id","type":"long","nullable":true,"metadata":{}},{"name":"sex","type":"string","nullable":true,"metadata":{}}]},"containsNull":true},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}""").asInstanceOf[StructType]
scala> schema.printTreeString
root
 |-- visitors: struct (nullable = true)
 |    |-- visitor: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- age: long (nullable = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- sex: string (nullable = true)
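
If hard-coding the schema JSON is not practical, one option is to let Spark infer it from the converted JSON strings. The following is only a sketch: inferJsonSchema is a hypothetical helper name, and it assumes the XML column holds well-formed documents and that org.json is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import org.json.XML

// Hypothetical helper (not part of the answer above): infers the schema of
// the JSON produced by org.json instead of hard-coding it.
def inferJsonSchema(spark: SparkSession, df: DataFrame, xmlCol: String): StructType = {
  import spark.implicits._ // supplies the Encoder[String] used by .as[String] and .map

  val jsonStrings = df
    .select(col(xmlCol))
    .where(col(xmlCol).isNotNull)               // skip rows without XML
    .as[String]
    .map(xml => XML.toJSONObject(xml).toString) // same conversion as the UDF

  // DataFrameReader.json(Dataset[String]) scans the strings and infers a schema.
  spark.read.json(jsonStrings).schema
}
```

The result could then be passed to from_json in place of the hand-written schema, for example from_json(parse($"xml"), inferJsonSchema(spark, df, "xml")). One caveat: org.json only emits a JSON array when an element is repeated, so if every row has a single visitor the inferred type would be a struct rather than an array. Inference also costs an extra pass over the data.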
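
Parse, explode and flatten

import org.apache.spark.sql.functions.{explode_outer, from_json}
import spark.implicits._ // enables the $"..." column syntax

// Parse the JSON string, emit one row per visitor, then flatten the struct.
// Note: the result has two columns named id (the row id and the visitor attribute).
df
  .withColumn("parsed_xml", from_json(parse($"xml"), schema))
  .select($"id", $"xml", explode_outer($"parsed_xml.visitors.visitor").as("visitors"))
  .select($"id", $"xml", $"visitors.*")
  .show(false)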

Final Output

+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+----+---+
|id |xml                                                                                                                                                                               |age|id  |sex|
+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+----+---+
|1  |<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>|68 |9615|F  |
|1  |<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>|34 |1882|M  |
|1  |<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="9615" age="68" sex="F" /> <visitor id="1882" age="34" sex="M" /> <visitor id="5987" age="23" sex="M" /> </visitors>|23 |5987|M  |
+---+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---+----+---+

2 Comments

Thanks for this, but the schema of the XML is not fixed. Some tags might be missing at run time, which is why I need to generate the schema at runtime, as is done in the example I provided.
Get the XML schema in XSD format; it is then easy to generate the schema from it.
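
For the runtime-schema case raised in the comment above, a rough Scala counterpart of the two PySpark lines from the question might look like the sketch below. It assumes the spark-xml package (com.databricks.spark.xml) is on the classpath and that the repeated child element is named visitor; explodeVisitors and the xmlCol parameter are hypothetical names used only for illustration.

```scala
import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical helper: infers the XML schema at runtime, parses the column,
// and returns one row per <visitor> with its attributes as columns.
def explodeVisitors(spark: SparkSession, df: DataFrame, xmlCol: String): DataFrame = {
  import spark.implicits._ // supplies the Encoder[String] used by .as[String]

  // Infer the schema from the data instead of hard-coding it.
  val payloadSchema = schema_of_xml(df.select(xmlCol).as[String])
  val parsed = df.withColumn("parsed", from_xml(col(xmlCol), payloadSchema))

  // Counterpart of parsed.columns[:-1]: every column except the trailing "parsed".
  val keep = parsed.columns.init.map(col)

  // Counterpart of F.explode(F.col('parsed').getItem('visitor')).
  val explodedCols = keep :+ explode(col("parsed").getItem("visitor")).as("visitor")
  val exploded = parsed.select(explodedCols: _*)

  // Expanding the struct with ".*" yields the attribute names as column
  // names, so the simpleString parsing from the PySpark version is not needed.
  val flatCols = keep :+ col("visitor.*")
  exploded.select(flatCols: _*)
}
```

With the sample data this would be called as explodeVisitors(spark, df, "visitors"). spark-xml prefixes attribute names with an underscore by default, which matches the _age, _id and _sex columns in the expected output. If every row happens to contain a single visitor, the inferred type would be a struct rather than an array and explode would fail, so that case needs separate handling.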
