
I have a Spark dataframe with several columns, among them key and value; the value column holds an XML document as a string.

I would like to create a new dataframe where the XML content of the value column is parsed just as if I had read it with spark.read in xml format, and where the other columns such as key are appended to the new DF.

Is this possible?

I generally read XML files like this:

dfx = spark.read.load('books.xml', format='xml', rowTag='bks:books', valueTag="_ele_value")
dfx.schema

I am trying to get a similar dataframe output when reading the XML from the value column instead (the data comes from Kafka).

My XML has a deeply nested structure; below is an example books XML with two levels of nesting:

<?xml version="1.0" encoding="UTF-8"?>
<bks:books xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:bks="urn:books"
           xsi:schemaLocation="urn:books books.xsd" xmlns:ot="http://maven.apache.org/POM/4.0.0">
    <book id="b001">
        <author>Brandon Sanderson</author>
        <title>Mistborn</title>
        <genre sub='epic'>Fantasy</genre>
        <price>50</price>
        <pub_date>2006-12-17T09:30:47.0Z</pub_date>
        <review>
            <title>Wonderful</title>
            <content>I love the plot twist and the new magic</content>
        </review>
        <review>
            <title>Unbelievable twist</title>
            <content>The best book i ever read</content>
        </review>
        <sold>10</sold>
    </book>
    <book id="b002">
        <author>Brandon Sanderson</author>
        <title>Way of Kings</title>
        <genre sub='epic'>Fantasy</genre>
        <price>50</price>
        <pub_date>2006-12-17T09:30:47.0Z</pub_date>
        <sold>10</sold>
    </book>
</bks:books>
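With two book elements and repeated review children, a reader like spark-xml would typically surface book as an array of structs and review as a nested array inside it. The same nesting can be seen outside Spark by parsing an abbreviated version of the sample with Python's standard xml.etree module (a plain-Python illustration of the structure, not spark-xml itself):

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the books XML above
xml_string = """<bks:books xmlns:bks="urn:books">
    <book id="b001">
        <author>Brandon Sanderson</author>
        <title>Mistborn</title>
        <review><title>Wonderful</title></review>
        <review><title>Unbelievable twist</title></review>
    </book>
    <book id="b002">
        <author>Brandon Sanderson</author>
        <title>Way of Kings</title>
    </book>
</bks:books>"""

root = ET.fromstring(xml_string)  # root tag resolves to {urn:books}books
books = [
    {
        "id": book.get("id"),
        "title": book.findtext("title"),
        # Repeated <review> elements become a list -> array of structs in Spark
        "reviews": [{"title": r.findtext("title")} for r in book.findall("review")],
    }
    for book in root.findall("book")  # child elements carry no namespace prefix
]
print(books[0]["reviews"])  # two nested review records for b001
```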
  • This might be of some help stackoverflow.com/questions/40445816/… Commented Jan 7, 2020 at 13:57
  • the answer specified there doesn't seem to support my use case: in that example I would have to extract the fields I need, but I want the whole XML converted to a nested structure Commented Jan 7, 2020 at 18:54
  • Does this answer your question? Read XML in spark Commented Jan 8, 2020 at 16:11

1 Answer


It looks like this can be achieved using XmlReader (but only in Scala):

import com.databricks.spark.xml.XmlReader
import org.apache.spark.sql.types.StructType

val rdd = df.select("value").as[String].rdd  // RDD[String] of raw XML documents
var schema: StructType = null                // null => let XmlReader infer the schema

val new_df = new XmlReader().withRowTag("bks:books").withValueTag("_ele_value")
  .withSchema(schema).xmlRdd(spark, rdd)

The problem with this approach is that we lose the relationship between value and the other columns of the initial dataframe.

If anyone knows a way to link them, let me know :)
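One possible way to keep them linked (a sketch of the idea, not spark-xml's exact API): instead of extracting the value column into its own RDD, parse each row's XML individually so the key travels with the parsed record — in Spark this is what a per-row parser such as spark-xml's from_xml function (Scala) does. The idea, illustrated in plain Python with hypothetical Kafka rows:

```python
import xml.etree.ElementTree as ET

# Hypothetical (key, value) rows as they might arrive from Kafka
rows = [
    ("k1", '<book id="b001"><title>Mistborn</title></book>'),
    ("k2", '<book id="b002"><title>Way of Kings</title></book>'),
]

def parse_value(xml_string):
    """Parse one XML document into a nested record."""
    book = ET.fromstring(xml_string)
    return {"id": book.get("id"), "title": book.findtext("title")}

# Parsing per row keeps every key paired with its parsed value
parsed = [(key, parse_value(value)) for key, value in rows]
print(parsed[0])  # ('k1', {'id': 'b001', 'title': 'Mistborn'})
```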
