Parse XML with multiple rowTags using Spark

Question

I want to parse XML using Spark, so I am using the Databricks spark-xml library. A sample XML is as follows:

<Transactions>
    <Transaction>
        <transid>1111</transid>
    </Transaction>
    <Transaction>
        <transid>2222</transid>
    </Transaction>
</Transactions>
<Payments>
    <Payment>
        <Id>123</Id>
    </Payment>
    <Payment>
        <Id>456</Id>
    </Payment>
</Payments>

code to parse:

val transNestedDF = sqlContext.read.format("com.databricks.spark.xml").option("rowTag","Transactions").load("trans_nested.xml")

transNestedDF.registerTempTable("TransNestedTbl")

sqlContext.sql("select Transaction[0].transid from TransNestedTbl").collect()

There is no root tag here, and I can't define multiple row tags. If I have to process both Transactions and Payments in a single read, using a single DataFrame as above, how can I achieve that?

Any help is appreciated.

I'm afraid I don't know Scala, but it can probably be done with Python or XPath/XQuery expressions.
– Jack Fleeting, Commented Sep 13, 2019 at 18:21
OK, can you show some sample Python code to handle the above scenario? I can try it.
– ashwini, Commented Sep 18, 2019 at 15:03

Jack Fleeting · Accepted Answer · 2019-09-20 16:32:56Z

2

Let's try this with lxml, a Python library that supports XPath:

If you don't have it installed, you need to run:

pip install lxml

then:

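import lxml.html

pay = """ [your code above]  """

# The HTML parser tolerates the missing root element and lowercases tag names,
# which is why the XPath expressions below are lowercased.
doc = lxml.html.fromstring(pay)
tid = doc.xpath('Transactions//transid'.lower())  # or '//Transactions//transid'.lower(), depending on the structure of the original doc
pid = doc.xpath('Payments//id'.lower())  # same comment

# Pair every transaction id with every payment id (a cross product).
final = ''
for i in tid:
    for p in pid:
        final = final + i.text + '|' + p.text + ' \n'

print(final)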

Output:


Sean Owen · Accepted Answer · 2020-08-31 18:44:23Z

1

You can't do it in one read if there is no tag around both of these. If there is a common parent tag, you can use that as rowTag and ignore the rest of what is parsed.
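
To illustrate the common-parent idea outside Spark, here is a minimal Python sketch using only the standard library. It assumes the fragment from the question is small enough to hold in memory and has no XML declaration; the wrapper tag name `root` and the helper `parse_with_synthetic_root` are made up for this example and are not part of spark-xml.

```python
import xml.etree.ElementTree as ET

# Sample fragment from the question: two top-level elements, no common root.
fragment = """
<Transactions>
    <Transaction><transid>1111</transid></Transaction>
    <Transaction><transid>2222</transid></Transaction>
</Transactions>
<Payments>
    <Payment><Id>123</Id></Payment>
    <Payment><Id>456</Id></Payment>
</Payments>
"""


def parse_with_synthetic_root(xml_fragment, root_tag="root"):
    """Wrap a rootless fragment in a made-up parent element and parse it.

    Hypothetical helper: "root" is an arbitrary name chosen for this sketch.
    """
    return ET.fromstring(f"<{root_tag}>{xml_fragment}</{root_tag}>")


root = parse_with_synthetic_root(fragment)

# ElementTree paths are case-sensitive, so the original tag names are used.
transaction_ids = [e.text for e in root.findall("./Transactions/Transaction/transid")]
payment_ids = [e.text for e in root.findall("./Payments/Payment/Id")]

print(transaction_ids)  # ['1111', '2222']
print(payment_ids)      # ['123', '456']
```

Once both sections sit under one parent, a single parse reaches both of them. Whether the same wrapping is practical for a large file read by Spark is a separate question that this sketch does not address.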

You can of course read them separately into two DataFrames. That works fine if you treat them separately. But you lose the association between transactions and payments, unless you can join on some ID.

But then I'd wonder why the XML structure doesn't have any common parent if these are associated.
