1

I want to parser xml using spark so I am using spark databricks lib. sample xml is as follows:

<Transactions>
        <Transaction>                
                <transid>1111</transid>                
        </Transaction>  
        <Transaction>                
                <transid>2222</transid>                
        </Transaction>      
</Transactions>
<Payments>
    <Payment>
        <Id>123</Id>
    </Payment>
    <Payment>
        <Id>456</Id>
    </Payment>
</Payments>

code to parse:

val transNestedDF = sqlContext.read.format("com.databricks.spark.xml").option("rowTag","Transactions").load("trans_nested.xml")

transNestedDF.registerTempTable("TransNestedTbl")

sqlContext.sql("select Transaction[0].transid from TransNestedTbl").collect()

Here I don't have any root tag also I can't define multiple row tags so if I have to process both transactions and payments in single read using above single dataframe then how to achieve that?

need help.

8
  • Can you use python libraries to handle this? Commented Sep 13, 2019 at 17:53
  • if its tough in scala then python i can try Commented Sep 13, 2019 at 17:58
  • I'm afraid I don't know scala, but it can probably be done with python or xpath/xquery expressions. Commented Sep 13, 2019 at 18:21
  • ok can u show some sample code for python to handle above scenario .. I can try Commented Sep 18, 2019 at 15:03
  • What exactly is your expected output? Commented Sep 19, 2019 at 23:59

2 Answers 2

2

Let's try this with lxml, a python library, which itself uses xpath:

If you don't have it installed, you need to:

pip intall lxml

then:

import lxml.html

pay = """ [your code above]  """

doc = lxml.html.fromstring(pay)
tid =doc.xpath('Transactions//transid'.lower()) #or ('//Transactions//transid'.lower()) depending on the structure of the original doc
pid = doc.xpath('Payments//id'.lower()) #same comment

final = ''
for i in tid:
    for p in pid:            
        final = final+i.text+'|'+p.text+' \n'

print(final)

Output:

1111|123 
1111|456 
2222|123 
2222|456 
Sign up to request clarification or add additional context in comments.

Comments

1

You can't do it in one read, if there is no tag around both of these. If there is any common parent tag, you can use that as rowTag and ignore the rest that is parsed.

You can of course read them separately into two DataFrames. That works fine if you treat them separately. But you lose the association between transactions and payments, unless you can join on some ID.

But then I'd wonder why the XML structure doesn't have any common parent if these are associated.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.