I have a stream of XML records that I process in Scala using hadoopRDD and finally save to a single file. However, I need to sort those XML records on certain attributes before saving them to the output file.

I thought of creating a list that pairs a value extracted from each XML record with the record itself, like below.

Input

<Transaction>
    <eventId>1234</eventId>
    <eventName>hello</eventName>
    .......
</Transaction>
<Transaction>
    <eventId>2345</eventId>
    <eventName>hi</eventName>
    .......
</Transaction>

--- and so on

My idea is to create a list such as {(1234, xml1), (2345, xml2), ...}, sort on the first element, and save the second element to the output file.

How can this be done in Scala, or is there a better approach? Thanks in advance for your suggestions and help.

1 Answer

I was able to figure it out as follows. First, I wrote a function that extracts the eventId from an XML record and used it to pair each record with its eventId:

val rdd = input.map {x => (geteventId(x) , x)}

Then I sorted on eventId, kept only the XML, and saved the result to HDFS:

val result = rdd.sortBy(x => x._1).map(x => x._2)

geteventId(x) parses the XML and returns the value of eventId.
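Since the body of geteventId is not shown, here is a minimal sketch of what it might look like, assuming each record contains exactly one `<eventId>` element holding a plain integer. The regex and the `SortXmlByEventId` object are illustrative assumptions, not the original code, and the sort runs on a local `Seq` so the example needs no Spark dependency. The key-sort-drop shape is the same as in the RDD version.

```scala
object SortXmlByEventId {
  // Assumed record layout: <eventId>digits</eventId> appears once per record.
  private val EventId = """<eventId>\s*(\d+)\s*</eventId>""".r

  // Hypothetical stand-in for the geteventId helper mentioned in the answer.
  // Returning a Long makes the sort numeric rather than lexicographic.
  def geteventId(xml: String): Long =
    EventId.findFirstMatchIn(xml) match {
      case Some(m) => m.group(1).toLong
      case None    => throw new IllegalArgumentException("no <eventId> in record")
    }

  // Pair each record with its key, sort by the key, then drop the key.
  def sortRecords(records: Seq[String]): Seq[String] =
    records.map(x => (geteventId(x), x)).sortBy(_._1).map(_._2)
}
```

Parsing the id as a number matters if ids can differ in length: sorted as strings, "10" would come before "9". A regex is enough for flat records like the sample input; records with nested or repeated eventId elements would need a real XML parser.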
