Scala function to explode data in a specific array to extract columns

Question

sample line:

{"expand":"names,schema","centralid":10,"centralloc":"balh",components:["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]","com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]","com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]"]

I want to explode the line above into individual rows, one per element of components, and extract the columns listed below. I am using Spark SQL to do the explode. The centralid and centralloc columns should be repeated on every row produced from components.

centralid, centralloc, components.id , components.rapidViewId, components.state, components.name, components.startDate, components.endDate, components.completeDate, components.sequence

Please share your ideas. Can you please try and use regex and also let me know how to

Edited to suit the requirement.

Chaitanya · Accepted Answer · 2018-05-08 05:30:08Z

This is one approach you can try. Suppose the components arrive as a List[String]:

val input = List("com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]",

  "com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]",

  "com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]",

  "com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]")

You can declare a case class in Scala to hold the fields you want to extract.

case class Output(id:String,rapidViewId:String, state:String, name:String, startDate:String, endDate:String, completeDate:String, sequence:String)

You can now get the final case class list from which you can extract the fields that you want.

val result = input.map { x =>
  // Keep the text after the first '[' and split it into the eight "key=value" fields.
  // Assumes no field value contains a comma.
  val fields = x.split("\\[")(1).split(",")
  // Each field still carries its "key=" prefix; the last one also has a trailing ']' to strip.
  Output(fields(0), fields(1), fields(2), fields(3), fields(4), fields(5), fields(6), fields(7).replaceAll("\\]", ""))
}

You will get the result in the format

result: List[Output] = List(Output(id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - "abc",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980), Output(id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - "abc",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974), Output(id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - "xyz",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990), Output(id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - "lmnop",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999))

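from which you can extract the fields as you need. Note that each field still holds the full key=value text (for example id=123), so strip the key= prefix if you need only the value.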

This is one of the approaches that you can use. Please let me know if you have any further doubts. I would be happy to clarify them.
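Since the question asks about regex, here is a minimal sketch of the same extraction done with a single pattern, so that each field holds only the value rather than the key=value text. The names SprintRow and SprintRegex are made up for this illustration, and the pattern assumes the eight fields always appear in the order shown in the sample.

```scala
// Hypothetical case class for illustration: one field per captured value.
final case class SprintRow(
  id: String, rapidViewId: String, state: String, name: String,
  startDate: String, endDate: String, completeDate: String, sequence: String)

object SprintRegex {
  // One capture group per field, in the order used by the sample data.
  // Lazy groups stop at the next ",key=" marker, so a comma inside the name is tolerated.
  private val Pattern =
    """.*\[id=(.*?),rapidViewId=(.*?),state=(.*?),name=(.*?),startDate=(.*?),endDate=(.*?),completeDate=(.*?),sequence=(.*?)\]""".r

  // Returns None when the string does not have the expected shape.
  def parse(component: String): Option[SprintRow] = component match {
    case Pattern(id, rapidViewId, state, name, startDate, endDate, completeDate, sequence) =>
      Some(SprintRow(id, rapidViewId, state, name, startDate, endDate, completeDate, sequence))
    case _ => None
  }
}
```

With the list from this answer, input.flatMap(SprintRegex.parse) would keep only the strings that match and silently drop the rest; whether dropping or failing is the better behavior depends on your data.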

Binzi Cao · Accepted Answer · 2018-05-08 05:43:16Z

Here is some code for reference; you might want to use a regex to make it more robust:

val c = """["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]","com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]","com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]"]"""

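c.split(",").                   // break on every comma (assumes no value contains one)
  flatMap(_.split('[')).        // separate the class-name prefix from "id=..."
  flatMap(_.split(']')).        // drop the closing bracket after "sequence=..."
  filter(_.contains('=')).      // keep only the key=value pieces
  map(_.split('=')(1)).         // keep the value after the '=' sign
  grouped(8).                   // eight fields per sprint
  map(_.toList).
  foreach(println)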

The output is like below:

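List(123, 321, CLOSED, Sprint 30 - \"abc\", 2018-03-09T16:04:40.666+11:00, 2018-03-23T16:04:00.000+11:00, 2018-03-23T14:12:44.680+11:00, 980)
List(456, 654, CLOSED, Sprint 31 - \"abc\", 2018-03-23T14:57:17.889+11:00, 2018-04-06T14:57:00.000+10:00, 2018-04-06T15:05:27.638+10:00, 974)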
List(789, 987, CLOSED, Sprint 32 - \"xyz\", 2018-04-06T15:43:52.118+10:00, 2018-04-20T15:43:00.000+10:00, 2018-04-20T14:06:26.892+10:00, 990)
List(101, 101, CLOSED, Sprint 33 - \"lmnop\", 2018-04-20T15:54:01.418+10:00, 2018-05-04T15:54:00.000+10:00, 2018-05-04T15:06:45.374+10:00, 999)
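Splitting on every comma and regrouping by eight is fragile: a sprint name that contains a comma would shift every later value. As a hedged alternative, the sketch below splits only on commas that are immediately followed by a key= marker and returns the fields by name. KeyValueParser and toMap are hypothetical names, and the approach still assumes that no value itself contains text of the form ",word=".

```scala
object KeyValueParser {
  // Turns "...Sprint@abc[id=1,name=a, b,sequence=2]" into Map(id -> 1, name -> "a, b", sequence -> 2).
  def toMap(component: String): Map[String, String] = {
    val start = component.indexOf('[')
    val end = component.lastIndexOf(']')
    if (start < 0 || end <= start) Map.empty
    else
      component.substring(start + 1, end)
        .split(",(?=\\w+=)")                      // split only before the next "key="
        .map(_.split("=", 2))                     // key and value; the value may contain '='
        .collect { case Array(k, v) => k -> v }   // skip pieces without an '=' sign
        .toMap
  }
}
```

Looking fields up by name (for example toMap(s)("state")) also removes the dependence on field order that grouped(8) has.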

Anahcolus · Accepted Answer · 2018-05-08 10:02:37Z

Please look at the comments for explanations.

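//imports needed outside spark-shell; sc and sqlContext are assumed to be the ones spark-shell provides
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
//string definition as in the question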
val str = """["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]","com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]","com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]"]"""
//parsing above string to get line by line data
val parsed = str.split("\",\"").map(line => line.substring(line.indexOf("[id="), line.length).replace("\"]", "").replaceAll("[\\[\\]]", ""))
//taking one line and forming a schema with field names before = sign
val schema = StructType(parsed(0).split(",").map(field => StructField(field.split("=")(0), StringType, true)))
//converting the parsed string to rdd by taking the values after = sign
val rdd = sc.parallelize(parsed.map(line => Row.fromSeq(line.split(",").map(field => field.split("=")(1)))))
//finally creating the desired dataframe
val df = sqlContext.createDataFrame(rdd, schema)
df.show(false)

which should give you

+---+-----------+------+---------------------+-----------------------------+-----------------------------+-----------------------------+--------+
|id |rapidViewId|state |name                 |startDate                    |endDate                      |completeDate                 |sequence|
+---+-----------+------+---------------------+-----------------------------+-----------------------------+-----------------------------+--------+
|123|321        |CLOSED|Sprint 30 - \"abc\"  |2018-03-09T16:04:40.666+11:00|2018-03-23T16:04:00.000+11:00|2018-03-23T14:12:44.680+11:00|980     |
|456|654        |CLOSED|Sprint 31 - \"abc\"  |2018-03-23T14:57:17.889+11:00|2018-04-06T14:57:00.000+10:00|2018-04-06T15:05:27.638+10:00|974     |
|789|987        |CLOSED|Sprint 32 - \"xyz\"  |2018-04-06T15:43:52.118+10:00|2018-04-20T15:43:00.000+10:00|2018-04-20T14:06:26.892+10:00|990     |
|101|101        |CLOSED|Sprint 33 - \"lmnop\"|2018-04-20T15:54:01.418+10:00|2018-05-04T15:54:00.000+10:00|2018-05-04T15:06:45.374+10:00|999     |
+---+-----------+------+---------------------+-----------------------------+-----------------------------+-----------------------------+--------+
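Every column in this DataFrame is a StringType. The three date columns use the ISO-8601 offset format, so if you need them as real timestamps, java.time can parse them directly. The sketch below is plain Scala with no Spark dependency; SprintDates and length are hypothetical names used only for illustration.

```scala
import java.time.{Duration, OffsetDateTime}

object SprintDates {
  // Parses two ISO-8601 offset timestamps such as "2018-03-09T16:04:40.666+11:00"
  // and returns the elapsed time between them. Throws DateTimeParseException on bad input.
  def length(startDate: String, endDate: String): Duration =
    Duration.between(OffsetDateTime.parse(startDate), OffsetDateTime.parse(endDate))
}
```

Because the offset is part of each value, the difference is computed on the actual instants, which matters for sprints that span the +11:00 to +10:00 change in the sample data.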

updated

Since you updated your question with a new input string and new headers, the approach above needs a few adjustments, as follows:

//string definition as in the question
val str = """{"expand":"names,schema","centralid":10,"centralloc":"balh",components:["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]","com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]","com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]"]"""

val initialParsing = str.split(":\\[")
//parsing above string to get line by line data
val parsed = initialParsing(1).split("\",\"").map(line => {
  val initialSplitted = initialParsing(0).split(",")
  Seq(initialSplitted(2).replace(":", "="), initialSplitted(3).replace(":", "=")) ++ line.substring(line.indexOf("[id="), line.length).replace("\"]", "").replaceAll("[\\[\\]]", "").split(",").map(initialSplitted(4)+"."+_)
})
//taking one line and forming a schema with field names before = sign
val schema = StructType(parsed(0).map(field => StructField(field.split("=")(0).replace("\"", ""), StringType, true)))
//converting the parsed string to rdd by taking the values after = sign
val rdd = sc.parallelize(parsed.map(line => Row.fromSeq(line.map(field => field.split("=")(1)))))
//finally creating the desired dataframe
val df = sqlContext.createDataFrame(rdd, schema)
df.show(false)

which should give you

+---------+----------+-------------+----------------------+----------------+---------------------+-----------------------------+-----------------------------+-----------------------------+-------------------+
|centralid|centralloc|components.id|components.rapidViewId|components.state|components.name      |components.startDate         |components.endDate           |components.completeDate      |components.sequence|
+---------+----------+-------------+----------------------+----------------+---------------------+-----------------------------+-----------------------------+-----------------------------+-------------------+
|10       |"balh"    |123          |321                   |CLOSED          |Sprint 30 - \"abc\"  |2018-03-09T16:04:40.666+11:00|2018-03-23T16:04:00.000+11:00|2018-03-23T14:12:44.680+11:00|980                |
|10       |"balh"    |456          |654                   |CLOSED          |Sprint 31 - \"abc\"  |2018-03-23T14:57:17.889+11:00|2018-04-06T14:57:00.000+10:00|2018-04-06T15:05:27.638+10:00|974                |
|10       |"balh"    |789          |987                   |CLOSED          |Sprint 32 - \"xyz\"  |2018-04-06T15:43:52.118+10:00|2018-04-20T15:43:00.000+10:00|2018-04-20T14:06:26.892+10:00|990                |
|10       |"balh"    |101          |101                   |CLOSED          |Sprint 33 - \"lmnop\"|2018-04-20T15:54:01.418+10:00|2018-05-04T15:54:00.000+10:00|2018-05-04T15:06:45.374+10:00|999                |
+---------+----------+-------------+----------------------+----------------+---------------------+-----------------------------+-----------------------------+-----------------------------+-------------------+

I hope the answer is helpful to guide you through rest of your work.
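To make the shape of the explode step easier to see, here is a small sketch in plain Scala collections, without Spark: one output row per component, with the parent values repeated at the front of each row. ExplodeComponents and explode are hypothetical names, and like the code above the sketch assumes that every component has the [key=value,...] form and that no value contains a comma.

```scala
object ExplodeComponents {
  // One row per component; centralid and centralloc are repeated on every row.
  def explode(centralid: String, centralloc: String, components: Seq[String]): Seq[Seq[String]] =
    components.map { component =>
      // Text between the first '[' and the last ']' (assumes both are present).
      val body = component.substring(component.indexOf('[') + 1, component.lastIndexOf(']'))
      // Keep only the part after the first '=' of each field.
      val values = body.split(",").map(_.split("=", 2).last).toSeq
      centralid +: centralloc +: values
    }
}
```

Each inner Seq then lines up with the ten column names listed in the question, centralid first and components.sequence last.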
