0

sample line:

{"expand":"names,schema","centralid":10,"centralloc":"balh",components:["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]","com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]","com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]"]

I want to explode the line above into individual rows, one per entry in components, and extract the columns listed below. I am using Spark SQL to explode the array. The centralid and centralloc columns should be repeated on every row.

centralid, centralloc, components.id , components.rapidViewId, components.state, components.name, components.startDate, components.endDate, components.completeDate, components.sequence

Please share your ideas. If possible, please use a regex, and also let me know how to

Edited to suit the requirement.

3 Answers

1

This is one approach you can try. Suppose you receive the components as a list of strings.

val input = List("com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]",

  "com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]",

  "com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]",

  "com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]")

You can declare a case class in Scala to hold the fields you want to extract.

case class Output(id:String,rapidViewId:String, state:String, name:String, startDate:String, endDate:String, completeDate:String, sequence:String)

You can now get the final case class list from which you can extract the fields that you want.

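val result = input.map { x =>
  // keep the text after the first '[' and split it into the eight key=value fields
  val intermediateResult = x.split("\\[")(1).split(",")
  // each field keeps its "key=" prefix (e.g. "id=123"); the trailing ']' is stripped from the last one
  Output(intermediateResult(0), intermediateResult(1), intermediateResult(2), intermediateResult(3), intermediateResult(4), intermediateResult(5), intermediateResult(6), intermediateResult(7).replaceAll("\\]", ""))
}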

You will get the result in the format

result: List[Output] = List(Output(id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - "abc",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980), Output(id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - "abc",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974), Output(id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - "xyz",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990), Output(id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - "lmnop",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999))

from which you can extract the fields as you need.
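Since the question asks about regex, here is a minimal sketch of the same idea with one capture group per field, so that the case class holds only the values (for example `123` rather than `id=123`). The names `SprintFields`, `SprintPattern` and `parseSprint` are mine, not part of the code above, and the sketch assumes every entry lists the eight fields in the same order.

```scala
// Same shape as the Output case class above, renamed so the sketch is self-contained.
case class SprintFields(id: String, rapidViewId: String, state: String, name: String,
                        startDate: String, endDate: String, completeDate: String, sequence: String)

// Field names in the order they appear inside the square brackets.
val fieldNames = Seq("id", "rapidViewId", "state", "name",
                     "startDate", "endDate", "completeDate", "sequence")

// Builds: .*\[id=(.*?),rapidViewId=(.*?),...,sequence=(.*?)\]
// Each group is lazy, so it stops at the next ",<fieldName>=" rather than at the first comma.
val SprintPattern =
  (""".*\[""" + fieldNames.map(f => s"$f=(.*?)").mkString(",") + """\]""").r

// Returns None when a line does not have the expected layout.
def parseSprint(line: String): Option[SprintFields] = line match {
  case SprintPattern(id, rapidViewId, state, name, start, end, complete, sequence) =>
    Some(SprintFields(id, rapidViewId, state, name, start, end, complete, sequence))
  case _ => None
}

val sampleLine = "com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]"
val parsedSprint = parseSprint(sampleLine)
```

With the `input` list above, `input.flatMap(parseSprint)` would give one `SprintFields` per entry and silently skip any line that does not match.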

This is one of the approaches that you can use. Please let me know if you have any further doubts. I would be happy to clarify them.


1

Here is some code for your reference. You might want to use a regex to make it more robust:

val c = """["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]","com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]","com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]"]"""

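c.split(",").                   // split on every comma: one piece per key=value field
  flatMap(_.split('[')).        // cut off the class-name prefix before '['
  flatMap(_.split(']')).        // cut off the closing ']' and the trailing quote
  filter(_.contains('=')).      // keep only the key=value pieces
  map(_.split('=')(1)).         // keep the value after '='
  grouped(8).                   // eight fields per sprint
  map(_.toList).
  foreach(println)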

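The output looks like this:

List(123, 321, CLOSED, Sprint 30 - \"abc\", 2018-03-09T16:04:40.666+11:00, 2018-03-23T16:04:00.000+11:00, 2018-03-23T14:12:44.680+11:00, 980)
List(456, 654, CLOSED, Sprint 31 - \"abc\", 2018-03-23T14:57:17.889+11:00, 2018-04-06T14:57:00.000+10:00, 2018-04-06T15:05:27.638+10:00, 974)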
List(789, 987, CLOSED, Sprint 32 - \"xyz\", 2018-04-06T15:43:52.118+10:00, 2018-04-20T15:43:00.000+10:00, 2018-04-20T14:06:26.892+10:00, 990)
List(101, 101, CLOSED, Sprint 33 - \"lmnop\", 2018-04-20T15:54:01.418+10:00, 2018-05-04T15:54:00.000+10:00, 2018-05-04T15:06:45.374+10:00, 999)
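As a rough sketch of the regex variant mentioned above: one pattern picks out the bracketed body of each entry, and a second one walks its key=value pairs, so each sprint becomes a `Map` keyed by field name instead of a positional list. The names `Record`, `Pair` and `parseAll` are hypothetical, and the sketch assumes that no value contains `]` or a `,word=` sequence.

```scala
// Body of one entry: everything between "[id=" and the first closing "]".
val Record = """\[(id=.*?)\]""".r

// One key=value pair; the value runs lazily up to the next ",key=" or the end of the body.
val Pair = """(\w+)=(.*?)(?=,\w+=|$)""".r

def parseAll(raw: String): List[Map[String, String]] =
  Record.findAllMatchIn(raw).map { rec =>
    Pair.findAllMatchIn(rec.group(1)).map(m => m.group(1) -> m.group(2)).toMap
  }.toList

// Shortened two-entry version of the string `c` above.
val shortInput = """["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]"]"""

val records = parseAll(shortInput)
records.foreach(r => println(s"${r("id")} ${r("state")} ${r("sequence")}"))
```

Looking fields up by name means the code does not depend on there being exactly eight fields per entry, which `grouped(8)` does.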


1

See the comments in the code for explanations.

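//imports needed outside spark-shell; sc and sqlContext are assumed to be provided by the shell
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}
//string definition as in the question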
val str = """["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]","com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]","com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]"]"""
//parsing above string to get line by line data
val parsed = str.split("\",\"").map(line => line.substring(line.indexOf("[id="), line.length).replace("\"]", "").replaceAll("[\\[\\]]", ""))
//taking one line and forming a schema with field names before = sign
val schema = StructType(parsed(0).split(",").map(field => StructField(field.split("=")(0), StringType, true)))
//converting the parsed string to rdd by taking the values after = sign
val rdd = sc.parallelize(parsed.map(line => Row.fromSeq(line.split(",").map(field => field.split("=")(1)))))
//finally creating the desired dataframe
val df = sqlContext.createDataFrame(rdd, schema)
df.show(false)

which should give you

+---+-----------+------+---------------------+-----------------------------+-----------------------------+-----------------------------+--------+
|id |rapidViewId|state |name                 |startDate                    |endDate                      |completeDate                 |sequence|
+---+-----------+------+---------------------+-----------------------------+-----------------------------+-----------------------------+--------+
|123|321        |CLOSED|Sprint 30 - \"abc\"  |2018-03-09T16:04:40.666+11:00|2018-03-23T16:04:00.000+11:00|2018-03-23T14:12:44.680+11:00|980     |
|456|654        |CLOSED|Sprint 31 - \"abc\"  |2018-03-23T14:57:17.889+11:00|2018-04-06T14:57:00.000+10:00|2018-04-06T15:05:27.638+10:00|974     |
|789|987        |CLOSED|Sprint 32 - \"xyz\"  |2018-04-06T15:43:52.118+10:00|2018-04-20T15:43:00.000+10:00|2018-04-20T14:06:26.892+10:00|990     |
|101|101        |CLOSED|Sprint 33 - \"lmnop\"|2018-04-20T15:54:01.418+10:00|2018-05-04T15:54:00.000+10:00|2018-05-04T15:06:45.374+10:00|999     |
+---+-----------+------+---------------------+-----------------------------+-----------------------------+-----------------------------+--------+
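One caveat with `field.split("=")(1)`: in Scala, as in Java, `split` without a limit drops trailing empty strings, so a field with an empty value such as `completeDate=` yields a one-element array and index 1 throws. Below is a minimal sketch of a more defensive splitter. The helper name `keyValue` is mine, and the record with an empty `completeDate` is a hypothetical example, not something taken from the question's data.

```scala
// Splits "key=value" at the first '=' only. A limit of 2 keeps an empty value
// ("key=") as "" and leaves any later '=' inside the value untouched.
def keyValue(field: String): (String, String) =
  field.split("=", 2) match {
    case Array(k, v) => (k, v)
    case Array(k)    => (k, "") // no '=' in the field at all
  }

// Hypothetical record whose completeDate is empty.
val line = "id=123,rapidViewId=321,state=CLOSED,completeDate=,sequence=980"

val pairs  = line.split(",").map(keyValue)
val names  = pairs.map(_._1).toList // would feed the StructField names
val values = pairs.map(_._2).toList // would feed Row.fromSeq
```

Using `keyValue` in place of `split("=")(0)` and `split("=")(1)` keeps the schema and the row values the same length even when a value is missing.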

Updated

Since you updated your question with a new input string and new column names, the approach above needs a few adjustments, for example:

//string definition as in the question
val str = """{"expand":"names,schema","centralid":10,"centralloc":"balh",components:["com.atlassian.greenhopper.service.sprint.Sprint@89322d3[id=123,rapidViewId=321,state=CLOSED,name=Sprint 30 - \"abc\",startDate=2018-03-09T16:04:40.666+11:00,endDate=2018-03-23T16:04:00.000+11:00,completeDate=2018-03-23T14:12:44.680+11:00,sequence=980]","com.atlassian.greenhopper.service.sprint.Sprint@42e71215[id=456,rapidViewId=654,state=CLOSED,name=Sprint 31 - \"abc\",startDate=2018-03-23T14:57:17.889+11:00,endDate=2018-04-06T14:57:00.000+10:00,completeDate=2018-04-06T15:05:27.638+10:00,sequence=974]","com.atlassian.greenhopper.service.sprint.Sprint@226753d[id=789,rapidViewId=987,state=CLOSED,name=Sprint 32 - \"xyz\",startDate=2018-04-06T15:43:52.118+10:00,endDate=2018-04-20T15:43:00.000+10:00,completeDate=2018-04-20T14:06:26.892+10:00,sequence=990]","com.atlassian.greenhopper.service.sprint.Sprint@74bcf2de[id=101,rapidViewId=101,state=CLOSED,name=Sprint 33 - \"lmnop\",startDate=2018-04-20T15:54:01.418+10:00,endDate=2018-05-04T15:54:00.000+10:00,completeDate=2018-05-04T15:06:45.374+10:00,sequence=999]"]"""

val initialParsing = str.split(":\\[")
//parsing above string to get line by line data
val parsed = initialParsing(1).split("\",\"").map(line => {
  val initialSplitted = initialParsing(0).split(",")
  Seq(initialSplitted(2).replace(":", "="), initialSplitted(3).replace(":", "=")) ++ line.substring(line.indexOf("[id="), line.length).replace("\"]", "").replaceAll("[\\[\\]]", "").split(",").map(initialSplitted(4)+"."+_)
})
//taking one line and forming a schema with field names before = sign
val schema = StructType(parsed(0).map(field => StructField(field.split("=")(0).replace("\"", ""), StringType, true)))
//converting the parsed string to rdd by taking the values after = sign
val rdd = sc.parallelize(parsed.map(line => Row.fromSeq(line.map(field => field.split("=")(1)))))
//finally creating the desired dataframe
val df = sqlContext.createDataFrame(rdd, schema)
df.show(false)

which should give you

+---------+----------+-------------+----------------------+----------------+---------------------+-----------------------------+-----------------------------+-----------------------------+-------------------+
|centralid|centralloc|components.id|components.rapidViewId|components.state|components.name      |components.startDate         |components.endDate           |components.completeDate      |components.sequence|
+---------+----------+-------------+----------------------+----------------+---------------------+-----------------------------+-----------------------------+-----------------------------+-------------------+
|10       |"balh"    |123          |321                   |CLOSED          |Sprint 30 - \"abc\"  |2018-03-09T16:04:40.666+11:00|2018-03-23T16:04:00.000+11:00|2018-03-23T14:12:44.680+11:00|980                |
|10       |"balh"    |456          |654                   |CLOSED          |Sprint 31 - \"abc\"  |2018-03-23T14:57:17.889+11:00|2018-04-06T14:57:00.000+10:00|2018-04-06T15:05:27.638+10:00|974                |
|10       |"balh"    |789          |987                   |CLOSED          |Sprint 32 - \"xyz\"  |2018-04-06T15:43:52.118+10:00|2018-04-20T15:43:00.000+10:00|2018-04-20T14:06:26.892+10:00|990                |
|10       |"balh"    |101          |101                   |CLOSED          |Sprint 33 - \"lmnop\"|2018-04-20T15:54:01.418+10:00|2018-05-04T15:54:00.000+10:00|2018-05-04T15:06:45.374+10:00|999                |
+---------+----------+-------------+----------------------+----------------+---------------------+-----------------------------+-----------------------------+-----------------------------+-------------------+
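A practical note on the resulting column names: Spark interprets a dot in a column reference as struct field access, so a name like `components.id` has to be wrapped in backticks when you select it. The sketch below assumes Spark 2.x or later (it uses `SparkSession` rather than the `sqlContext` above) and a small stand-in DataFrame named `demo`; both are my assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("dotted-columns").getOrCreate()
import spark.implicits._

// Tiny stand-in for the DataFrame above, with the same style of column names.
val demo = Seq(("10", "123", "CLOSED")).toDF("centralid", "components.id", "components.state")

// Backticks make Spark read the whole string as a single column name.
val ids = demo.select(col("`components.id`")).as[String].collect().toList

// Alternatively, rename every column once so no escaping is needed afterwards.
val renamed = demo.toDF(demo.columns.map(_.replace(".", "_")): _*)
renamed.show(false)
```

Renaming once is usually the less error-prone option if the DataFrame is going to be queried with SQL afterwards.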

I hope this answer helps guide you through the rest of your work.
