
Suppose we are given a dataset ("DATA") like this:

YEAR | FIRST NAME | LAST NAME | VARIABLES
2008 | JOY        | ANDERSON  | spark|python|scala; 45;w/o sports;w datascience
2008 | STEVEN     | JOHNSON   | Spark|R; 90|56
2006 | NIHA       | DIVA      | w/o sports

and we have another dataset ("RESULT") like this:

YEAR | FIRST NAME | LAST NAME 
1992 | EMMA       | CENA 
2008 | JOY        | ANDERSON
2008 | STEVEN     | ANDERSON
2006 | NIHA       | DIVA
and so on.

The output ("RESULT") should be:

YEAR | FIRST NAME | LAST NAME | SUBJECT | SCORE | SPORTS | DATASCIENCE
1992 | EMMA       | CENA      |         |       |        |              
2008 | JOY        | ANDERSON  | SPARK   | 45    | FALSE  | TRUE
2008 | JOY        | ANDERSON  | PYTHON  | 45    | FALSE  | TRUE
2008 | JOY        | ANDERSON  | SCALA   | 45    | FALSE  | TRUE
2008 | STEVEN     | ANDERSON  |         |       |        | 
2006 | NIHA       | DIVA      |         |       | FALSE  | 
2008 | STEVEN     | JOHNSON   | SPARK   | 90    |        |
2008 | STEVEN     | JOHNSON   | SPARK   | 56    |        |
2008 | STEVEN     | JOHNSON   | R       | 90    |        |
2008 | STEVEN     | JOHNSON   | R       | 56    |        |
and so on. 

Please note that there are some rows in DATA which are not present in RESULT, and vice versa. For example, "2008,STEVEN,JOHNSON" is not present in RESULT but is present in DATA; such entries should still be added to the RESULT dataset. The columns {SUBJECT, SCORE, SPORTS, DATASCIENCE} are my own interpretation, i.e. "spark" refers to the SUBJECT and so on. I hope this makes my query clear. I am using spark-shell with Spark DataFrames. Note that "Spark" and "spark" should be treated as the same value.
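
In spark-shell, the RESULT dataset can be built like this (a minimal sketch; all columns are assumed to be strings and the name "result" is only illustrative):

// spark-shell already imports spark.implicits._, so toDF is available;
// if the name columns carry trailing spaces (as in the fixed-width sample
// above), they would need trim() before the two datasets can be joined
val result = List(
  ("1992","EMMA","CENA"),
  ("2008","JOY","ANDERSON"),
  ("2008","STEVEN","ANDERSON"),
  ("2006","NIHA","DIVA")
).toDF("YEAR","FIRST NAME","LAST NAME")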

  • Seems very similar to your previous question, which was answered perfectly: stackoverflow.com/questions/40146760/… - you should be able to extrapolate that answer to this. Otherwise, please give a minimal example and show the code you've tried so far. Commented Oct 20, 2016 at 12:46
  • Yes, it was similar, but not exactly what I want to ask, because I have two different datasets and I want a mapping from DATA --> RESULT. The previous question was different. I tried that but didn't get the expected result because I hadn't asked it correctly. Commented Oct 20, 2016 at 12:51
  • For the w and w/o stuff, you might also be interested in this stackoverflow.com/questions/40020082/…. Do you have a single column w/o and w? Or do you have an arbitrary number of them each time followed by the excluded/included keyword? Commented Oct 20, 2016 at 12:53
  • @Wilmerton I have 28 types of values in VARIABLES in total, but I narrowed it down to 4 for the sake of understanding. In some values I do not have the w or w/o marker; for example, for the subject I just have "Spark, python, scala etc". And I will check the link you mentioned. Commented Oct 20, 2016 at 12:55
  • For "merging" the two datasets, just search for questions about joins and dataframes; there are plenty of examples. Commented Oct 20, 2016 at 13:01

1 Answer


As explained in the comments, you can implement some of the tricky logic along the lines of the answers to "splitting row in multiple row in spark-shell".

data:

// in spark-shell, spark.implicits._ is already imported, so toDF is available
val df = List(
  ("2008","JOY       ","ANDERSON ","spark|python|scala;45;w/o sports;w datascience"),
  ("2008","STEVEN    ","JOHNSON  ","Spark|R;90|56"),
  ("2006","NIHA      ","DIVA     ","w/o sports")
).toDF("YEAR","FIRST NAME","LAST NAME","VARIABLE")

I will only highlight the relatively tricky parts; you can figure out the details yourself. I suggest handling the "w" and "w/o" tags separately. Furthermore, you have to explode the language in a separate "sql" statement. This gives

// sep is never defined in the snippet as posted; any marker string that does
// not occur in the data will do (the value below is an assumption)
val sep = ";;;"

val step1 = df.withColumn("backrefReplace",
    split(regexp_replace('VARIABLE,"^([A-z|]+)?;?([\\d\\|]+)?;?(w.*)?$","$1"+sep+"$2"+sep+"$3"),sep))
  .withColumn("letter",explode(split('backrefReplace(0),"\\|")))
  .select('YEAR,$"FIRST NAME",$"LAST NAME",'VARIABLE,'letter,
    explode(split('backrefReplace(1),"\\|")).as("digits"),
    'backrefReplace(2).as("tags")
  )

which gives

scala> step1.show(false)
+----+----------+---------+----------------------------------------------+------+------+------------------------+
|YEAR|FIRST NAME|LAST NAME|VARIABLE                                      |letter|digits|tags                    |
+----+----------+---------+----------------------------------------------+------+------+------------------------+
|2008|JOY       |ANDERSON |spark|python|scala;45;w/o sports;w datascience|spark |45    |w/o sports;w datascience|
|2008|JOY       |ANDERSON |spark|python|scala;45;w/o sports;w datascience|python|45    |w/o sports;w datascience|
|2008|JOY       |ANDERSON |spark|python|scala;45;w/o sports;w datascience|scala |45    |w/o sports;w datascience|
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |Spark |90    |                        |
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |Spark |56    |                        |
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |R     |90    |                        |
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |R     |56    |                        |
|2006|NIHA      |DIVA     |w/o sports                                    |      |      |w/o sports              |
+----+----------+---------+----------------------------------------------+------+------+------------------------+

Then you have to handle capitalisation and the tags. For the tags, you can write relatively generic code using explode and pivot, but you have to do some cleaning to match your exact result. Here is an example:

List(("a;b;c")).toDF("str")
  .withColumn("char",explode(split('str,";")))
  .groupBy('str)
  .pivot("char")
  .count
  .show()

+-----+---+---+---+
|  str|  a|  b|  c|
+-----+---+---+---+
|a;b;c|  1|  1|  1|
+-----+---+---+---+

Read more about pivot here
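
To get from step1 to the SUBJECT/SCORE/SPORTS/DATASCIENCE layout, one possible cleaning step (a sketch, assuming every tag is ";"-separated and starts with either "w " or "w/o ") is to upper-case the subject, turn each tag into a boolean, and pivot on the tag name:

// continuing from step1 above; "step2", "flag" and "key" are illustrative names
val step2 = step1
  .withColumn("SUBJECT", upper('letter))                               // "Spark"/"spark" -> "SPARK"
  .withColumnRenamed("digits", "SCORE")
  .withColumn("tag", explode(split('tags, ";")))                       // one row per tag
  .withColumn("flag", not('tag.startsWith("w/o")))                     // "w/o sports" -> false, "w datascience" -> true
  .withColumn("key", upper(trim(regexp_replace('tag, "^w(/o)?", "")))) // "SPORTS", "DATASCIENCE"
  .groupBy('YEAR, $"FIRST NAME", $"LAST NAME", 'SUBJECT, 'SCORE)
  .pivot("key")
  .agg(first('flag))
// rows without tags produce an empty key, so the resulting "" column still
// has to be dropped; that is part of the cleaning mentioned above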

The final step is simply a left join with the second dataset, keeping "RESULT" as the left (first) table.
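
A sketch of that join, assuming RESULT is available as a DataFrame named result with the same YEAR / FIRST NAME / LAST NAME columns and that the key columns have already been trimmed of padding:

// keep every RESULT row and attach the parsed columns where DATA has a match
val out = result.join(step2, Seq("YEAR", "FIRST NAME", "LAST NAME"), "left")
out.show(false)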
