
Suppose we are given a dataset ("DATA") like this:

YEAR | FIRST NAME | LAST NAME | VARIABLES
2008 | JOY        | ANDERSON  | spark|python|scala; 45;w/o sports;w datascience
2008 | STEVEN     | JOHNSON   | Spark|R; 90|56
2006 | NIHA       | DIVA      | w/o sports

and we have another dataset ("RESULT") like this:

YEAR | FIRST NAME | LAST NAME 
1992 | EMMA       | CENA 
2008 | JOY        | ANDERSON
2008 | STEVEN     | ANDERSON
2006 | NIHA       | DIVA
and so on.

The output ("RESULT") should be:

YEAR | FIRST NAME | LAST NAME | SUBJECT | SCORE | SPORTS | DATASCIENCE
1992 | EMMA       | CENA      |         |       |        |              
2008 | JOY        | ANDERSON  | SPARK   | 45    | FALSE  | TRUE
2008 | JOY        | ANDERSON  | PYTHON  | 45    | FALSE  | TRUE
2008 | JOY        | ANDERSON  | SCALA   | 45    | FALSE  | TRUE
2008 | STEVEN     | ANDERSON  |         |       |        | 
2006 | NIHA       | DIVA      |         |       | FALSE  | 
2008 | STEVEN     | JOHNSON   | SPARK   | 90    |        |
2008 | STEVEN     | JOHNSON   | SPARK   | 56    |        |
2008 | STEVEN     | JOHNSON   | R       | 90    |        |
2008 | STEVEN     | JOHNSON   | R       | 56    |        |
and so on. 

Please note that there are some rows in DATA which are not present in RESULT, and vice versa. For example, "2008,STEVEN,JOHNSON" is not present in RESULT but is present in DATA; such entries should still be added to the RESULT dataset. The columns {SUBJECT, SCORE, SPORTS, DATASCIENCE} are my own interpretation, i.e. "spark" refers to the SUBJECT and so on. I hope this makes my query clear. I am using spark-shell with Spark DataFrames. Note that "Spark" and "spark" should be treated as the same value.
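
In spark-shell, the RESULT dataset can be built like this (a minimal sketch; all columns are assumed to be strings and the name "result" is only illustrative):

// spark-shell already imports spark.implicits._, so toDF is available;
// if the name columns carry trailing spaces (as in the fixed-width sample
// above), they would need trim() before the two datasets can be joined
val result = List(
  ("1992","EMMA","CENA"),
  ("2008","JOY","ANDERSON"),
  ("2008","STEVEN","ANDERSON"),
  ("2006","NIHA","DIVA")
).toDF("YEAR","FIRST NAME","LAST NAME")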

  • Seems very similar to your previous question, which was answered perfectly: stackoverflow.com/questions/40146760/… - you should be able to extrapolate that answer to this. Otherwise, please give a minimal example and show the code you've tried so far. Commented Oct 20, 2016 at 12:46
  • Yes, it was similar, but not exactly what I want to ask, because I have two different datasets and I want a mapping from DATA --> RESULT. The previous question was different. I tried that but didn't get the expected result because I hadn't asked it correctly. Commented Oct 20, 2016 at 12:51
  • For the w and w/o stuff, you might also be interested in this stackoverflow.com/questions/40020082/…. Do you have a single column w/o and w? Or do you have an arbitrary number of them each time followed by the excluded/included keyword? Commented Oct 20, 2016 at 12:53
  • @Wilmerton I have 28 types of values in VARIABLES in total, but I narrowed it down to 4 for the sake of understanding. In some values I do not have the w or w/o marker; for example, for the subject I just have "Spark, python, scala etc". And I will check the link you mentioned. Commented Oct 20, 2016 at 12:55
  • For "merging" the two datasets, just search for questions about joins and dataframes; there are plenty of examples. Commented Oct 20, 2016 at 13:01

1 Answer


As explained in the comments, you can implement some of the tricky logic along the lines of the answers to "splitting row in multiple row in spark-shell".

data:

// in spark-shell, spark.implicits._ is already imported, so toDF is available
val df = List(
  ("2008","JOY       ","ANDERSON ","spark|python|scala;45;w/o sports;w datascience"),
  ("2008","STEVEN    ","JOHNSON  ","Spark|R;90|56"),
  ("2006","NIHA      ","DIVA     ","w/o sports")
).toDF("YEAR","FIRST NAME","LAST NAME","VARIABLE")

I will only highlight the relatively tricky parts; you can figure out the details yourself. I suggest handling the "w" and "w/o" tags separately. Furthermore, you have to explode the language in a separate "sql" statement. This gives

// sep is never defined in the snippet as posted; any marker string that does
// not occur in the data will do (the value below is an assumption)
val sep = ";;;"

val step1 = df.withColumn("backrefReplace",
    split(regexp_replace('VARIABLE,"^([A-z|]+)?;?([\\d\\|]+)?;?(w.*)?$","$1"+sep+"$2"+sep+"$3"),sep))
  .withColumn("letter",explode(split('backrefReplace(0),"\\|")))
  .select('YEAR,$"FIRST NAME",$"LAST NAME",'VARIABLE,'letter,
    explode(split('backrefReplace(1),"\\|")).as("digits"),
    'backrefReplace(2).as("tags")
  )

which gives

scala> step1.show(false)
+----+----------+---------+----------------------------------------------+------+------+------------------------+
|YEAR|FIRST NAME|LAST NAME|VARIABLE                                      |letter|digits|tags                    |
+----+----------+---------+----------------------------------------------+------+------+------------------------+
|2008|JOY       |ANDERSON |spark|python|scala;45;w/o sports;w datascience|spark |45    |w/o sports;w datascience|
|2008|JOY       |ANDERSON |spark|python|scala;45;w/o sports;w datascience|python|45    |w/o sports;w datascience|
|2008|JOY       |ANDERSON |spark|python|scala;45;w/o sports;w datascience|scala |45    |w/o sports;w datascience|
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |Spark |90    |                        |
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |Spark |56    |                        |
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |R     |90    |                        |
|2008|STEVEN    |JOHNSON  |Spark|R;90|56                                 |R     |56    |                        |
|2006|NIHA      |DIVA     |w/o sports                                    |      |      |w/o sports              |
+----+----------+---------+----------------------------------------------+------+------+------------------------+

Then you have to handle capitalisation and the tags. For the tags, you can write relatively generic code using explode and pivot, but you have to do some cleaning to match your exact result. Here is an example:

List(("a;b;c")).toDF("str")
  .withColumn("char",explode(split('str,";")))
  .groupBy('str)
  .pivot("char")
  .count
  .show()

+-----+---+---+---+
|  str|  a|  b|  c|
+-----+---+---+---+
|a;b;c|  1|  1|  1|
+-----+---+---+---+

Read more about pivot here
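
To get from step1 to the SUBJECT/SCORE/SPORTS/DATASCIENCE layout, one possible cleaning step (a sketch, assuming every tag is ";"-separated and starts with either "w " or "w/o ") is to upper-case the subject, turn each tag into a boolean, and pivot on the tag name:

// continuing from step1 above; "step2", "flag" and "key" are illustrative names
val step2 = step1
  .withColumn("SUBJECT", upper('letter))                               // "Spark"/"spark" -> "SPARK"
  .withColumnRenamed("digits", "SCORE")
  .withColumn("tag", explode(split('tags, ";")))                       // one row per tag
  .withColumn("flag", not('tag.startsWith("w/o")))                     // "w/o sports" -> false, "w datascience" -> true
  .withColumn("key", upper(trim(regexp_replace('tag, "^w(/o)?", "")))) // "SPORTS", "DATASCIENCE"
  .groupBy('YEAR, $"FIRST NAME", $"LAST NAME", 'SUBJECT, 'SCORE)
  .pivot("key")
  .agg(first('flag))
// rows without tags produce an empty key, so the resulting "" column still
// has to be dropped; that is part of the cleaning mentioned above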

The final step is simply a left join with the second dataset, keeping "RESULT" as the left (first) table.
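
A sketch of that join, assuming RESULT is available as a DataFrame named result with the same YEAR / FIRST NAME / LAST NAME columns and that the key columns have already been trimmed of padding:

// keep every RESULT row and attach the parsed columns where DATA has a match
val out = result.join(step2, Seq("YEAR", "FIRST NAME", "LAST NAME"), "left")
out.show(false)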
