Spark (scala) dataframes - Check whether strings in column exist in a column of another dataframe

Question

I have a Spark DataFrame, and I want to check whether each string in a particular column exists in a column of another DataFrame. I found a similar problem in "Spark (scala) dataframes - Check whether strings in column contain any items from a set",

but in my case the values to check against are in a column of another DataFrame, not in a List or a Set. I don't know how to convert a column to a Set or a List, and I can't find an "exists" method on DataFrame.

My data is similar to this

df1:

    +---+-----------------+
    | id|      url        |
    +---+-----------------+
    |  1|google.com       |
    |  2|facebook.com     |
    |  3|github.com       |
    |  4|stackoverflow.com|
    +---+-----------------+

df2:

    +-----+------------+
    | id  | urldetail  |
    +-----+------------+
    |  11 |google.com  |
    |  12 |yahoo.com   |
    |  13 |facebook.com|
    |  14 |twitter.com |
    |  15 |youtube.com |
    +-----+------------+

Now I am trying to add a third column that shows whether each string in the $"urldetail" column exists in $"url":

    +---+------------+-------------+
    | id|  urldetail |   check     | 
    +---+------------+-------------+
    | 11|google.com  |        1    |     
    | 12|yahoo.com   |        0    |
    | 13|facebook.com|        1    |
    | 14|twitter.com |        0    |
    | 15|youtube.com |        0    |
    +---+------------+-------------+

I want to use a UDF, but I don't know how to check whether a string exists in a column of a DataFrame. Please help!

Welcome to SO! I suggest some reading: stackoverflow.com/help/minimal-reproducible-example and stackoverflow.com/help/how-to-ask. I also suggest adding inputs and outputs.
– BlueSheepToken, Commented Jul 22, 2019 at 15:57
@Hiệp Bạch if your data looks like that, just join on the string columns of the two DataFrames. If the string column of df2 contains multiple words, then it is a bit more complicated.
– mikeL, Commented Jul 23, 2019 at 16:04
I had to do a lot of work because you did not post the data in the first place. Please take care of that in your next post.
– Ram Ghadiyaram, Commented Jul 23, 2019 at 16:56
I'm sorry, I'm a new user, so my question was not clear. Your answer is very helpful, but I have not got the desired result yet :(
– Hiệp Bạch, Commented Jul 23, 2019 at 17:11

Ram Ghadiyaram · Accepted Answer · 2019-07-23 17:23:25Z

1

I have a spark dataframe, and I wish to check whether each string in a particular column contains any number of words from a predefined column of another dataframe.

Here is one way, using a join with = or like:

package examples

import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

object CompareColumns extends App {
  // Quiet Spark's own logging so the output below stays readable
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)
  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local").getOrCreate()

  import spark.implicits._

  // Alias each DataFrame so the join condition can tell the two url columns apart
  val df1 = Seq(
    (1, "google.com"),
    (2, "facebook.com"),
    (3, "github.com"),
    (4, "stackoverflow.com")).toDF("id", "url").as("first")
  df1.show
  val df2 = Seq(
    (11, "google.com"),
    (12, "yahoo.com"),
    (13, "facebook.com"),
    (14, "twitter.com")).toDF("id", "url").as("second")
  df2.show
  // Without wildcards, `like` behaves as an equality match here.
  // The full outer join keeps unmatched rows; their check value is null.
  val df3 = df2.join(df1, expr("first.url like second.url"), "full_outer")
    .select(
      col("first.url"),
      col("first.url").contains(col("second.url")).as("check"))
    .filter("url is not null") // drop rows that exist only in df2
  // Unmatched rows have a null check; report them as false
  df3.na.fill(Map("check" -> false))
    .show
}

Result :

+---+-----------------+
| id|              url|
+---+-----------------+
|  1|       google.com|
|  2|     facebook.com|
|  3|       github.com|
|  4|stackoverflow.com|
+---+-----------------+

+---+------------+
| id|         url|
+---+------------+
| 11|  google.com|
| 12|   yahoo.com|
| 13|facebook.com|
| 14| twitter.com|
+---+------------+

+-----------------+-----+
|              url|check|
+-----------------+-----+
|       google.com| true|
|     facebook.com| true|
|       github.com|false|
|stackoverflow.com|false|
+-----------------+-----+

A full outer join makes this possible. For more details, see my LinkedIn article covering all the join types.

Note: instead of 0 for false and 1 for true, I have used boolean values here. You can translate them into whatever you want.
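The translation mentioned in the note can be sketched as follows. This is a minimal example, assuming a local SparkSession; the object name `CheckToInt`, the helper `toInt`, and the two-row sample DataFrame are illustrative stand-ins for `df3`, not part of the original answer. It relies on the fact that casting a boolean column to `int` maps true to 1 and false to 0.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CheckToInt {
  // Replaces the boolean "check" column with an integer one (true -> 1, false -> 0).
  // `toInt` is a hypothetical helper name used only for this sketch.
  def toInt(df: DataFrame): DataFrame =
    df.withColumn("check", col("check").cast("int"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CheckToInt").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for df3: a url plus a boolean check column
    val checked = Seq(("google.com", true), ("github.com", false)).toDF("url", "check")
    toInt(checked).show()

    spark.stop()
  }
}
```

Applying the cast after `na.fill` keeps unmatched rows at 0 rather than null.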

UPDATE: If the second DataFrame has more rows, you can use this instead; it will not drop any rows from the second DataFrame:

val df3 = df2.join(df1, expr("first.url  like  second.url"), "full").select(
    col("second.*")
    , col("first.url").contains(col("second.url")).as("check"))
    .filter("url is not null")
  df3.na.fill(Map("check" -> false))
    .show

One more thing: you can also try regexp_extract, as shown in the post below:

https://stackoverflow.com/a/53880542/647053

edited Jul 23, 2019 at 17:23

answered Jul 22, 2019 at 17:11

Ram Ghadiyaram


7 Comments

Hiệp Bạch Over a year ago

The IDs of the two DataFrames are different, so I can't join them :(

Ram Ghadiyaram Over a year ago

Yes, in the last join I have not used the id column to join, please check. One problem here is that if the strings do not match in the join, it may result in a cross product.

Hiệp Bạch Over a year ago

That was a problem in my data; I have edited my post. When I join, many strings do not match :( and the sizes of the two DataFrames are different.

Hiệp Bạch Over a year ago

I want the result for all rows, because my targets are the strings whose result is false.

Ram Ghadiyaram Over a year ago

Boss! This is the data you gave, and I have come very close. Take this and modify it according to your needs. The approach is the same.


mikeL · Accepted Answer · 2019-07-23 17:10:17Z

1

Read in your data and, to be conservative, apply trim to remove whitespace from the strings before joining on them:

// Assumes spark-shell; in an application also add `import spark.implicits._`
import org.apache.spark.sql.functions.{lit, trim, when}

val df = Seq((1, "google.com"), (2, "facebook.com"), (3, "github.com "), (4, "stackoverflow.com"))
  .toDF("id", "url").select($"id", trim($"url").as("url"))

val df2 = Seq((11, "google.com"), (12, "yahoo.com"), (13, "facebook.com"), (14, "twitter.com"), (15, "youtube.com"))
  .toDF("id", "urldetail").select($"id", trim($"urldetail").as("urldetail"))

// flag = 1 marks rows that came from df2; unmatched df rows get a null flag after the left outer join
df.join(df2.withColumn("flag", lit(1)).drop("id"), df("url") === df2("urldetail"), "left_outer")
  .withColumn("contains_bool", when($"flag" === 1, true).otherwise(false))
  .drop("flag", "urldetail")
  .show


+---+-----------------+-------------+
| id|              url|contains_bool|
+---+-----------------+-------------+
|  1|       google.com|         true|
|  2|     facebook.com|         true|
|  3|       github.com|        false|
|  4|stackoverflow.com|        false|
+---+-----------------+-------------+
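The question asks for the flag on the rows of df2 (with 1 and 0) rather than on the rows of df1. A minimal sketch of the same left-outer-join idea with the roles swapped is shown below. It assumes exact matches after trimming are what is wanted and that a local SparkSession is acceptable; the names `UrlExistsCheck`, `flagUrls`, `known_url`, and `hit` are hypothetical and chosen only for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, trim, when}

object UrlExistsCheck {
  // Adds check = 1 when details.urldetail also appears in urls.url, else 0.
  def flagUrls(details: DataFrame, urls: DataFrame): DataFrame = {
    // Distinct lookup values, so duplicate urls cannot multiply the rows of `details`
    val known = urls
      .select(trim(col("url")).as("known_url"))
      .distinct()
      .withColumn("hit", lit(1))

    details
      .join(known, trim(col("urldetail")) === col("known_url"), "left_outer")
      // Unmatched rows carry a null `hit` after the left outer join
      .withColumn("check", when(col("hit").isNotNull, 1).otherwise(0))
      .drop("known_url", "hit")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("UrlExistsCheck").master("local[*]").getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, "google.com"), (2, "facebook.com"), (3, "github.com"), (4, "stackoverflow.com"))
      .toDF("id", "url")
    val df2 = Seq((11, "google.com"), (12, "yahoo.com"), (13, "facebook.com"), (14, "twitter.com"), (15, "youtube.com"))
      .toDF("id", "urldetail")

    flagUrls(df2, df1).orderBy("id").show()
    spark.stop()
  }
}
```

Because the join is on the string columns only, the differing id values in the two DataFrames do not matter here.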

edited Jul 23, 2019 at 17:10

answered Jul 22, 2019 at 21:13

mikeL


2 Comments

Hiệp Bạch Over a year ago

Why split? Besides, the IDs of the two DataFrames are different, so I can't join them. I'm using Spark 2.3.x, so I want to use a UDF, but I don't know how to check whether a string exists in a column of a DataFrame. :(

mikeL Over a year ago

@Hiệp Bạch you need to be explicit about what your data looks like. If the ids are different, then there are many other issues that must be considered.
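
On the follow-up about checking membership without a join: one possible sketch, which is an alternative to a UDF and not something either answer proposes, is to collect the distinct url values to the driver and test membership with `Column.isin`. This assumes the url column is small enough to fit in driver memory; the names `UrlIsInCheck` and `flagWithIsin` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit}

object UrlIsInCheck {
  // Flags each details.urldetail as 1 (present in urls.url) or 0 (absent).
  // Only reasonable when the distinct url values fit comfortably on the driver.
  def flagWithIsin(details: DataFrame, urls: DataFrame): DataFrame = {
    // Turn the column into a local Seq[String], skipping nulls and duplicates
    val known: Seq[String] =
      urls.select("url").na.drop().distinct().collect().map(_.getString(0)).toSeq

    // isin yields null for a null urldetail; coalesce reports that as 0
    details.withColumn(
      "check",
      coalesce(col("urldetail").isin(known: _*).cast("int"), lit(0)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("UrlIsInCheck").master("local[*]").getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, "google.com"), (2, "facebook.com"), (3, "github.com"), (4, "stackoverflow.com"))
      .toDF("id", "url")
    val df2 = Seq((11, "google.com"), (12, "yahoo.com"), (13, "facebook.com"), (14, "twitter.com"), (15, "youtube.com"))
      .toDF("id", "urldetail")

    flagWithIsin(df2, df1).show()
    spark.stop()
  }
}
```

For a large lookup column, a join is usually the safer choice, since it avoids pulling the values onto the driver.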
