3

I have a Spark DataFrame, and I want to check whether each string in a particular column exists in a column of another DataFrame. I found a similar problem in "Spark (scala) dataframes - Check whether strings in column contain any items from a set",

but unlike that question, I want to check against a column of another DataFrame rather than a List or a Set. I don't know how to convert a column to a Set or a List, and I don't know of an "exists" method on DataFrames. Can anyone help?

My data is similar to this

df1:

    +---+-----------------+
    | id|      url        |
    +---+-----------------+
    |  1|google.com       |
    |  2|facebook.com     |
    |  3|github.com       |
    |  4|stackoverflow.com|
    +---+-----------------+

df2:

    +-----+------------+
    | id  | urldetail  |
    +-----+------------+
    |  11 |google.com  |
    |  12 |yahoo.com   |
    |  13 |facebook.com|
    |  14 |twitter.com |
    |  15 |youtube.com |
    +-----+------------+

Now I am trying to add a third column to df2 that holds the result of checking whether each string in the $"urldetail" column exists in the $"url" column of df1:

    +---+------------+-------------+
    | id|  urldetail |   check     | 
    +---+------------+-------------+
    | 11|google.com  |        1    |     
    | 12|yahoo.com   |        0    |
    | 13|facebook.com|        1    |
    | 14|twitter.com |        0    |
    | 15|youtube.com |        0    |
    +---+------------+-------------+

I want to use a UDF, but I don't know how to check whether a string exists in a column of a DataFrame. Please help!

5 Comments
  • Welcome to SO! I suggest some reading: stackoverflow.com/help/minimal-reproducible-example and stackoverflow.com/help/how-to-ask. I also suggest adding sample inputs and outputs. Commented Jul 22, 2019 at 15:57
  • @Hiệp Bạch if your data looks like that, just join on the string columns of each DataFrame. If the string column of df2 contains multiple words per string, then it is a bit more complicated. Commented Jul 23, 2019 at 16:04
  • I still can't join them. Can you give me some code? Commented Jul 23, 2019 at 16:40
  • I had to do a lot of extra work because you did not post the data in the first place; please take care of that in your next post. Commented Jul 23, 2019 at 16:56
  • I'm sorry, I'm a new user, so my question was not clear. Your answer is very helpful, but I still have not gotten the desired result :( Commented Jul 23, 2019 at 17:11

2 Answers

1

I have a spark dataframe, and I wish to check whether each string in a particular column contains any number of words from a pre-defined column of another dataframe.

Here is one way to do it: join the two DataFrames on the string columns, using = or like in the join condition.

package examples

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CompareColumns extends App {
  // Quieten Spark's own logging so the output tables are easy to read.
  Logger.getLogger("org").setLevel(Level.WARN)

  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local")
    .getOrCreate()

  import spark.implicits._

  // Alias each DataFrame so the two url columns can be told apart in the join.
  val df1 = Seq(
    (1, "google.com"),
    (2, "facebook.com"),
    (3, "github.com"),
    (4, "stackoverflow.com")).toDF("id", "url").as("first")
  df1.show

  val df2 = Seq(
    (11, "google.com"),
    (12, "yahoo.com"),
    (13, "facebook.com"),
    (14, "twitter.com")).toDF("id", "url").as("second")
  df2.show

  // Full outer join on the url strings; rows of df1 without a match get a null check.
  val df3 = df2.join(df1, expr("first.url like second.url"), "full_outer")
    .select(
      col("first.url"),
      col("first.url").contains(col("second.url")).as("check"))
    .filter("url is not null")

  // Replace the null checks of unmatched rows with false.
  df3.na.fill(Map("check" -> false)).show
}



Result :

+---+-----------------+
| id|              url|
+---+-----------------+
|  1|       google.com|
|  2|     facebook.com|
|  3|       github.com|
|  4|stackoverflow.com|
+---+-----------------+

+---+------------+
| id|         url|
+---+------------+
| 11|  google.com|
| 12|   yahoo.com|
| 13|facebook.com|
| 14| twitter.com|
+---+------------+

+-----------------+-----+
|              url|check|
+-----------------+-----+
|       google.com| true|
|     facebook.com| true|
|       github.com|false|
|stackoverflow.com|false|
+-----------------+-----+

We can achieve this with a full outer join. For more details, see my LinkedIn article covering all the join types.

Note: Instead of 0 for false and 1 for true, I have used boolean values here. You can translate them into whatever you need.
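If the 0/1 form from the question is needed, casting the boolean column to an integer is one option. This is only a minimal sketch: the names checked and asInt and the two sample rows are made up for illustration and stand in for the joined result above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("BoolToInt").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the joined result with a boolean check column.
val checked = Seq(("google.com", true), ("github.com", false)).toDF("url", "check")

// Casting a boolean column to int maps true to 1 and false to 0.
val asInt = checked.withColumn("check", col("check").cast("int"))
asInt.show()
```

The cast keeps the column name, so the rest of the pipeline does not need to change.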

UPDATE: If the second DataFrame has more rows, you can use this version, which will not drop any rows from the second DataFrame:

// Select the columns of the second DataFrame so that every one of its rows is kept.
val df3 = df2.join(df1, expr("first.url like second.url"), "full")
  .select(
    col("second.*"),
    col("first.url").contains(col("second.url")).as("check"))
  .filter("url is not null")
df3.na.fill(Map("check" -> false)).show

You can also try regexp_extract, as shown in the post below:

https://stackoverflow.com/a/53880542/647053
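Another option, closer to the set-based question the asker mentions, is to collect the url column to the driver and test membership with isin. This is a rough sketch that assumes df1 is small enough to collect; knownUrls and result are illustrative names, and the sample rows are shortened.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("IsinCheck").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "google.com"), (2, "facebook.com")).toDF("id", "url")
val df2 = Seq((11, "google.com"), (12, "yahoo.com")).toDF("id", "urldetail")

// Collect the url column to the driver as a Set[String] (only sensible for small data).
val knownUrls: Set[String] = df1.select("url").as[String].collect().toSet

// isin takes varargs, so expand the set; the cast turns true/false into 1/0.
val result = df2.withColumn("check", col("urldetail").isin(knownUrls.toSeq: _*).cast("int"))
result.show()
```

For a large df1 the join-based approach is likely the better fit, since nothing has to be pulled onto the driver.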


7 Comments

The IDs of the two DataFrames are different, so I can't join them :(
Yes; in the last join I did not use the id column to join, please check. One problem here is that if the strings do not match in the join, it may result in a cross product.
That was a problem in my data; I have edited my post. When I join, many strings do not match :( and the sizes of the two DataFrames are different.
I want the result for every row, because my targets are the strings whose result is false.
Boss! This is the data you gave, and the result is very close. Take this and modify it according to your needs; the approach is the same.
1

Read in your data and, to be conservative, trim the strings to remove any whitespace before joining on them.

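import org.apache.spark.sql.functions.{lit, trim, when}
import spark.implicits._  // assumes an existing SparkSession named spark, as in spark-shell

// Trim the strings so stray whitespace (e.g. "github.com ") cannot break the join.
val df = Seq((1, "google.com"), (2, "facebook.com"), (3, "github.com "), (4, "stackoverflow.com"))
  .toDF("id", "url")
  .select($"id", trim($"url").as("url"))

val df2 = Seq((11, "google.com"), (12, "yahoo.com"), (13, "facebook.com"), (14, "twitter.com"), (15, "youtube.com"))
  .toDF("id", "urldetail")
  .select($"id", trim($"urldetail").as("urldetail"))

// Left outer join on the url strings; flag is 1 only for rows that found a match.
df.join(df2.withColumn("flag", lit(1)).drop("id"), df("url") === df2("urldetail"), "left_outer")
  .withColumn("contains_bool", when($"flag" === 1, true).otherwise(false))
  .drop("flag", "urldetail")
  .show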


+---+-----------------+-------------+
| id|              url|contains_bool|
+---+-----------------+-------------+
|  1|       google.com|         true|
|  2|     facebook.com|         true|
|  3|       github.com|        false|
|  4|stackoverflow.com|        false|
+---+-----------------+-------------+
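
The table in the question is keyed on df2 rather than df1, so the same join can also be run in the other direction. The following is a sketch under assumptions, not a definitive version: the names urls and result are illustrative, and distinct is added so that duplicate urls in df1 would not multiply the rows of df2.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, trim, when}

val spark = SparkSession.builder().appName("UrlCheck").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "google.com"), (2, "facebook.com"), (3, "github.com"), (4, "stackoverflow.com"))
  .toDF("id", "url")
val df2 = Seq((11, "google.com"), (12, "yahoo.com"), (13, "facebook.com"), (14, "twitter.com"), (15, "youtube.com"))
  .toDF("id", "urldetail")

// Keep only the distinct, trimmed urls of df1 and mark them with a flag column.
val urls = df1.select(trim(col("url")).as("url")).distinct().withColumn("flag", lit(1))

// The left outer join keeps every row of df2; unmatched rows have a null flag.
val result = df2
  .join(urls, trim(col("urldetail")) === col("url"), "left_outer")
  .select(col("id"), col("urldetail"), when(col("flag").isNotNull, 1).otherwise(0).as("check"))
  .orderBy("id")

result.show()
```

With the sample data, check is 1 for google.com and facebook.com and 0 for the other three rows, and no id column is used in the join condition.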

2 Comments

Why split? Besides, the IDs of the two DataFrames are different, so I can't join them. I'm using Spark version 2.3.x, so I want to use a UDF, but I don't know how to check whether a string exists in a column of a DataFrame. :(
@Hiệp Bạch you need to be explicit about what your data looks like. If the ids are different, then there are many other issues that must be considered.
