
I have 2 dataframes:

In df_1, the column id_normalized contains only letters and digits (i.e., normalized), alongside a column id_no_normalized. Example:

id_normalized | id_no_normalized
--------------|------------------
ABC           | A_B.C
ERFD          | E.R_FD
12ZED         | 12_Z.ED

In df_2, the column name contains only letters and digits, with the normalized ids embedded in it.

Example:

name
---------------------------
googleisa12ZEDgoodnavigator
internetABCexplorer

I would like to look up each id_normalized (dataset_1) to see if it exists in name (dataset_2). If I find it, I take the value of id_no_normalized and store it in a new column in dataset_2.

Expected result:

name                        | result
----------------------------|---------
googleisa12ZEDgoodnavigator | 12_Z.ED
internetABCexplorer         | A_B.C

I did it using this code:

df_result = df_2.withColumn("id_no_normalized", df_2.name.contains(df_1.id_normalized))
return df_result.select("name", "id_normalized")

It is not working because it doesn't find the id_normalized in df_2.
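
A minimal sketch of how the lookup could be expressed as a join instead (assuming the DataFrames described above; a withColumn on df_2 cannot reference df_1's columns, but a join condition can):

# Join on a contains condition; the "left" join keeps names that
# match no id, with a null id_no_normalized.
df_result = df_2.join(
    df_1,
    df_2["name"].contains(df_1["id_normalized"]),
    "left"
).select("name", "id_no_normalized")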

My second solution works only when I limit the output to about 300 rows; when I return all the data, it takes a long time to run and never finishes:

from pyspark.sql import functions as F

df_1 = df_1.select("id_no_normalized").drop_duplicates()
df_1 = df_1.withColumn(
    "id_normalized",
    F.regexp_replace(F.col("id_no_normalized"), "[^a-zA-Z0-9]+", ""))
df_2 = df_2.select("name")
extract = F.expr('position(id_normalized IN name) > 0')
result = df_1.join(df_2, extract)
return result
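
One direction for a speed-up (a sketch, assuming df_1 is small after drop_duplicates()): broadcast it so Spark evaluates the contains condition map-side instead of shuffling both sides.

from pyspark.sql import functions as F

# Broadcasting the small id table avoids a shuffle; each executor
# checks its partition of df_2 against the full (small) id list.
extract = F.expr('position(id_normalized IN name) > 0')
result = df_2.join(F.broadcast(df_1), extract)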

How can I correct my code to resolve this? Thank you.

3 Comments

  • There is no easy way to do what you are asking efficiently. This type of join will result in a Cartesian product, which will be slow. See Efficient string suffix detection and Efficient string matching in Apache Spark. Commented Dec 5, 2019 at 17:09
  • How many unique id_normalized do you have? Commented Dec 6, 2019 at 10:47
  • @pault thanks for your answer. In fact, I succeeded in joining the 2 datasets: my idea was to use the monotonically_increasing_id() function in each dataset, then join them using the new column containing the monotonically_increasing_id values as the key (see the sketch after these comments). I asked a new question, if you can help me, knowing that I use a Spark version < 2.4. Thank you. My question is: stackoverflow.com/questions/59228817/… Commented Dec 7, 2019 at 18:59
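
A hypothetical sketch of the index-key approach mentioned in the last comment. Note that monotonically_increasing_id() alone does not produce matching values across two DataFrames, so a row_number() over it is used here to build a positionally aligned key; this only makes sense if the two DataFrames share a meaningful common row order.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Give each row a consecutive index in both DataFrames, then join on it.
w = Window.orderBy(F.monotonically_increasing_id())
df_1_keyed = df_1.withColumn("row_id", F.row_number().over(w))
df_2_keyed = df_2.withColumn("row_id", F.row_number().over(w))
joined = df_1_keyed.join(df_2_keyed, on="row_id")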

1 Answer


We can solve this using a cross join and applying a UDF on the new DataFrame, but again we need to ensure it works on a big dataset.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

data1 = [
 {"id_normalized": "ABC", "id_no_normalized": "A_B.C"},
 {"id_normalized": "ERFD", "id_no_normalized": "E.R_FD"},
 {"id_normalized": "12ZED", "id_no_normalized": "12_Z.ED"}
]

data2 = [
 {"name": "googleisa12ZEDgoodnavigator"},
 {"name": "internetABCexplorer"}
]

df1 = spark.createDataFrame(data1, ["id_no_normalized", "id_normalized"])
df2 = spark.createDataFrame(data2, ["name"])

# Pair every id with every name, then test each pair with a UDF.
df3 = df1.crossJoin(df2)
# str.find returns the match position, or -1 when the id is absent.
search_for_udf = udf(lambda name, id_normalized: name.find(id_normalized), returnType=IntegerType())
df4 = df3.withColumn("contain", search_for_udf(df3["name"], df3["id_normalized"]))
df4.filter(df4["contain"] > -1).show()


Output:
+----------------+-------------+--------------------+-------+
|id_no_normalized|id_normalized|                name|contain|
+----------------+-------------+--------------------+-------+
|           A_B.C|          ABC| internetABCexplorer|      8|
|         12_Z.ED|        12ZED|googleisa12ZEDgoo...|      9|
+----------------+-------------+--------------------+-------+

I believe there are some Spark techniques available to make the cross join more efficient.
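
For instance, a sketch (not part of the original answer): broadcast the smaller DataFrame and replace the Python UDF with the built-in Column.contains, so the matching stays inside the JVM.

from pyspark.sql import functions as F

# Broadcast the small id table and use the native string predicate;
# this avoids Python UDF serialization on every cross-joined row.
df3 = df2.crossJoin(F.broadcast(df1))
result = df3.filter(df3["name"].contains(df3["id_normalized"]))
result.select("name", "id_no_normalized").show()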


2 Comments

Thank you for your quick answer. As you already mentioned, I need to ensure it works on a big dataset; I think the problem is that I'm working with big data.
This is not the result I expect; please see my question above. Thanks.
