
I have 2 dataframes:

In df_1, the column id_normalized contains only letters and digits (i.e., normalized), alongside a column id_no_normalized. Example:

id_normalized | id_no_normalized
--------------|------------------
ABC           | A_B.C
ERFD          | E.R_FD
12ZED         | 12_Z.ED

In df_2, the column name contains only letters and digits, with the normalized ids embedded in it.

Example:

name
---------------------------
googleisa12ZEDgoodnavigator
internetABCexplorer

I would like to look up each id_normalized (dataset_1) to see if it exists in name (dataset_2). If I find it, I take the value of id_no_normalized and store it in a new column in dataset_2.

Expected result:

name                        | result
----------------------------|---------
googleisa12ZEDgoodnavigator | 12_Z.ED
internetABCexplorer         | A_B.C

I did it using this code:

df_result = df_2.withColumn("id_no_normalized", df_2.name.contains(df_1.id_normalized))
return df_result.select("name", "id_normalized")

It is not working because it doesn't find the id_normalized in df_2.
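
A minimal sketch of how the lookup could be expressed as a join instead (assuming the DataFrames described above; a withColumn on df_2 cannot reference df_1's columns, but a join condition can):

# Join on a contains condition; the "left" join keeps names that
# match no id, with a null id_no_normalized.
df_result = df_2.join(
    df_1,
    df_2["name"].contains(df_1["id_normalized"]),
    "left"
).select("name", "id_no_normalized")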

My second solution works only when I limit the output to about 300 rows; when I return all the data, it takes a long time to run and never finishes:

from pyspark.sql import functions as F

df_1 = df_1.select("id_no_normalized").drop_duplicates()
df_1 = df_1.withColumn(
    "id_normalized",
    F.regexp_replace(F.col("id_no_normalized"), "[^a-zA-Z0-9]+", ""))
df_2 = df_2.select("name")
extract = F.expr('position(id_normalized IN name) > 0')
result = df_1.join(df_2, extract)
return result
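
One direction for a speed-up (a sketch, assuming df_1 is small after drop_duplicates()): broadcast it so Spark evaluates the contains condition map-side instead of shuffling both sides.

from pyspark.sql import functions as F

# Broadcasting the small id table avoids a shuffle; each executor
# checks its partition of df_2 against the full (small) id list.
extract = F.expr('position(id_normalized IN name) > 0')
result = df_2.join(F.broadcast(df_1), extract)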

How can I correct my code to resolve this? Thank you.

3 Comments

  • There is no easy way to do what you are asking efficiently. This type of join will result in a Cartesian product, which will be slow. See Efficient string suffix detection and Efficient string matching in Apache Spark. Commented Dec 5, 2019 at 17:09
  • How many unique id_normalized do you have? Commented Dec 6, 2019 at 10:47
  • @pault thanks for your answer. In fact, I succeeded in joining the 2 datasets: my idea was to use the monotonically_increasing_id() function in each dataset, then join them using the new column containing the monotonically_increasing_id values as the key (see the sketch after these comments). I asked a new question, if you can help me, knowing that I use a Spark version < 2.4. Thank you. My question is: stackoverflow.com/questions/59228817/… Commented Dec 7, 2019 at 18:59
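
A hypothetical sketch of the index-key approach mentioned in the last comment. Note that monotonically_increasing_id() alone does not produce matching values across two DataFrames, so a row_number() over it is used here to build a positionally aligned key; this only makes sense if the two DataFrames share a meaningful common row order.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Give each row a consecutive index in both DataFrames, then join on it.
w = Window.orderBy(F.monotonically_increasing_id())
df_1_keyed = df_1.withColumn("row_id", F.row_number().over(w))
df_2_keyed = df_2.withColumn("row_id", F.row_number().over(w))
joined = df_1_keyed.join(df_2_keyed, on="row_id")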

1 Answer


We can solve this using a cross join and applying a UDF on the new DataFrame, but again we need to ensure it works on a big dataset.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

data1 = [
 {"id_normalized": "ABC", "id_no_normalized": "A_B.C"},
 {"id_normalized": "ERFD", "id_no_normalized": "E.R_FD"},
 {"id_normalized": "12ZED", "id_no_normalized": "12_Z.ED"}
]

data2 = [
 {"name": "googleisa12ZEDgoodnavigator"},
 {"name": "internetABCexplorer"}
]

df1 = spark.createDataFrame(data1, ["id_no_normalized", "id_normalized"])
df2 = spark.createDataFrame(data2, ["name"])

# Pair every id with every name, then test each pair with a UDF.
df3 = df1.crossJoin(df2)
# str.find returns the match position, or -1 when the id is absent.
search_for_udf = udf(lambda name, id_normalized: name.find(id_normalized), returnType=IntegerType())
df4 = df3.withColumn("contain", search_for_udf(df3["name"], df3["id_normalized"]))
df4.filter(df4["contain"] > -1).show()


Output:
+----------------+-------------+--------------------+-------+
|id_no_normalized|id_normalized|                name|contain|
+----------------+-------------+--------------------+-------+
|           A_B.C|          ABC| internetABCexplorer|      8|
|         12_Z.ED|        12ZED|googleisa12ZEDgoo...|      9|
+----------------+-------------+--------------------+-------+

I believe there are some Spark techniques available to make the cross join more efficient.
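
For instance, a sketch (not part of the original answer): broadcast the smaller DataFrame and replace the Python UDF with the built-in Column.contains, so the matching stays inside the JVM.

from pyspark.sql import functions as F

# Broadcast the small id table and use the native string predicate;
# this avoids Python UDF serialization on every cross-joined row.
df3 = df2.crossJoin(F.broadcast(df1))
result = df3.filter(df3["name"].contains(df3["id_normalized"]))
result.select("name", "id_no_normalized").show()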


2 Comments

Thank you for your quick answer. As you already mentioned, I need to ensure it works on a big dataset; I think the problem is that I'm working with big data.
This is not the result I expect; please see my question above. Thanks.
