0

How can I add the values from dataframe A to a new column (sum) in dataframe B that contains the given pairs of dataframe A? Preferably with a UDF?

output should look like this:

dataframe A:

|id|value|
|--|-----|
|1 |   10|
|2 |  0.3|
|3 |  100|

dataframe B:(with added column sum)

|src|dst|sum  |
|---|---|-----|
|1  |2  |10.3 |
|2  |3  |100.3|
|3  |1  |110  |

I've tried this

dfB = dfB.withColumn('sum', sum(dfB.source,dfB.dst,dfA))

def sum(src,dst,dfA):
    return dfA.filter(dfA.id == src).collect()[0][1][0] + dfA.filter(dfA.id == dst).collect()[0][1][0]


2 Answers 2

1

If dfA is small enough for a broadcast join, then then this should work:

dfB.join(dfA, how="left", on=F.col("src") == F.col("id")).select(
    "src", "dst", F.coalesce(F.col("value"), F.lit(0)).alias("v1")
).join(dfA, how="left", on=F.col("src") == F.col("id")).select(
    "src", "dst", (F.col("v1") + F.coalesce(F.col("value"), F.lit(0))).alias("sum")
)

You can remove .coalesce(), if the id column contains every src and dst value. There's a few ways to functional this, but your best bet may be using .transform().

def join_sum(join_df):
    def _(df):
        return (
            df.join(join_df, how="left", on=F.col("src") == F.col("id"))
            .select("src", "dst", F.coalesce(F.col("value"), F.lit(0)).alias("v1"))
            .join(join_df, how="left", on=F.col("src") == F.col("id"))
            .select(
                "src",
                "dst",
                (F.col("v1") + F.coalesce(F.col("value"), F.lit(0))).alias("sum"),
            )
        )

    return _


dfB.transform(join_sum(dfA))
Sign up to request clarification or add additional context in comments.

Comments

1

Basically you need to join the 2 dataframes on condition (id = src OR id = dst) then group by to sum the column value:

from pyspark.sql import functions as F

output = df_a.join(
    df_b, 
    (F.col("id") == F.col("src")) | (F.col("id") == F.col("dst"))
).groupBy("src", "dst").agg(F.sum("value").alias("sum"))

output.show()
#+---+---+-----+
#|src|dst|  sum|
#+---+---+-----+
#|  2|  3|100.3|
#|  1|  2| 10.3|
#|  3|  1|110.0|
#+---+---+-----+

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.