Coming from SAS, I want to join multiple DataFrames in one SQL join in PySpark. In SAS that's possible, but I get the sense that in PySpark it is not. My script looks like this:

A.createOrReplaceTempView("A")
B.createOrReplaceTempView("B")
C.createOrReplaceTempView("C")

D = spark.sql("""select a.*, b.VAR_B, c.VAR_C
                 from A a left join B b on a.VAR = b.VAR
                          left join C c on a.VAR = c.VAR""")

Is that possible in PySpark? Thank you!

1 Answer

In PySpark, joins work in a similar way to SQL.

First define the DataFrames, for example:

df_a = spark.sql('select * from a')
df_b = spark.sql('select * from b')
df_c = spark.sql('select * from c')

Then you can do the joins as follows:

df_joined_a = df_a.join(df_b, df_a['VAR'] == df_b['VAR'], 'left')\
    .select(df_a['*'], df_b['VAR_B'])
df_joined_c = df_joined_a.join(df_c, df_joined_a['VAR'] == df_c['VAR'], 'left')\
    .select(df_joined_a['*'], df_c['VAR_C'])

More examples are available here - https://sparkbyexamples.com/pyspark/pyspark-join-explained-with-examples/
