I need to concatenate the array columns of two different Spark DataFrames into a single DataFrame. I am looking for PySpark code.

df1.show()
+---------+
|    value|
+---------+
|[1, 2, 3]|
+---------+

df2.show()
+------+
| value|
+------+
|[4, 5]|
+------+


I need a DataFrame as below:
+------------+
| value      |
+------------+
|[1,2,3,4,5] |
+------------+
  • So what if there are more rows? Is there a positional dependence? Commented Jul 15, 2019 at 11:12
  • Yes. It should concat only corresponding rows. Commented Jul 16, 2019 at 4:15
  • Based on position? Commented Jul 16, 2019 at 5:54
  • Yes, based on position. Commented Jul 16, 2019 at 7:31
  • Try zipWithIndex. Commented Jul 16, 2019 at 7:33

1 Answer

There are some educational aspects here as well. The code generates some sample data first, and you can strip out any .show() calls.

Spark 2.4 is assumed. Relying on position is OK, although some dispute whether row order is preserved with RDDs when using just zipWithIndex; I have no evidence to doubt that it is. There are no performance considerations in terms of explicit partitioning, but no UDFs are used. I am assuming the same number of rows in both DataFrames. Dataset is not a PySpark object, so a conversion to an RDD is needed.

from pyspark.sql.functions import col, concat

# Sample data: one array column named 'value' per DataFrame.
# Assumes an existing SparkSession bound to the name spark.
df1 = spark.createDataFrame([([x, x + 1, x + 2],) for x in range(7)], ['value'])
df2 = spark.createDataFrame([([x + 10, x + 20],) for x in range(7)], ['value'])

# Attach a positional index to every row via the RDD API.
dfA = df1.rdd.map(lambda r: r.value).zipWithIndex().toDF(['value', 'index'])
dfB = df2.rdd.map(lambda r: r.value).zipWithIndex().toDF(['value', 'index'])

# Inner join on the index so that corresponding rows are paired.
df_inner_join = dfA.join(dfB, dfA.index == dfB.index)
new_names = ['value1', 'index1', 'value2', 'index2']
df_renamed = df_inner_join.toDF(*new_names)  # Issues with column renames otherwise!

# Concatenate the two arrays (concat accepts array columns as of Spark 2.4).
df_result = df_renamed.select(col("index1"), concat(col("value1"), col("value2")))
new_names_final = ['index', 'value']
# Row order after the join is not guaranteed; add .orderBy('index') if it matters.
df_result_final = df_result.toDF(*new_names_final)

Data In (generated)

+---------+
|    value|
+---------+
|[0, 1, 2]|
|[1, 2, 3]|
|[2, 3, 4]|
|[3, 4, 5]|
|[4, 5, 6]|
|[5, 6, 7]|
|[6, 7, 8]|
+---------+

+--------+
|   value|
+--------+
|[10, 20]|
|[11, 21]|
|[12, 22]|
|[13, 23]|
|[14, 24]|
|[15, 25]|
|[16, 26]|
+--------+

Data Out

+-----+-----------------+
|index|            value|
+-----+-----------------+
|    0|[0, 1, 2, 10, 20]|
|    6|[6, 7, 8, 16, 26]|
|    5|[5, 6, 7, 15, 25]|
|    1|[1, 2, 3, 11, 21]|
|    3|[3, 4, 5, 13, 23]|
|    2|[2, 3, 4, 12, 22]|
|    4|[4, 5, 6, 14, 24]|
+-----+-----------------+
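
To make the index-then-join logic easier to follow without a Spark session, here is a minimal sketch of the same idea on plain Python lists. It is only a model of the approach, not PySpark code: concat_by_position is a hypothetical helper name, enumerate stands in for zipWithIndex, and a dictionary lookup stands in for the inner join.

```python
# Plain-Python model of the zipWithIndex + inner join + concat steps.
# concat_by_position is a hypothetical helper, not part of PySpark.

def concat_by_position(left_rows, right_rows):
    # Attach a positional index to each row, as zipWithIndex does.
    left_indexed = {i: v for i, v in enumerate(left_rows)}
    right_indexed = {i: v for i, v in enumerate(right_rows)}
    # Inner join on the index: only positions present on both sides survive.
    shared = sorted(left_indexed.keys() & right_indexed.keys())
    # Concatenate the two arrays for each matched index.
    return [(i, left_indexed[i] + right_indexed[i]) for i in shared]

# Same generated data as in the answer.
rows1 = [[x, x + 1, x + 2] for x in range(7)]
rows2 = [[x + 10, x + 20] for x in range(7)]
result = concat_by_position(rows1, rows2)
print(result[0])  # (0, [0, 1, 2, 10, 20])
```

The sketch sorts by index so the output is deterministic. It also shows what the inner join does when the two inputs differ in length: unmatched trailing rows are dropped, which is why the answer assumes the same number of rows in both DataFrames.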