How to concat two array / list columns of different spark dataframes?

Question

I need to concatenate array columns from two different Spark DataFrames into a single DataFrame. I am looking for PySpark code.

df1.show()
+---------+
|    value|
+---------+
|[1, 2, 3]|
+---------+

df2.show()
+------+
| value|
+------+
|[4, 5]|
+------+


I need a DataFrame as below:
+------------+
| value      |
+------------+
|[1,2,3,4,5] |
+------------+

What if there are more rows? Should rows be matched by position?

– Ged, Commented Jul 15, 2019 at 11:12
Yes. Only corresponding rows should be concatenated.

– anvy elizabeth, Commented Jul 16, 2019 at 4:15
Based on position?

– Ged, Commented Jul 16, 2019 at 5:54
Yes, based on position.

– anvy elizabeth, Commented Jul 16, 2019 at 7:31
Try zipWithIndex.

– Ged, Commented Jul 16, 2019 at 7:33

Ged · Accepted Answer · 2019-07-20 10:42:47Z

There are some educational aspects here as well. The code generates some sample data first; you can strip out any .show() calls.

Spark 2.4 is assumed. Relying on row position is OK here, although some dispute whether ordering is preserved with RDDs when using just zipWithIndex; I have no evidence to doubt that it is. There are no performance considerations in terms of explicit partitioning, but no UDFs are used. I assume the same number of rows in both DataFrames. Dataset is not available in PySpark, so a conversion to RDD is needed.

from pyspark.sql.functions import col, concat

# Assumes an existing SparkSession named `spark` (as in the pyspark shell).
# Sample data: one array column named 'value' in each DataFrame.
df1 = spark.createDataFrame([[[x, x+1, x+2]] for x in range(7)], ['value'])
df2 = spark.createDataFrame([[[x+10, x+20]] for x in range(7)], ['value'])

# Attach a positional index to every row via the RDD API.
dfA = df1.rdd.map(lambda r: r.value).zipWithIndex().toDF(['value', 'index'])
dfB = df2.rdd.map(lambda r: r.value).zipWithIndex().toDF(['value', 'index'])

# Inner join on the index so that only corresponding rows are paired.
df_inner_join = dfA.join(dfB, dfA.index == dfB.index)
new_names = ['value1', 'index1', 'value2', 'index2']
df_renamed = df_inner_join.toDF(*new_names) # Issues with column renames otherwise!

# concat() accepts array columns as of Spark 2.4.
df_result = df_renamed.select(col("index1"), concat(col("value1"), col("value2")))
new_names_final = ['index', 'value']
df_result_final = df_result.toDF(*new_names_final)

Data In (generated)

+---------+
|    value|
+---------+
|[0, 1, 2]|
|[1, 2, 3]|
|[2, 3, 4]|
|[3, 4, 5]|
|[4, 5, 6]|
|[5, 6, 7]|
|[6, 7, 8]|
+---------+

+--------+
|   value|
+--------+
|[10, 20]|
|[11, 21]|
|[12, 22]|
|[13, 23]|
|[14, 24]|
|[15, 25]|
|[16, 26]|
+--------+

Data Out

+-----+-----------------+
|index|            value|
+-----+-----------------+
|    0|[0, 1, 2, 10, 20]|
|    6|[6, 7, 8, 16, 26]|
|    5|[5, 6, 7, 15, 25]|
|    1|[1, 2, 3, 11, 21]|
|    3|[3, 4, 5, 13, 23]|
|    2|[2, 3, 4, 12, 22]|
|    4|[4, 5, 6, 14, 24]|
+-----+-----------------+
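
To make the logic easier to follow without a Spark cluster, here is a minimal plain-Python sketch of the same three steps: index each row by position, inner join on the index, then concatenate the two lists. The helper names zip_with_index and concat_by_position are made up for this illustration and are not Spark APIs. The final sort is an addition of the sketch; the Spark join above returns rows in no particular order, as the Data Out table shows.

```python
def zip_with_index(rows):
    """Pair each row with its position, mimicking RDD.zipWithIndex()."""
    return [(row, i) for i, row in enumerate(rows)]


def concat_by_position(left_rows, right_rows):
    """Concatenate the lists found at the same position in both inputs."""
    left = zip_with_index(left_rows)
    # Build an index -> row lookup for the right side (the join key).
    right = {i: row for row, i in zip_with_index(right_rows)}
    # Inner join: keep only indices that exist on both sides.
    joined = [(i, lrow + right[i]) for lrow, i in left if i in right]
    # Sort by index for readable output (not done in the Spark version).
    return sorted(joined)


# Same generated data as in the answer.
left_rows = [[x, x + 1, x + 2] for x in range(7)]
right_rows = [[x + 10, x + 20] for x in range(7)]

result = concat_by_position(left_rows, right_rows)
for index, value in result:
    print(index, value)
```

Because this is an inner join, a row whose index is missing on either side is dropped, which is why the answer assumes both DataFrames have the same number of rows.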
