
How can I write a dataframe that has duplicate column names after a join operation into a CSV file? Currently I am using the following code:

dfFinal.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output/', header='true')

which writes the dataframe "dfFinal" to "/home/user/output". But it does not work when the dataframe contains a duplicate column. Below is the dfFinal dataframe.

+----------+---+-----------------+---+-----------------+
|  NUMBER  | ID|AMOUNT           | ID|           AMOUNT|
+----------+---+-----------------+---+-----------------+
|9090909092|  1|               30|  1|               40|
|9090909093|  2|               30|  2|               50|
|9090909090|  3|               30|  3|               60|
|9090909094|  4|               30|  4|               70|
+----------+---+-----------------+---+-----------------+

The above dataframe is formed by a join operation. When I write it to a CSV file, I get the following error.

pyspark.sql.utils.AnalysisException: u'Found duplicate column(s) when inserting into file:/home/user/output: `amount`, `id`;'
  • I think the best option is to rename the columns before writing. Commented Oct 3, 2018 at 14:49

1 Answer


When you specify the join column as a string (or a list of strings), the join output contains only one copy of that column [1]. PySpark example:

l = [('9090909092', 1, 30), ('9090909093', 2, 30), ('9090909090', 3, 30), ('9090909094', 4, 30)]
r = [(1, 40), (2, 50), (3, 60), (4, 70)]

left = spark.createDataFrame(l, ['NUMBER', 'ID', 'AMOUNT'])
right = spark.createDataFrame(r, ['ID', 'AMOUNT'])

df = left.join(right, 'ID')
df.show()

+---+----------+------+------+
| ID|    NUMBER|AMOUNT|AMOUNT|
+---+----------+------+------+
|  1|9090909092|    30|    40|
|  3|9090909090|    30|    60|
|  2|9090909093|    30|    50|
|  4|9090909094|    30|    70|
+---+----------+------+------+

But this will still produce duplicate column names for every column that isn't a join column (the AMOUNT column in this example). For these columns you should assign new names, before or after the join, with the toDF dataframe function [2]:

newNames = ['ID', 'NUMBER', 'LAMOUNT', 'RAMOUNT']
df = df.toDF(*newNames)
df.show()

+---+----------+-------+-------+
| ID|    NUMBER|LAMOUNT|RAMOUNT|
+---+----------+-------+-------+
|  1|9090909092|     30|     40|
|  3|9090909090|     30|     60|
|  2|9090909093|     30|     50|
|  4|9090909094|     30|     70|
+---+----------+-------+-------+

[1] https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html

[2] http://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.toDF
