
How can I write a dataframe that has duplicate column names after a join operation into a CSV file? Currently I am using the following code:

dfFinal.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output/', header='true')

which writes the dataframe "dfFinal" to "/home/user/output". But it does not work when the dataframe contains a duplicate column. Below is the dfFinal dataframe.

+----------+---+-----------------+---+-----------------+
|  NUMBER  | ID|AMOUNT           | ID|           AMOUNT|
+----------+---+-----------------+---+-----------------+
|9090909092|  1|               30|  1|               40|
|9090909093|  2|               30|  2|               50|
|9090909090|  3|               30|  3|               60|
|9090909094|  4|               30|  4|               70|
+----------+---+-----------------+---+-----------------+

The above dataframe is formed by a join operation. When I write it to a CSV file, I get the following error.

pyspark.sql.utils.AnalysisException: u'Found duplicate column(s) when inserting into file:/home/user/output: `amount`, `id`;'
  • I think the best option is to rename the columns before writing. Commented Oct 3, 2018 at 14:49

1 Answer


When you specify the join column as a string (or a list of strings), the join output contains only one copy of that column [1]. PySpark example:

l = [('9090909092', 1, 30), ('9090909093', 2, 30), ('9090909090', 3, 30), ('9090909094', 4, 30)]
r = [(1, 40), (2, 50), (3, 60), (4, 70)]

left = spark.createDataFrame(l, ['NUMBER', 'ID', 'AMOUNT'])
right = spark.createDataFrame(r, ['ID', 'AMOUNT'])

df = left.join(right, 'ID')
df.show()

+---+----------+------+------+
| ID|    NUMBER|AMOUNT|AMOUNT|
+---+----------+------+------+
|  1|9090909092|    30|    40|
|  3|9090909090|    30|    60|
|  2|9090909093|    30|    50|
|  4|9090909094|    30|    70|
+---+----------+------+------+

But this will still produce duplicate column names for every column that isn't a join column (the AMOUNT column in this example). For these columns you should assign new names, before or after the join, with the toDF dataframe function [2]:

newNames = ['ID', 'NUMBER', 'LAMOUNT', 'RAMOUNT']
df = df.toDF(*newNames)
df.show()

+---+----------+-------+-------+
| ID|    NUMBER|LAMOUNT|RAMOUNT|
+---+----------+-------+-------+
|  1|9090909092|     30|     40|
|  3|9090909090|     30|     60|
|  2|9090909093|     30|     50|
|  4|9090909094|     30|     70|
+---+----------+-------+-------+

[1] https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html

[2] http://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.toDF
