How to merge duplicate columns in pyspark?

Question

I have a PySpark DataFrame in which some of the columns have the same name. I want to merge all the columns that share a name into one column. For example, input dataframe:

How can I do this in pyspark? Any help would be highly appreciated.

Yes, due to some operations like column renaming, the dataframe has duplicate columns
– Priyanshu, Commented Jun 18, 2021 at 3:42
Duplicated columns are not selectable. You need to rework the prior processing steps to ensure column names are not duplicated
– mck, Commented Jun 18, 2021 at 8:19

nonoDa · Accepted Answer · 2021-06-18 13:46:43Z

Edited to answer the OP's request to coalesce from a list.

Here's a reproducible example:

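    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    # Reuse the active session (already available as `spark` in the pyspark shell)
    spark = SparkSession.builder.getOrCreate()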

    df = spark.createDataFrame([
        ("z","a", None, None),
        ("b",None,"c", None),
        ("c","b", None, None),
        ("d",None, None, "z"),
    ], ["a","c", "c","c"])
    
    df.show()
    
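    # Fix duplicated column names: the first occurrence keeps its name,
    # later occurrences get a numeric suffix (the counter is shared across all names)
    old_col = df.schema.names
    running_list = []
    new_col = []
    i = 0
    for column in old_col:
        if column in running_list:
            new_col.append(column + "_" + str(i))
            i += 1
        else:
            new_col.append(column)
            running_list.append(column)
    print(new_col)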
    
    df1 = df.toDF(*new_col)
    
    # Coalesce columns to get one column from a list
    a = ['c', 'c_0', 'c_1']
    to_drop = ['c_0', 'c_1']
    b = [df1[col] for col in a]

    # Take the first non-null value across the listed columns, then drop the extras
    df_merged = df1.withColumn('c', F.coalesce(*b)).drop(*to_drop)

    df_merged.show()

Output:

+---+----+----+----+
|  a|   c|   c|   c|
+---+----+----+----+
|  z|   a|null|null|
|  b|null|   c|null|
|  c|   b|null|null|
|  d|null|null|   z|
+---+----+----+----+

['a', 'c', 'c_0', 'c_1']

+---+---+
|  a|  c|
+---+---+
|  z|  a|
|  b|  c|
|  c|  b|
|  d|  z|
+---+---+

edited Jun 18, 2021 at 13:46

answered Jun 18, 2021 at 7:43

4 Comments

Priyanshu Over a year ago

Thanks for the suggestion. The problem here is that I don't know how many columns are duplicated or how many times each one is duplicated. So, when using coalesce, what should I pass to coalesce and drop?

Priyanshu Over a year ago

Actually, I'm able to collect the column names that I want to merge in a list. For example, I have a Python list a=["c_0","c_1","c_2"]. Now, how can I pass this to coalesce?

nonoDa Over a year ago

@Priyanshu I've edited the code to answer your question. Basically, you just have to pass the list and unpack it in coalesce and drop with *. If I helped you, please feel free to mark my answer as the accepted one!

Priyanshu Over a year ago

Thanks a lot for the help. It's working fine now. And yes, even if we simply pass *a to coalesce, it works fine. But simply passing *a to coalesce was not working two days ago. Just wondering if someone updated the coalesce function in Spark. Anyway, my problem is solved. Thanks once again.

s.polam · Accepted Answer · 2021-06-18 10:50:49Z

Check the Scala code below. It might help you.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.DataFrame

implicit class DFHelpers(df: DataFrame) {
   def mergeColumns() = {
       // Pair each original name with a unique name built from its position (a -> a0, b -> b1, ...)
       val renamed = df.columns.zipWithIndex.map { case (name, idx) => (name, s"$name$idx") }
       // Group the unique names by the full original name (not just its first character)
       // and coalesce each group. groupBy returns a Map, so output column order is not guaranteed.
       val columns = renamed
                        .groupBy(_._1)
                        .map { case (name, group) => s"coalesce(${group.map(_._2).mkString(",")}) as $name" }
                        .toSeq
       df.toDF(renamed.map(_._2):_*).selectExpr(columns:_*)
   }
}

// Exiting paste mode, now interpreting.

scala> df.show(false)
+----+----+----+----+----+----+
|a   |b   |a   |c   |a   |b   |
+----+----+----+----+----+----+
|4   |null|null|8   |null|21  |
|null|8   |7   |6   |null|null|
|96  |null|null|null|null|78  |
+----+----+----+----+----+----+

scala> df.printSchema
root
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)
 |-- a: string (nullable = true)
 |-- c: string (nullable = true)
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)

scala> df.mergeColumns.show(false)
+---+---+----+
|b  |a  |c   |
+---+---+----+
|21 |4  |8   |
|8  |7  |6   |
|78 |96 |null|
+---+---+----+

Hey, thanks. Looks good, but I'm unfamiliar with Scala. If anyone can please provide a similar solution for Python, it would be really helpful.
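
For reference, here is a rough PySpark sketch of the same idea as the Scala helper, for the case where the duplicated names and their counts are not known in advance. The names `plan_merge` and `merge_duplicate_columns` are hypothetical (they are not part of PySpark), and the sketch assumes column names contain no dots or backticks. It renames every column to a unique positional name with `toDF`, then coalesces each group back under its original name.

```python
def plan_merge(names):
    """Pure-Python helper (hypothetical name).

    Returns (unique_names, groups):
      unique_names: one unique name per column position, e.g. "c__1"
      groups: original name -> list of unique names to coalesce,
              in first-seen order of the original names
    """
    unique_names = []
    groups = {}
    for idx, name in enumerate(names):
        unique = f"{name}__{idx}"  # the position makes every name distinct
        unique_names.append(unique)
        groups.setdefault(name, []).append(unique)
    return unique_names, groups


def merge_duplicate_columns(df):
    """Merge same-named columns of a PySpark DataFrame (hypothetical name).

    For each group of duplicates, keeps the first non-null value per row.
    """
    # Imported here so plan_merge can be used without Spark installed
    from pyspark.sql import functions as F

    unique_names, groups = plan_merge(df.columns)
    renamed = df.toDF(*unique_names)  # duplicate names become selectable
    merged = [
        F.coalesce(*[F.col(c) for c in cols]).alias(name)
        for name, cols in groups.items()
    ]
    return renamed.select(*merged)
```

With the `df` from the PySpark answer, `merge_duplicate_columns(df).show()` should give one `a` column and one merged `c` column. Unlike the Scala version, the output keeps the first-seen column order, because a Python dict preserves insertion order.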
