I have a PySpark dataframe in which some of the columns have the same name. I want to merge all columns that share a name into a single column. For example, given this input dataframe:

[Image: input dataframe]

How can I do this in pyspark? Any help would be highly appreciated.

  • Are duplicate column names allowed in a dataframe? Commented Jun 18, 2021 at 3:12
  • Yes. Because of some operations such as column renaming, the dataframe has duplicate columns. Commented Jun 18, 2021 at 3:42
  • Duplicated columns are not selectable. You need to rework the prior processing steps to ensure column names are not duplicated. Commented Jun 18, 2021 at 8:19

2 Answers

Edited to answer the OP's request to coalesce from a list.

Here's a reproducible example:

    import pyspark.sql.functions as F

    df = spark.createDataFrame([
        ("z","a", None, None),
        ("b",None,"c", None),
        ("c","b", None, None),
        ("d",None, None, "z"),
    ], ["a","c", "c","c"])
    
    df.show()
    
    #fix duplicated column names
    old_col=df.schema.names
    running_list=[]
    new_col=[]
    i=0
    for column in old_col:
        if(column in running_list):
            new_col.append(column+"_"+str(i))
            i=i+1
        else:
            new_col.append(column)
            running_list.append(column)
    print(new_col)
    
    df1 = df.toDF(*new_col)
    
    # build the list of columns to coalesce and the list of columns to drop
    a=['c','c_0','c_1']
    to_drop=['c_0','c_1']
    b=[df1[col] for col in a]

    # coalesce the columns into one, then drop the renamed duplicates
    df_merged=df1.withColumn('c',F.coalesce(*b)).drop(*to_drop)

    df_merged.show()

Output:

+---+----+----+----+
|  a|   c|   c|   c|
+---+----+----+----+
|  z|   a|null|null|
|  b|null|   c|null|
|  c|   b|null|null|
|  d|null|null|   z|
+---+----+----+----+

['a', 'c', 'c_0', 'c_1']

+---+---+
|  a|  c|
+---+---+
|  z|  a|
|  b|  c|
|  c|  b|
|  d|  z|
+---+---+
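
If the duplicated names and how often they repeat are not known up front, the lists passed to `coalesce` and `drop` can be derived from the schema instead of being typed by hand. The following is only a sketch under assumptions: `plan_renames` and `merge_duplicates` are hypothetical helper names (not part of PySpark), and it assumes that generated names such as `c_0` do not already exist in the dataframe.

```python
def plan_renames(old_names):
    """Same renaming scheme as the loop above, plus a record of which
    new names belong to each original name (first occurrence first)."""
    new_names = []
    groups = {}  # original name -> list of new names
    i = 0
    for name in old_names:
        if name in groups:
            renamed = f"{name}_{i}"
            i += 1
            new_names.append(renamed)
            groups[name].append(renamed)
        else:
            new_names.append(name)
            groups[name] = [name]
    return new_names, groups


def merge_duplicates(df):
    """Hypothetical helper: coalesce every group of duplicated columns."""
    # imported here so plan_renames can be tried without a Spark install
    from pyspark.sql import functions as F

    new_names, groups = plan_renames(df.schema.names)
    merged = df.toDF(*new_names)
    for name, members in groups.items():
        if len(members) > 1:
            # keep the first column's name, drop the renamed duplicates
            cols = [F.col(m) for m in members]
            merged = merged.withColumn(name, F.coalesce(*cols)).drop(*members[1:])
    return merged


new_names, groups = plan_renames(["a", "c", "c", "c"])
print(new_names)
print(groups)
```

With the example dataframe above, `merge_duplicates(df).show()` should produce the same two-column result as the hand-written version, since the renaming scheme is unchanged.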

4 Comments

Thanks for the suggestion. The problem here is that I don't know how many columns are duplicated or how many times each is duplicated. So, when using coalesce, what should I pass to coalesce and drop?
Actually, I'm able to collect the column names I want to merge in a list. For example, I have a Python list a=["c_0","c_1","c_2"]. Now, how can I pass this to coalesce?
@Priyanshu I've edited the code to answer your question. Basically you just have to pass the list and unpack it in coalesce and drop with *. If I helped you, please feel free to mark my answer as the accepted one!
Thanks a lot for the help. It's working fine now. And yes, even if we simply pass *a to coalesce, it works fine. But simply passing *a to coalesce was not working two days ago. Just wondering if someone updated the coalesce function in Spark. Anyway, my problem is solved. Thanks once again.

Check the Scala code below. It might help you.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.annotation.tailrec
import scala.util.Try

implicit class DFHelpers(df: DataFrame) {
   def mergeColumns() = {
       val dupColumns = df.columns
       val newColumns = dupColumns.zipWithIndex.map(c => s"${c._1}${c._2}")
       val columns = newColumns
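                        // note: c(0) is the first character of the new name, so this grouping assumes single-character column names
                        .map(c => (c(0),c))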
                        .groupBy(_._1)
                        .map(c => (c._1,c._2.map(_._2)))
                        .map(c => s"""coalesce(${c._2.mkString(",")}) as ${c._1}""")
                        .toSeq
       df.toDF(newColumns:_*).selectExpr(columns:_*)
   }
}

// Exiting paste mode, now interpreting.
scala> df.show(false)
+----+----+----+----+----+----+
|a   |b   |a   |c   |a   |b   |
+----+----+----+----+----+----+
|4   |null|null|8   |null|21  |
|null|8   |7   |6   |null|null|
|96  |null|null|null|null|78  |
+----+----+----+----+----+----+
scala> df.printSchema
root
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)
 |-- a: string (nullable = true)
 |-- c: string (nullable = true)
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)

scala> df.mergeColumns.show(false)
+---+---+----+
|b  |a  |c   |
+---+---+----+
|21 |4  |8   |
|8  |7  |6   |
|78 |96 |null|
+---+---+----+
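
A rough Python counterpart of the same idea (give every column a position-based name, then build one `coalesce(...) as name` expression per original name for `selectExpr`) might look like the sketch below. The helper names `coalesce_exprs` and `merge_columns` are made up for illustration. It groups by the full original name rather than by its first character, and it assumes column names contain no backticks.

```python
def coalesce_exprs(names):
    """Return (indexed_names, exprs): unique position-based names and
    one SQL coalesce expression per distinct original name."""
    indexed = [f"{name}_{i}" for i, name in enumerate(names)]
    grouped = {}  # original name -> its position-based names, in order
    for name, new_name in zip(names, indexed):
        grouped.setdefault(name, []).append(new_name)
    exprs = []
    for name, cols in grouped.items():
        quoted = ", ".join(f"`{c}`" for c in cols)
        exprs.append(f"coalesce({quoted}) as `{name}`")
    return indexed, exprs


def merge_columns(df):
    """Hypothetical helper mirroring the rename-then-selectExpr approach."""
    indexed, exprs = coalesce_exprs(df.columns)
    return df.toDF(*indexed).selectExpr(*exprs)


indexed, exprs = coalesce_exprs(["a", "b", "a", "c", "a", "b"])
print(indexed)
for e in exprs:
    print(e)
```

Because a Python dict keeps insertion order, the merged columns come out in order of first appearance (`a`, `b`, `c` for the dataframe above) rather than in the map order seen in the Scala output.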

1 Comment

Hey, thanks. Looks good, but I'm unfamiliar with Scala. If anyone can provide a similar solution for Python, it would be really helpful.
