
I am trying to figure out how to dynamically create a column for each item in a list (the cp_codeset list in this case) by using the withColumn() function and calling a udf inside it in PySpark. Below is the code that I wrote, but it is giving me an error.

from pyspark.sql.functions import udf, col, lit
from pyspark.sql import Row
from pyspark.sql.types import IntegerType


codeset = set(cp_codeset['codes'])

for col_name in cp_codeset['col_names'].unique():
    def flag(d):
        if d in codeset:
            name = cp_codeset[cp_codeset['codes'] == d].col_names
            if name == col_name:
                return 1
            else:
                return 0

    cpf_udf = udf(flag, IntegerType())

    p.withColumn(col_name, cpf_udf(p.codes)).show()
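Aside from the Spark error, there is a Python pitfall lurking in this loop: a function defined inside a for loop closes over the loop *variable*, not its value at definition time, so every flag (and therefore every udf) would end up testing against the last col_name. A minimal, Spark-free sketch of the problem and the usual default-argument fix:

```python
# A function defined in a loop captures the loop variable itself, not its
# value at definition time, so all three closures see col_name == 'e'.
broken = []
for col_name in ['a', 'c', 'e']:
    def flag(d):
        return 1 if d == col_name else 0
    broken.append(flag)

print([f('a') for f in broken])   # [0, 0, 0] -- every closure sees 'e'

# Binding the current value as a default argument fixes it.
fixed = []
for col_name in ['a', 'c', 'e']:
    def flag(d, col_name=col_name):
        return 1 if d == col_name else 0
    fixed.append(flag)

print([f('a') for f in fixed])    # [1, 0, 0]
```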

The other option is to do it manually, but in that case I would have to write the same udf function and call it with withColumn() 75 times (the size of cp_codeset["col_names"]).

Below are my two dataframes and the result I am trying to get:

P (a PySpark dataframe; this dataframe is too large for pandas to handle)

id|codes
1|100
2|102
3|104

cp_codeset (pandas dataframe)

codes| col_names
100|a
101|b
102|c
103|d
104|e
105|f

result (pyspark dataframe)

id|codes|a|c|e
1 |100  |1|0|0
2 |102  |0|1|0
3 |104  |0|0|1
  • Possible duplicate of E-num / get Dummies in pyspark Commented Mar 27, 2017 at 17:54
  • Is that what you want? BTW why did you set [pandas] tag if you need a PySpark solution? It's confusing... Commented Mar 27, 2017 at 18:41
  • @MaxU Sorry for the confusion. I only have cp_codeset as a pandas dataframe and have P as a pyspark dataframe. I want the result as a pyspark dataframe. Thank you. Commented Mar 27, 2017 at 18:45
  • @Viraj, in this case you have two answers from piRSquared and from Boud - it should be pretty straightforward to generate Spark DataFrame from Pandas DF... Commented Mar 27, 2017 at 18:50

2 Answers


With this data filtered:

cp_codeset.set_index('codes').loc[p.codes]
Out[44]: 
      col_names
codes          
100           a
102           c
104           e

Simply use get_dummies:

pd.get_dummies(cp_codeset.set_index('codes').loc[p.codes])
Out[45]: 
       col_names_a  col_names_c  col_names_e
codes                                       
100              1            0            0
102              0            1            0
104              0            0            1
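For reference, here is the whole pipeline as a self-contained sketch, with a small pandas stand-in for P (the real P would still need to be brought over from Spark, e.g. via toPandas(), or the dummies converted back with spark.createDataFrame()):

```python
import pandas as pd

# Small stand-ins mirroring the frames in the question
# (in the question, p is a Spark DataFrame).
p = pd.DataFrame({'id': [1, 2, 3], 'codes': [100, 102, 104]})
cp_codeset = pd.DataFrame({'codes': [100, 101, 102, 103, 104, 105],
                           'col_names': list('abcdef')})

# Filter cp_codeset down to the codes that actually occur in p,
# then one-hot encode the matching column names.
dummies = pd.get_dummies(cp_codeset.set_index('codes').loc[p.codes])

# Strip the 'col_names_' prefix so the columns are just a, c, e.
dummies.columns = [c.replace('col_names_', '') for c in dummies.columns]

result = p.join(dummies.reset_index(drop=True).astype(int))
print(result)
#    id  codes  a  c  e
# 0   1    100  1  0  0
# 1   2    102  0  1  0
# 2   3    104  0  0  1
```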



I'd use get_dummies with join + map

m = cp_codeset.set_index('codes').col_names

P.join(pd.get_dummies(P.codes.map(m)))

   id  codes  a  c  e
0   1    100  1  0  0
1   2    102  0  1  0
2   3    104  0  0  1
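A self-contained version of the above (again with a small pandas stand-in for P, since pd.get_dummies and Series.map are pandas operations):

```python
import pandas as pd

P = pd.DataFrame({'id': [1, 2, 3], 'codes': [100, 102, 104]})
cp_codeset = pd.DataFrame({'codes': [100, 101, 102, 103, 104, 105],
                           'col_names': list('abcdef')})

# m maps each code to its column name: 100 -> a, 102 -> c, ...
m = cp_codeset.set_index('codes').col_names

out = P.join(pd.get_dummies(P.codes.map(m)).astype(int))
print(out)
#    id  codes  a  c  e
# 0   1    100  1  0  0
# 1   2    102  0  1  0
# 2   3    104  0  0  1
```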

2 Comments

This gives the error "TypeError: 'Column' object is not callable". If I write P.select("codes").map(m) then it gives "AttributeError: 'DataFrame' object has no attribute '_jdf'"
@Viraj why are you writing P.select? That is nowhere in my code.
