
I am trying to figure out how to dynamically create a column for each item in a list (the cp_codeset list in this case) by using the withColumn() function and calling a udf inside it in PySpark. Below is the code that I wrote, but it is giving me an error.

from pyspark.sql.functions import udf, col, lit
from pyspark.sql import Row
from pyspark.sql.types import IntegerType


codeset = set(cp_codeset['codes'])

for col_name in cp_codeset['col_names'].unique():
    def flag(d):
        if d in codeset:
            name = cp_codeset[cp_codeset['codes'] == d].col_names
            if name == col_name:
                return 1
            else:
                return 0

    cpf_udf = udf(flag, IntegerType())

    p.withColumn(col_name, cpf_udf(p.codes)).show()
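Aside from the Spark error, there is a Python pitfall lurking in this loop: a function defined inside a for loop closes over the loop *variable*, not its value at definition time, so every flag (and therefore every udf) would end up testing against the last col_name. A minimal, Spark-free sketch of the problem and the usual default-argument fix:

```python
# A function defined in a loop captures the loop variable itself, not its
# value at definition time, so all three closures see col_name == 'e'.
broken = []
for col_name in ['a', 'c', 'e']:
    def flag(d):
        return 1 if d == col_name else 0
    broken.append(flag)

print([f('a') for f in broken])   # [0, 0, 0] -- every closure sees 'e'

# Binding the current value as a default argument fixes it.
fixed = []
for col_name in ['a', 'c', 'e']:
    def flag(d, col_name=col_name):
        return 1 if d == col_name else 0
    fixed.append(flag)

print([f('a') for f in fixed])    # [1, 0, 0]
```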

The other option is to do it manually, but in that case I would have to write the same udf function and call it with withColumn() 75 times (the size of cp_codeset["col_names"]).

Below are my two dataframes and the result I am trying to get:

P (a PySpark dataframe; this dataframe is too large for pandas to handle)

id|codes
1|100
2|102
3|104

cp_codeset (pandas dataframe)

codes| col_names
100|a
101|b
102|c
103|d
104|e
105|f

result (pyspark dataframe)

id|codes|a|c|e
1 |100  |1|0|0
2 |102  |0|1|0
3 |104  |0|0|1
  • Possible duplicate of E-num / get Dummies in pyspark Commented Mar 27, 2017 at 17:54
  • Is that what you want? BTW why did you set [pandas] tag if you need a PySpark solution? It's confusing... Commented Mar 27, 2017 at 18:41
  • @MaxU Sorry for the confusion. I only have cp_codeset as a pandas dataframe and have P as a pyspark dataframe. I want the result as a pyspark dataframe. Thank you. Commented Mar 27, 2017 at 18:45
  • @Viraj, in this case you have two answers from piRSquared and from Boud - it should be pretty straightforward to generate Spark DataFrame from Pandas DF... Commented Mar 27, 2017 at 18:50

2 Answers


With this data filtered:

cp_codeset.set_index('codes').loc[p.codes]
Out[44]: 
      col_names
codes          
100           a
102           c
104           e

Simply use get_dummies:

pd.get_dummies(cp_codeset.set_index('codes').loc[p.codes])
Out[45]: 
       col_names_a  col_names_c  col_names_e
codes                                       
100              1            0            0
102              0            1            0
104              0            0            1
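For reference, here is the whole pipeline as a self-contained sketch, with a small pandas stand-in for P (the real P would still need to be brought over from Spark, e.g. via toPandas(), or the dummies converted back with spark.createDataFrame()):

```python
import pandas as pd

# Small stand-ins mirroring the frames in the question
# (in the question, p is a Spark DataFrame).
p = pd.DataFrame({'id': [1, 2, 3], 'codes': [100, 102, 104]})
cp_codeset = pd.DataFrame({'codes': [100, 101, 102, 103, 104, 105],
                           'col_names': list('abcdef')})

# Filter cp_codeset down to the codes that actually occur in p,
# then one-hot encode the matching column names.
dummies = pd.get_dummies(cp_codeset.set_index('codes').loc[p.codes])

# Strip the 'col_names_' prefix so the columns are just a, c, e.
dummies.columns = [c.replace('col_names_', '') for c in dummies.columns]

result = p.join(dummies.reset_index(drop=True).astype(int))
print(result)
#    id  codes  a  c  e
# 0   1    100  1  0  0
# 1   2    102  0  1  0
# 2   3    104  0  0  1
```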



I'd use get_dummies with join + map

m = cp_codeset.set_index('codes').col_names

P.join(pd.get_dummies(P.codes.map(m)))

   id  codes  a  c  e
0   1    100  1  0  0
1   2    102  0  1  0
2   3    104  0  0  1
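A self-contained version of the above (again with a small pandas stand-in for P, since pd.get_dummies and Series.map are pandas operations):

```python
import pandas as pd

P = pd.DataFrame({'id': [1, 2, 3], 'codes': [100, 102, 104]})
cp_codeset = pd.DataFrame({'codes': [100, 101, 102, 103, 104, 105],
                           'col_names': list('abcdef')})

# m maps each code to its column name: 100 -> a, 102 -> c, ...
m = cp_codeset.set_index('codes').col_names

out = P.join(pd.get_dummies(P.codes.map(m)).astype(int))
print(out)
#    id  codes  a  c  e
# 0   1    100  1  0  0
# 1   2    102  0  1  0
# 2   3    104  0  0  1
```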

2 Comments

This gives the error "TypeError: 'Column' object is not callable". If I write P.select("codes").map(m) then it gives "AttributeError: 'DataFrame' object has no attribute '_jdf'"
@Viraj why are you writing P.select? That is nowhere in my code.
