2

I have multiple pyspark dataframes that already exist. I need to add a new column to each dataframe. I can "hardcode" the solution and it works. However, when I try to use a for loop to add the column to all dataframes, I get an error. The error indicates that "withColumn" is not attribute of the dataframe. I don't understand this error.

What I've typed below is what I've tried. I'm a python and pyspark neophyte.

import pyspark class Row from module sql

from pyspark.sql import *

CREATE EXAMPLE DATA

----> Create some data for first dataframe

Labels = Row("firstName", "gender", "height", "age", "weight")
input1 = Row(firstname='Sam', gender='M', height = '77',age = '42',weight = '190')
input2 = Row(firstname='Diane', gender='F', height =  '70',age = '21', weight ='110')
input3 = Row(firstname='Norm',gender='M', height =  '68',age = '33',weight = '240')
input4 = Row(firstname='Carla', gender='F', height =  '60',age = '29',weight = '90')

----> Create the spark dataframe for first input

bios1 = Row(input1, input2, input3, input4)
bios1_sdf = spark.createDataFrame(bios1)

----> Create some more data

Labels = Row("firstName", "gender", "height", "age", "weight")
input1 = Row(firstname='Chandler', gender='M', height = '72',age = '109',weight = '270')
input2 = Row(firstname='Monica', gender='F', height =  '64',age = '10', weight ='123')
input3 = Row(firstname='Ross',gender='M', height =  '74',age = '59',weight = '168')
input4 = Row(firstname='Phoebe', gender='F', height =  '64',age = '2',weight = '20')

----> Create the spark dataframe for second input

bios2 = Row(input1, input2, input3, input4)
bios2_sdf = spark.createDataFrame(bios2)

HARD CODED METHOD

----> add bmi t dataframe one

bios1_sdf = bios1_sdf.withColumn('bmi',(bios1_sdf.weight/(bios1_sdf.height * bios1_sdf.height)*703))
print("data in bios1_sdf")
bios1_sdf.show()

----> add bmi to dataframe two

bios2_sdf = bios2_sdf.withColumn('bmi',(bios2_sdf.weight/(bios2_sdf.height * bios2_sdf.height)*703))
print("data in bios2_sdf")
bios2_sdf.show()

------> data in bios1_sdf

+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight|               bmi|
+---+---------+------+------+------+------------------+
| 42|      Sam|     M|    77|   190| 22.52825096980941|
| 21|    Diane|     F|    70|   110|15.781632653061223|
| 33|     Norm|     M|    68|   240|  36.4878892733564|
| 29|    Carla|     F|    60|    90|            17.575|
+---+---------+------+------+------+------------------+

------> data in bios2_sdf

+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight|               bmi|
+---+---------+------+------+------+------------------+
|109| Chandler|     M|    72|   270|36.614583333333336|
| 10|   Monica|     F|    64|   123|   21.110595703125|
| 59|     Ross|     M|    74|   168| 21.56756756756757|
|  2|   Phoebe|     F|    64|    20|      3.4326171875|
+---+---------+------+------+------+------------------+

DYNAMIC CODED METHOD

----> initialize dictionary

dict_of_df ={}

----> list of pyspark dataframes to add a column

list_of_sdf = [bios1_sdf, bios2_sdf]



for i in range(1,2+1):


    # name of dataframe
    key_name_in = 'bios'+str(i)+'_sdf'
    dict_of_df[key_name_in] = list_of_sdf[i-1]
    temp_sdf = dict_of_df[key_name_in]

    # add bmi
    dict_of_df[key_name_in] = temp_sdf.withcolumn('bmi',sdf.weight/(temp_sdf.height*temp_sdf.height)*703)

ERROR

I'm expecting to get these results. The same results as my "hardcoded" example.

-----> data in bios1_sdf

+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight|               bmi|
+---+---------+------+------+------+------------------+
| 42|      Sam|     M|    77|   190| 22.52825096980941|
| 21|    Diane|     F|    70|   110|15.781632653061223|
| 33|     Norm|     M|    68|   240|  36.4878892733564|
| 29|    Carla|     F|    60|    90|            17.575|
+---+---------+------+------+------+------------------+

-----> data in bios2_sdf

+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight|               bmi|
+---+---------+------+------+------+------------------+
|109| Chandler|     M|    72|   270|36.614583333333336|
| 10|   Monica|     F|    64|   123|   21.110595703125|
| 59|     Ross|     M|    74|   168| 21.56756756756757|
|  2|   Phoebe|     F|    64|    20|      3.4326171875|
+---+---------+------+------+------+------------------+
3
  • in your loop you've written withcolumn but it should be withColumn (capital C) Commented Aug 15, 2019 at 20:11
  • THANKS!! That worked. Commented Aug 15, 2019 at 20:22
  • Next time include the full traceback, which would have made this very easy to debug. Also don't do import *. Commented Aug 15, 2019 at 20:24

0

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.