2

I have a function like this:

  def number(row):
        if row['temp'] == '1 Person':
            return 'One'
        elif row['temp'] == '2 Persons':
            return 'Two'
        elif row['temp'] == '3 Persons':
            return 'Three'
        elif row['temp'] in ['4 Persons','5 Persons', '6 Persons', '7 Persons','8 Persons',
            '9 Persons','10 Persons','11 Persons']:
            return 'More'
        else:
            return None

Now, I want to change the values in my data frame looping through row-wise.

How can I loop through my data frame and replace the values according to the function above in Pyspark?

2 Answers 2

2
  1. Create a sample data frame with the data

     df_row = spark.createDataFrame(
     [("1 Person", "2", "3"),("2 Persons", "2", "3"),
      ("3 Persons", "2", "3"),("4 Persons", "2", "3"),
      ("5 Persons", "2", "3"),("6 Persons", "2", "3"),
      ("7 Persons", "2", "3"),("8 Persons", "2", "3"),
      ("9 Persons", "2", "3")],
      schema=['temp', 'col2', 'col3']
     )
    
  2. Define the function and also create the UDF function

     def number(row):
         if row == '1 Person':
             return 'One'
         elif row == '2 Persons':
             return 'Two'
         elif row == '3 Persons':
             return 'Three'
         elif row in ['4 Persons','5 Persons', '6 Persons', '7 Persons','8 Persons',
             '9 Persons','10 Persons','11 Persons']:
             return 'More'
         else:
             return None
    
     numberUDF = udf(lambda z: number(z),StringType())
    
  3. Rewrite the column function 'temp'

     df_row = df_row.withColumn('temp',numberUDF(col('temp')))
    

Please check the below output:

enter image description here

Sign up to request clarification or add additional context in comments.

Comments

1

The Python function can (almost identical) be translated into a native SQL statement:

from pyspark.sql import functions as F

df.withColumn('result', 
    F.expr("""case
                when temp == '1 Person' then 'One'
                when temp == '2 Persons' then 'Two'
                when temp == '3 Persons' then 'Three'                
                when temp in ('4 Persons', '5 Persons', '6 Persons', '7 Persons','8 Persons','9 Persons','10 Persons','11 Persons') then 'More'
                else null 
              end""")).show()

While an udf based solution offers more flexibility, the sql statements has a better performance as no Python code has to be run by Spark.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.