0

I have a dataframe where column has an array and each element is a dictionary.

class product
{"deleteDate": null, "class":"AB", "validFrom": "2022-09-01", "validTo": "2009-08-31"}, {"deleteDate": null, "class":"CD", "validFrom": "2009-09-01", "validTo": "2024-08-31"} {"deleteDate": "2021-09-01", "class":"AB", "validFrom": "2003-09-01", "validTo": "2009-03-01"}, {"deleteDate": null, "class":"CD", "validFrom": "2009-09-01", "validTo": "2024-08-31"}

I am trying to filter an element base on a few conditions.

def getelement(value,entity):
    list_url = []
    for i in range(len(value)):
        if value[i] is not None and (value[i].deleteDate is None):
            if (value[i].validFrom <= (Date of Today)) & (value[i].validFrom >= (date of today)):

            list_url.append(value[i].entity)
    if list_url:
        return str(list_url[-1])
    if not list_url:
        return None

    
udfgeturl=F.udf(lambda z: getelement(z) if not z is None else "" , StringType() )



master = df.withColumn( 'ClassName', udfgeturl('Class')) 

The function takes two elements, value and entity. where value refers to column name and entity refers to a key in dictionary for which I want to save the result. The code works with one element getelement(value) for UDF but I do not know how the UDF can take two arguments, any suggestion, please?

2 Answers 2

1

To improve the performance (Spark functions vs UDF performance?), you could use only spark transformations:
I'm assuming (value[i].validFrom >= (date of today)) is supposed to actually be (value[i].validTo >= (date of today))

import pyspark.sql.functions as f

def getelement(value, entity):

    df = (
        df
        .withColumn('output', f.expr(f'filter({value}, element -> (element.deleteDate is null) AND (element.validFrom <= current_date()) AND (element.validTo >= current_date()))')[entity][-1])
    )

    return df
Sign up to request clarification or add additional context in comments.

2 Comments

Thank you it worked, except I needed to change "&" with AND.
You’re right; fixed.
0

You can use a struct to bundle the parameters into 1 object. Then access the elements of the struct with . operator.

Code example:

def getelement(object):
  value = object.value
  entity = object.entity
  return str( entity + " " + value )

udfgeturl=f.udf(getelement , StringType() )
df.select( 
 udfgeturl( 
  f.struct( 
   f.col("col1").alias("value"), 
   f.col("col2").alias("entity")) 
  ) 
 ).show()

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.