
I am trying to evaluate each field in the if statement below.

However, I am running into the following error: Method col([class java.util.ArrayList]) does not exist.

What I am trying to achieve: I am trying to evaluate two fields in my dataframe - Name and Surname, in a Python function. In these fields, I have NULL values. For each field, I would like to identify if NULL values exist.

I am loading various datasets with fields that should be evaluated from each set. I would like to pass these fields into the function to check if NULL values exist.

def identifyNull(Field):

Field = ['Name', 'Surname'] - this is an example of what I would like to pass to my function. 

for x in Field:
  if df.select().filter(col(Field).isNull()).count() > 0:
    print(Field)
  else:
    print('False')

df = the dataframe name for the data I am reading.

df structure:

Name Surname
John Doe
NULL James
Lisa NULL

Please note: I am completely new to Python and Spark.

  • What is df exactly? Commented Apr 8, 2022 at 14:45
  • @Stefan df = the dataframe name for the data I am reading. Hope that makes sense. Commented Apr 8, 2022 at 14:52
  • Did you mean if df.select().filter(col(x).isNull()).count() > 0: and then print(x)? (Otherwise, what would be the point of iterating over your Field list?) Commented Apr 8, 2022 at 14:54
  • @JNevill I would like to pass the two fields in the function. I am loading various datasets with fields that should be evaluated from each set. I would like to pass these fields into the function to check if NULL values exist. Commented Apr 8, 2022 at 15:03
  • Right. You want to pass each field, one at a time, into the function, but you are passing the whole list into the function. x is your field. Your for loop is saying "Take each item in this list called Field and call that item x". Many programming languages use the syntax For Each x in fields, which is a little clearer. Python just drops the Each so it isn't so verbose. Commented Apr 8, 2022 at 15:16

2 Answers


You're calling col(Field), where Field is a list, but col() expects a single column name. Since you're looping through the fields, use col(x) instead.

So it'd be something like this:

from pyspark.sql import functions as F

for x in Field:
    # Filter on the current column x, not on the whole Field list
    if df.where(F.col(x).isNull()).count() > 0:
        print(x)
    else:
        print('False')



Assuming

data = [["John", "Doe"], 
        [None, "James"],
        ["Lisa", None]]
Field = ["Name", "Surname"]
df = spark.createDataFrame(data, Field)
df.show()

returns:

+----+-------+
|Name|Surname|
+----+-------+
|John|    Doe|
|null|  James|
|Lisa|   null|
+----+-------+

Then

for x in Field:
    if df.select(x).where(x+" is null").count()>0:
        print(x)
    else:
        print(False)

returns:

Name
Surname
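
Since the question asks for a function that takes the list of fields, the same loop could be wrapped up roughly as follows. This is only a sketch: the name columns_with_nulls is made up for illustration, and it assumes df is a Spark DataFrame, relying on where() accepting a SQL expression string as it does in the loop above.

```python
def columns_with_nulls(df, fields):
    """Return the names in `fields` whose column in `df` holds at least one NULL."""
    found = []
    for x in fields:
        # x is one column name at a time; backticks guard names containing spaces
        condition = f"`{x}` IS NULL"
        # where() takes a SQL expression string; count() triggers the Spark job
        if df.where(condition).count() > 0:
            found.append(x)
    return found
```

With the example df above, columns_with_nulls(df, Field) should give ['Name', 'Surname']. Passing df in as an argument, rather than reading a global, makes it easier to reuse the function across the different datasets being loaded. Note that this runs one count per column, which may be slow on large data.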
