udf to parse string json in pyspark dataframe

Question

I have a PySpark DataFrame with a column that contains JSON strings. It looks like this:

+-------------------------------------------------------------+
|col                                                          |
+-------------------------------------------------------------+
|{"fields":{"list1":[{"list2":[{"list3":[{"type":false}]}]}]}}|
+-------------------------------------------------------------+

I wrote UDFs to parse the JSON, count the values that match phone, and return the count in a new column of the DataFrame:

import json
from pyspark.sql import functions as F, types as t

def item_count(json, type):
    count = 0
    for i in json.get("fields", {}).get("list1", []):
        for j in i.get("list2", []):
            for k in j.get("list3", []):
                count += k.get("type", None) == type
    return count

def item_phone_count(json):
    return item_count(json, False)

df2 = df.withColumn(
    'item_phone_count',
    F.udf(lambda j: item_phone_count(json.loads(j)), t.StringType())('col'))

But I got the error:

AttributeError: 'NoneType' object has no attribute 'get'

Any idea what's wrong?

It looks like one of your variables in item_count() is None, but there is no way to figure out which one from the information you've posted. Please post the full error traceback and a minimal reproducible example with enough information so that someone else can reproduce your error.
– Craig, Commented Dec 11, 2020 at 23:04
That is a possible cause of the error that you are seeing. Try printing them in the loop to see if one of them is None.
– Craig, Commented Dec 13, 2020 at 19:01
@Craig how can I print it, since I am calling the UDF from the PySpark DataFrame?
– kihhfeue, Commented Dec 13, 2020 at 19:42
@kihhfeue try taking a few entries from your DataFrame, passing them to the function manually, and seeing what happens.
– mck, Commented Dec 13, 2020 at 19:53

mck · Accepted Answer · 2020-12-13 20:06:54Z

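Check for None at each level and skip those entries:

def item_count(json, type):
    count = 0
    # Nothing to count if the parsed value is None (JSON null)
    # or if "fields" is explicitly set to null.
    if (json is None) or (json.get("fields", {}) is None):
        return count

    for i in json.get("fields", {}).get("list1", []):
        if i is None:
            continue
        for j in i.get("list2", []):
            if j is None:
                continue
            for k in j.get("list3", []):
                if k is None:
                    continue
                # The comparison yields a bool; True adds 1, False adds 0.
                count += k.get("type", None) == type
    return count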
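
The same idea can be tried outside Spark first, as suggested in the comments on the question. The following is only a minimal sketch, assuming rows may be null, may hold invalid JSON, or may have nested keys set to null. The names `safe_item_count`, `_as_dict`, and `_as_list` are hypothetical helpers made up for this illustration, not part of the original code:

```python
import json

def _as_dict(value):
    # Treat anything that is not a dict (including None) as an empty dict.
    return value if isinstance(value, dict) else {}

def _as_list(value):
    # Treat anything that is not a list (including None) as an empty list.
    return value if isinstance(value, list) else []

def safe_item_count(raw, wanted):
    """Count list3 entries whose "type" equals `wanted` in a JSON string.

    Returns 0 for null input, invalid JSON, or missing/null nested keys.
    """
    try:
        doc = _as_dict(json.loads(raw))
    except (TypeError, ValueError):
        # TypeError: raw is None or not a string; ValueError: invalid JSON.
        return 0
    count = 0
    for i in _as_list(_as_dict(doc.get("fields")).get("list1")):
        for j in _as_list(_as_dict(i).get("list2")):
            for k in _as_list(_as_dict(j).get("list3")):
                if _as_dict(k).get("type") == wanted:
                    count += 1
    return count

# Made-up sample rows, checked locally without a Spark session.
samples = [
    '{"fields":{"list1":[{"list2":[{"list3":[{"type":false}]}]}]}}',
    None,
    '{"fields": null}',
    '{"fields":{"list1":[null,{"list2":null}]}}',
    'not json',
]
results = [safe_item_count(s, False) for s in samples]
print(results)
```

To use it from Spark, the function could be wrapped as `F.udf(lambda s: safe_item_count(s, False), t.IntegerType())`, assuming the same `F` and `t` aliases as in the question.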


2 Comments

kihhfeue Over a year ago

The error is gone now, but I still get a count of 0, even though I checked the original JSON and there is definitely a value that matches the condition. Does json.loads change the format of type?

kihhfeue Over a year ago

I made some edits to the question. The type value is false.
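
On the question of whether `json.loads` changes the value: it maps the JSON literal `false` to Python `False`, so a comparison against `False` should match in that case. A count of 0 could instead arise if the stored value is the string `"false"`, though that is only one possible explanation and the thread does not confirm it. A small sketch with made-up sample data:

```python
import json

# JSON literals map to Python values: false -> False, true -> True, null -> None.
# The keys "a" to "d" are made-up sample data for illustration.
parsed = json.loads('{"a": false, "b": "false", "c": 0, "d": null}')

bool_matches = parsed["a"] == False    # JSON false becomes Python False
str_matches = parsed["b"] == False     # the string "false" is not a boolean
zero_matches = parsed["c"] == False    # 0 == False holds in Python
null_matches = parsed["d"] == False    # None is not equal to False

print(bool_matches, str_matches, zero_matches, null_matches)
```

Checking the type of the parsed value on a few real rows, for example with `type(value)`, would show which case applies.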
