
I am reading log files in my Python code which contain some nested JSON data. I have a nested for-loop structure with four for-loops, from which the values of certain keys are extracted and appended to a dataframe.

The nested for-loops are taking too much time, and I saw from other answers that multiprocessing is the way to go for nested for-loops, but I could not find an example for JSON data.

What is the best approach for this? Below is my code to extract data from the log files into dataframes. recommendation_list is a list of JSON objects.

    for recommendation in recommendation_list:
        # request entries carry the session and category ids
        if recommendation['type'] == "httpRequest":
            session_id = recommendation['query'].split('sessionId=')[1].split('&')[0]
            category_id = recommendation['query'].split('categoryId=')[1].split('&')[0]
        # response entries carry the user id, timestamp and recommended products
        if recommendation['type'] == "httpResponse":
            recommendation_count = recommendation_count + 1
            user_id = recommendation['userId']
            time_stamp = recommendation['ts']
            event_date = time_stamp.split("T")[0]
            time = time_stamp.split("T")[-1]
            try:
                product_list = json.loads(recommendation['body'])['products']
            except:
                product_list = []
            if len(product_list) > 0:
                for product in product_list:
                    product_id = product["id"]
                    if 'recommendationMeta' in product:
                        data_frame = data_frame.append({
                            "transaction_id": last_id,
                            "user_id": user_id,
                            "session_id": session_id,
                            "category_id": category_id,
                            "product_id": product_id,
                            "date": event_date,
                            "time": time[0:12],
                            "event": "recommendation",
                            "ab_bucket": "B",
                            "recommendation_count": recommendation_count,
                        }, ignore_index=True)

                        for learning_unit in product['recommendationMeta']:
                            lu_name = learning_unit['lu']
                            lu_value = learning_unit['value']
                            recommendation_mode = learning_unit['recommendationMode']
                            prod_def1 = products[(products["product_id"] == product_id) &
                                                 (products["lu_value"].str.lower() == lu_value)]
                            if len(prod_def1) != 0:
                                product_list = prod_def1.to_dict('records')

                                for product_id in product_list:
                                    category = categories[(categories["category_def_id"] == product_id["category_def_id"]) &
                                                          (categories["lu_name"].str.lower() == lu_name)]
                                    if len(category) != 0:
                                        product_def_id = product_id['product_def_id']
                                        lu_df = lu_df.append({
                                            "lu_data_id": lu_id,
                                            "product_def_id": product_def_id,
                                            "transaction_id": last_id,
                                            "rec_mode": recommendation_mode,
                                        }, ignore_index=True)
                                        lu_id = lu_id + 1
                        last_id = last_id + 1
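
For context, each element of recommendation_list is a parsed JSON object shaped roughly like the sketch below (the field names are the ones my code reads; the values are made up):

    recommendation_list = [
        {   # request entry: session and category ids are embedded in the query string
            "type": "httpRequest",
            "query": "sessionId=abc123&categoryId=42&page=1",
        },
        {   # response entry: 'body' is a JSON string whose 'products' list is parsed with json.loads
            "type": "httpResponse",
            "userId": "u-001",
            "ts": "2020-02-20T10:15:30.123Z",
            "body": '{"products": [{"id": "p-9", "recommendationMeta": '
                    '[{"lu": "colour", "value": "red", "recommendationMode": "default"}]}]}',
        },
    ]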

I figure that the innermost for-loop executes the most times, so I decided to use multiprocessing for it.

I replaced

    for product_id in product_list:
        category = categories[(categories["category_def_id"] == product_id["category_def_id"]) &
                              (categories["lu_name"].str.lower() == lu_name)]
        if len(category) != 0:
            product_def_id = product_id['product_def_id']
            lu_df = lu_df.append({
                "lu_data_id": lu_id,
                "product_def_id": product_def_id,
                "transaction_id": last_id,
                "rec_mode": recommendation_mode,
            }, ignore_index=True)
            lu_id = lu_id + 1

with this...

    for product_id in product_list:
        pool = Pool()  # Create a multiprocessing Pool
        data = pool.starmap(create_lu_data, [last_id, categories, recommendation_mode,
                                             lu_name, lu_df, lu_id, product_id])
        lu_id = lu_id + 1
        pool.close()
        print(data)

where create_lu_data is

    def create_lu_data(last_id, categories, recommendation_mode, lu_name, lu_df, lu_id, product_id):
        category = categories[(categories["category_def_id"] == product_id["category_def_id"]) &
                              (categories["lu_name"].str.lower() == lu_name)]
        if len(category) != 0:
            product_def_id = product_id['product_def_id']
            lu_df = lu_df.append({
                "lu_data_id": lu_id,
                "product_def_id": product_def_id,
                "transaction_id": last_id,
                "rec_mode": recommendation_mode,
            }, ignore_index=True)
        return lu_df

I didn't get any errors, but the output dataframe has several times the expected number of rows.
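
For reference, my reading of the multiprocessing documentation is that Pool.starmap expects an iterable of argument tuples (one tuple per call) and returns a list with one result per tuple. A minimal sketch of that pattern with my variable names would look something like the code below, assuming create_lu_data were changed to drop the lu_df and lu_id parameters and return a row dict (or None) instead of appending; I have not run this:

    from multiprocessing import Pool

    # one argument tuple per product; lu_df and lu_id stay in the parent process,
    # and each worker call returns a plain dict (or None) instead of appending to a shared dataframe
    args = [(last_id, categories, recommendation_mode, lu_name, product_id)
            for product_id in product_list]

    with Pool() as pool:
        rows = pool.starmap(create_lu_data, args)  # one result per argument tuple

    for row in rows:
        if row is not None:
            row["lu_data_id"] = lu_id
            lu_df = lu_df.append(row, ignore_index=True)
            lu_id = lu_id + 1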

  • Can you please add the stack trace, i.e. what line of code did that error come from? Commented Feb 28, 2020 at 14:08
  • @Paul, I have updated the question. I fixed the error but the output is wrong. Commented Mar 2, 2020 at 5:29
  • Pool can be used to parallel-execute the same logic with different inputs. I don't see what input ranges are being handled here. You are creating a Pool with a single task inside the loop, if I understand correctly. Commented Mar 2, 2020 at 5:39
