I would like to merge some pandas DataFrames by rows using df.append(). The code below reads every JSON file in the input directory json_dir_path. From each one it takes input_fn = json_data["accPreparedCSVFileName"], which holds the full path of a CSV file, and reads that file into the DataFrame df_i. When I then merge with df_output = df_i.append(df_output), I do not get the desired result.

    def __merge(self, json_dir_path):
        if os.path.exists(json_dir_path):
            filelist = [f for f in os.listdir(json_dir_path)]

            df_output = pd.DataFrame()
            for json_fn in filelist:
                json_full_name = os.path.join(json_dir_path, json_fn)
                # print("[TrainficationWorkflow::__merge] We are merging the json file ", json_full_name)
                if os.path.exists(json_full_name):
                    with open(json_full_name, 'r') as in_json_file:
                        json_data = json.load(in_json_file)
                        input_fn = json_data["accPreparedCSVFileName"]
                        df_i = pd.read_csv(input_fn)
                        df_output = df_i.append(df_output)
            return df_output
        else:
            return pd.DataFrame(data=[], columns=self.DATA_FORMAT)

Only 2 of the 12 files end up merged. What am I doing wrong?

Any help would be greatly appreciated.

Best Regards, Carlo

2 Answers

You can also set ignore_index=True when appending.

    df_output = df_i.append(df_output, ignore_index=True)

You can also concatenate the dataframes:

    df_output = pd.concat((df_output, df_i), axis=0, ignore_index=True)

As @jpp suggested in his answer, you can collect the dataframes in a list and concatenate them in one go.
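To illustrate what ignore_index=True changes, here is a minimal sketch with two small made-up frames (df_a and df_b are hypothetical names, not from the question). Without the flag, each frame keeps its own row labels, so labels repeat in the result; with it, the rows are renumbered from 0.

```python
import pandas as pd

# Two tiny illustrative frames; each has the default index 0, 1.
df_a = pd.DataFrame({"x": [1, 2]})
df_b = pd.DataFrame({"x": [3, 4]})

# Default behaviour: the original row labels are kept, so they repeat.
kept = pd.concat((df_a, df_b), axis=0)

# With ignore_index=True the result gets a fresh 0..n-1 index.
renumbered = pd.concat((df_a, df_b), axis=0, ignore_index=True)

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```

The number of rows is the same in both cases; only the index labels differ.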


7 Comments

Thanks for your answer. Sure. Do you agree that my code is also correct?
Yes, your code looks correct from what I can see. Can you print json_full_name inside the loop and check that all 12 file names are printed?
Yes, I did, and they are. I also compared the number of rows I get with your approach and with mine, and they are the same.
And the number of rows does not match the sum of the rows from the individual files?
It does. That is why I came to the conclusion that my code is also correct.

I strongly recommend you do not concatenate dataframes in a loop.

It is much more efficient to store your dataframes in a list, then concatenate items of your list in one call. For example:

    lst = []

    # input_fn is taken here to be a list of CSV paths, not a single path
    for fn in input_fn:
        lst.append(pd.read_csv(fn))

    df_output = pd.concat(lst, ignore_index=True)
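Applied to the directory layout described in the question, the list-then-concat approach might look like the sketch below. The function name merge_prepared_csvs and its columns parameter are hypothetical; only the "accPreparedCSVFileName" key comes from the question. It assumes every file in the directory is valid JSON containing that key. Note also that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is the call that still works on current versions.

```python
import json
import os

import pandas as pd


def merge_prepared_csvs(json_dir_path, columns=None):
    """Hypothetical helper: concatenate the CSVs referenced by each JSON file.

    Assumes every regular file in json_dir_path is JSON with an
    "accPreparedCSVFileName" entry holding the path of a CSV file.
    """
    if not os.path.isdir(json_dir_path):
        return pd.DataFrame(columns=columns)

    frames = []
    # sorted() only makes the row order reproducible across runs.
    for json_fn in sorted(os.listdir(json_dir_path)):
        json_full_name = os.path.join(json_dir_path, json_fn)
        if not os.path.isfile(json_full_name):
            continue
        with open(json_full_name, "r") as in_json_file:
            json_data = json.load(in_json_file)
        frames.append(pd.read_csv(json_data["accPreparedCSVFileName"]))

    # pd.concat raises ValueError on an empty list, so guard that case.
    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)
```

Collecting the frames first and calling pd.concat once avoids copying the accumulated data on every iteration, which is what happens when a DataFrame is grown inside the loop.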
