I am trying to load a very large (~25 million row) CSV file into pandas. I read the file in chunks of 100,000 rows, and for each chunk I build a new dataframe that counts the occurrences of certain words in the comment column of that chunk, then concatenate the two side by side. The first chunk works properly: the chunk and the count dataframe end up side by side. The second chunk, however, is concatenated diagonally. By this I mean the result is 200,000 rows: the first 100,000 rows have empty chunk columns with the count dataframe next to them, and the last 100,000 rows have the chunk data with empty count columns. How do I concatenate each chunk side by side with its count dataframe and save each chunk to a separate CSV file?
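To make "diagonally" concrete, here is a small toy example (made-up data and words, not my real file) that produces the same layout I'm seeing:

import pandas as pd

# Stand-in for the second chunk: read_csv chunks keep the file-wide row numbers
chunk2 = pd.DataFrame({"comment": ["a b", "b c", "c d"]}, index=[3, 4, 5])
# Stand-in for the freshly built count dataframe: its index restarts at 0
counts = pd.DataFrame({"war": [1, 0, 0], "peace": [0, 1, 0]})

print(pd.concat([chunk2, counts], axis=1))
#   comment  war  peace
# 0     NaN  1.0    0.0
# ...
# 5     c d  NaN    NaN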
My code:
import pandas as pd

chunk = 1
for df in pd.read_csv('all_comments_data.csv', chunksize=100000):
    # Build a dict with one empty list per conflict word
    dict_to_append = {}
    with open('conflict_words.txt') as f:
        for word in f.readlines():
            dict_to_append[word.strip()] = []
    # Count each conflict word's occurrences in every comment of this chunk
    index = 0
    for comment in df['comment'].to_numpy():
        word_list = str(comment).split(" ")
        for conflict_word in dict_to_append.keys():
            dict_to_append[conflict_word].append(word_list.count(conflict_word))
        print(index)  # progress indicator
        index += 1
    df_to_append = pd.DataFrame(dict_to_append)
    # Concatenate the chunk and its count dataframe side by side
    final_df = pd.concat([df, df_to_append], axis=1)
    final_df.to_csv(f"all_comments_data_with_conflict_scores_{chunk}.csv")
    chunk += 1
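I suspect the row index that each chunk carries is involved, but I'm not sure. A quick check like this (just a sketch) should show whether the second chunk's index starts at 100000 rather than 0:

import pandas as pd

# Print the index range of the first two chunks
for i, df in enumerate(pd.read_csv('all_comments_data.csv', chunksize=100000)):
    print(i, df.index.min(), df.index.max())
    if i == 1:
        break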
What I need the dataframes to look like:
---------------------------
| | |
| chunk | new dframe |
| | |
---------------------------
What the dataframes look like for every chunk after the first:
---------------------------
| | |
| | new dframe |
| | |
---------------------------
| | |
| chunk | |
| | |
---------------------------