I am trying to load a very large (~25 million row) CSV file into pandas. I read the file in chunks of 100,000 rows, and for each chunk I build a new dataframe that counts the occurrences of certain words in the comment column of that chunk, then concatenate the two side by side. The first chunk works properly: the chunk and the count dataframe end up side by side. The second chunk, however, is concatenated diagonally. By this I mean the result is 200,000 rows: the first 100,000 rows have empty chunk columns with the count dataframe next to them, and the last 100,000 rows have the chunk data with empty count columns. How do I concatenate each chunk side by side with its count dataframe and save each chunk to a separate CSV file?
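To make "diagonally" concrete, here is a small toy example (made-up data and words, not my real file) that produces the same layout I'm seeing:

import pandas as pd

# Stand-in for the second chunk: read_csv chunks keep the file-wide row numbers
chunk2 = pd.DataFrame({"comment": ["a b", "b c", "c d"]}, index=[3, 4, 5])
# Stand-in for the freshly built count dataframe: its index restarts at 0
counts = pd.DataFrame({"war": [1, 0, 0], "peace": [0, 1, 0]})

print(pd.concat([chunk2, counts], axis=1))
#   comment  war  peace
# 0     NaN  1.0    0.0
# ...
# 5     c d  NaN    NaN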
My code:
import pandas as pd

chunk = 1
for df in pd.read_csv('all_comments_data.csv', chunksize=100000):
    # Build a dict with one empty list per conflict word
    dict_to_append = {}
    with open('conflict_words.txt') as f:
        for word in f.readlines():
            dict_to_append[word.strip()] = []
    # Count each conflict word's occurrences in every comment of this chunk
    index = 0
    for comment in df['comment'].to_numpy():
        word_list = str(comment).split(" ")
        for conflict_word in dict_to_append.keys():
            dict_to_append[conflict_word].append(word_list.count(conflict_word))
        print(index)  # progress indicator
        index += 1
    df_to_append = pd.DataFrame(dict_to_append)
    # Concatenate the chunk and its count dataframe side by side
    final_df = pd.concat([df, df_to_append], axis=1)
    final_df.to_csv(f"all_comments_data_with_conflict_scores_{chunk}.csv")
    chunk += 1
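I suspect the row index that each chunk carries is involved, but I'm not sure. A quick check like this (just a sketch) should show whether the second chunk's index starts at 100000 rather than 0:

import pandas as pd

# Print the index range of the first two chunks
for i, df in enumerate(pd.read_csv('all_comments_data.csv', chunksize=100000)):
    print(i, df.index.min(), df.index.max())
    if i == 1:
        break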
What I need the dataframes to look like:
---------------------------
| | |
| chunk | new dframe |
| | |
---------------------------
What the dataframes look like for every chunk after the first:
---------------------------
| | |
| | new dframe |
| | |
---------------------------
| | |
| chunk | |
| | |
---------------------------