1

I am trying to load a very large (~25 million rows) csv file into pandas. I am chunking this file, 100,000 rows at a time, and appending a newly created dataframe that essentially counts the occurrences of certain words within a column of the chunked dataframe. When I save the first chunk, everything works properly, and the chunk is concatenated to the newly created dataframe side-by-side. The second chunk, however, is concatenated diagonally for some reason. By this I mean that the chunked dataframe is now 200,000 rows, with the first 100,000 rows being empty, and the newly created dataframe is concatenated side-by-side with the first 100,000 rows. How do I fix this problem and concatenate each chunk side by side with the newly created dataframe and save each chunk into separate csv files?

My code:

import pandas as pd
from pandas.core.frame import DataFrame

chunk = 1

for df in pd.read_csv('all_comments_data.csv', chunksize=100000):
    dict_to_append = {}

    with open('conflict_words.txt') as f:
        for word in f.readlines():
            dict_to_append[word.strip()] = []
    
    index = 0

    for comment in df['comment'].to_numpy():
        word_list = str(comment).split(" ")
        for conflict_word in dict_to_append.keys():
            dict_to_append[conflict_word].append(word_list.count(conflict_word))
        print(index)
        index +=1


    df_to_append = pd.DataFrame(dict_to_append)
    final_df = pd.concat([pd.DataFrame(df), df_to_append], axis=1)
    final_df.to_csv(f"all_comments_data_with_conflict_scores_{chunk}.csv")
    chunk += 1 
What I need dataframes to look like:

---------------------------
|            |            |
|  chunk     | new dframe |
|            |            |
---------------------------

What the dataframes look like after the first chunk:

---------------------------
|            |            |
|            | new dframe |
|            |            |
---------------------------
|            |            |
| chunk      |            |
|            |            |
---------------------------
2
  • Can you attach some sample data (like 5-10 rows atleast)? Commented Jan 5, 2022 at 1:51
  • I attached a little diagram, does that help? Commented Jan 5, 2022 at 2:05

1 Answer 1

1

When running pd.concat column-wise, pandas will try and match rows by their index:

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[1, 3, 5])
df2 = pd.DataFrame({"A": [4, 5, 6], "B": [7, 8, 9]}, index=[2, 4, 6])
df3 = pd.concat([df1, df2], axis=1)
print(df3)

     A    B    A    B
1  1.0  4.0  NaN  NaN
2  NaN  NaN  4.0  7.0
3  2.0  5.0  NaN  NaN
4  NaN  NaN  5.0  8.0
5  3.0  6.0  NaN  NaN
6  NaN  NaN  6.0  9.0

If you want chunk and new dframe to sit side-by-side after a concat, you will need to ensure they both have the same row indices.

Sign up to request clarification or add additional context in comments.

1 Comment

This was the right answer, thank you.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.