
I'm making a file compiler that takes data from CSV files and, after every 8 files read, concatenates the data into a separate CSV file. The program is taking quite a long time to do this. Is there a more optimized way to go about it?

I'm currently reading each CSV file into a pandas DataFrame, then appending the DataFrames to a list so I can combine them with pd.concat() afterwards.

Edit: The inputs to the pd.read_csv call are the root directory and the name of the file being read, since I'm using os.walk to jump from folder to folder. Each folder contains an inconsistent number of CSV files storing data for a model's MSE, RMSE, and MAE. The reason I'm using a DataFrame is that I'm trying to use the data in each CSV file for further analysis (the reason it concatenates every 8 files is that each model has 8 outputs). All CSV files have one header row and are 6 columns by 5 rows.

code snippet:

import os
import pandas as pd

data = []

# Reading the file (tab-separated) into a DataFrame
data_value = pd.read_csv(os.path.join(root, file), sep='\t')

# Appending the DataFrame to a list
data.append(data_value)

# Concatenating all DataFrames in the list into one DataFrame
pd.concat(data)
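Given the setup described in the edit, one way to structure the whole walk is a sketch like the following. This is a hedged example, not the poster's actual program: the function name `compile_batches`, the `batch_size` default of 8, and the tab separator are assumptions taken from the question.

```python
import os
import pandas as pd

def compile_batches(top_dir, batch_size=8, sep='\t'):
    """Walk top_dir, read each CSV once, and concatenate every
    batch_size files into one DataFrame. Yields one DataFrame
    per completed batch."""
    batch = []
    for root, _dirs, files in os.walk(top_dir):
        for name in sorted(files):
            if not name.endswith('.csv'):
                continue
            batch.append(pd.read_csv(os.path.join(root, name), sep=sep))
            if len(batch) == batch_size:
                yield pd.concat(batch, ignore_index=True)
                batch = []
    if batch:  # leftover files in a final, incomplete batch
        yield pd.concat(batch, ignore_index=True)
```

Appending small DataFrames to a list and calling pd.concat once per batch (as the snippet above already does) is the fast pattern; the thing to avoid is concatenating inside the loop, which recopies all previously read rows on every iteration.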
  • It isn't clear what you are trying to accomplish exactly. Can you please describe exactly what your inputs are (a list of file paths?) and what output you are looking for? (You are creating a DataFrame in your code, but you said you want to output into a separate CSV, so are you just trying to aggregate every 8 files into 1 file on disk, or do you actually need a DataFrame?) Is your only purpose in using pandas to read/write the CSV, or are you actually using the DataFrames? Do the CSV files have identical structure? What is that structure, approximately (is there a header row?)? Commented Feb 25 at 19:23
  • Sorry, it's my first time posting here. I made an edit to the post for more information. Thanks for trying to help out! Commented Feb 25 at 21:54
  • Why do you need to use pandas at all? Just concatenate the files directly. The only complication may be filtering out the duplicate header lines. Commented Feb 25 at 22:04
  • I was going to use the mean and stdev functions in pandas on each column and have the new concatenated file include these values at the bottom of the table. Would it be better to just use shutil to combine the files and then read the result into a DataFrame for that? Commented Feb 25 at 22:19
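A minimal sketch of the direction suggested in the comments: concatenate the files as plain text (keeping only the first header line), then make a single pandas pass over the combined file for the summary rows. The helper name `concat_with_stats` and the tab separator are assumptions based on the question, not code from the thread.

```python
import pandas as pd

def concat_with_stats(paths, out_path, sep='\t'):
    # Concatenate the raw files, writing the header line only once.
    with open(out_path, 'w') as out:
        for i, path in enumerate(paths):
            with open(path) as src:
                header = src.readline()
                if i == 0:
                    out.write(header)
                body = src.read()
                out.write(body if body.endswith('\n') else body + '\n')
    # One pandas pass over the combined file for the mean/stdev rows.
    df = pd.read_csv(out_path, sep=sep)
    stats = df.agg(['mean', 'std'])
    # index=False keeps the stats rows aligned with the data columns
    # (at the cost of dropping the 'mean'/'std' row labels).
    stats.to_csv(out_path, sep=sep, mode='a', header=False, index=False)
    return df
```

This assumes all columns are numeric, as they would be for MSE/RMSE/MAE values; non-numeric columns would need to be excluded before calling `agg`.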

1 Answer


As stated by others, this question is too generic and doesn't provide much information about the issue. However, the best thing you can do is simply read all the files separately and concatenate them in a single call, rather than building up the list by appending constantly.

df1 = pd.read_csv(path_to_file1, ...)
df2 = pd.read_csv(path_to_file2, ...)
df3 = pd.read_csv(path_to_file3, ...)
df4 = pd.read_csv(path_to_file4, ...)
df5 = pd.read_csv(path_to_file5, ...)
df6 = pd.read_csv(path_to_file6, ...)
df7 = pd.read_csv(path_to_file7, ...)
df8 = pd.read_csv(path_to_file8, ...)

df_final = pd.concat(
  [df1, df2, df3, df4, df5, df6, df7, df8],
  **kwargs
)
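For a variable number of files, the eight separate variables above can be collapsed into a list built in one expression, with a single pd.concat call at the end. The helper name `concat_files` and the tab separator are assumptions carried over from the question.

```python
import pandas as pd

def concat_files(paths, sep='\t'):
    # Read each file once, then concatenate everything in a single call.
    return pd.concat(
        [pd.read_csv(p, sep=sep) for p in paths],
        ignore_index=True,
    )
```

This is equivalent in cost to appending to a list in a loop and concatenating once at the end; the expensive anti-pattern is calling pd.concat inside the loop.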

Or you could concatenate just 2 files per execution, store the resulting file, and repeat until only two files remain to concatenate. Note that by "recursively" I don't mean coding a recursive function, since that would be too memory-costly. Create a script that concatenates 2 files and stores the result, then use that result as one of the DataFrames to concatenate in the next execution of the script.


3 Comments

Why would this be more efficient? The efficiency here would be the same. Also, "Create a script to concat 2 files and store the result and then use that result as one of the dfs to concat in the next execution of the script" would be very inefficient. Don't do that.
@juanpa.arrivillaga To be fair, barely anything else could be said with what I had in hand. But I won't deny that you are right.
If the question is unclear, don't bother answering it, wait for them to improve the question.
