
I have thousands of CSV files named file_x_x.csv, where x is a number between 1 and 10000, all in the same folder. Each file contains a header and one row of data:

file_1_1.csv

Name Surname Age Address
Michael O'Donnel 22 George St.

file_2_2.csv

Name Surname Age Address
Mary Jane 34 Camden St.

and so on.

I want to create one single file containing all these rows:

final_file.csv

Name Surname Age Address
Michael O'Donnel 22 George St.
Mary Jane 34 Camden St.

...

My approach:

import pandas as pd
import glob

path = # add path
all_files = glob.glob(path + ".csv") # look for all the csv files in that folder. Probably this is not the right code for looking at them

file_list = []

for filename in all_files:
    df = pd.read_csv(filename)
    file_list(df)

I do not know how to create the single combined file at the end. Can you have a look at the code above and tell me how to get the desired output, and whether I missed something?

  • What you described isn't actually a csv file. CSV means "Comma Separated Values" and they should be separated by , not a space, so pandas won't read the values correctly. Commented Oct 3, 2021 at 23:29
  • @ThomasQ - there are several conventions for CSV - commas, tabs, vertical bars, spaces, semicolons. pandas lets you set the separator. There may be issues with how to deal with spaces internal to a cell that should be escaped, but generally it works. Commented Oct 3, 2021 at 23:32
  • @tdelaney I know there are many conventions, but even using sep=" " , pandas wouldn't have loaded the csv in the example correctly, since the addresses are George St. and Camden St. Commented Oct 3, 2021 at 23:42
  • @ThomasQ - Right. CSV is a term for several similar formats, and these examples aren't conformant with any of them. Best not to try to parse them and just copy if you can. Commented Oct 3, 2021 at 23:46
  • Possibly you can use the glob library and a for loop over all the files which have the extension .csv, read them, and append them to a list of DataFrames. If you have a list of DataFrames generated in a for loop, then use pd.concat([df1, df2]) and save the result outside the for loop. Commented Apr 27, 2023 at 10:11
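The approach described in this comment can be sketched as follows. This is a minimal sketch that creates its own sample files in a temporary folder to stay self-contained; it assumes fields are genuinely single-space separated (which, as the other comments note, breaks on addresses like "George St."), so the Address column is left out here:

```python
import glob
import os
import tempfile

import pandas as pd

# Create two sample files mimicking the question's layout
# (space-separated, one header line and one data row each).
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "file_1_1.csv"), "w") as f:
    f.write("Name Surname Age\nMichael ODonnel 22\n")
with open(os.path.join(folder, "file_2_2.csv"), "w") as f:
    f.write("Name Surname Age\nMary Jane 34\n")

# Read every CSV in the folder into a list of DataFrames ...
frames = [pd.read_csv(name, sep=" ")
          for name in glob.glob(os.path.join(folder, "file_*.csv"))]

# ... and concatenate them into one. Note that pd.concat takes a
# list, i.e. pd.concat([df1, df2]), not pd.concat[df1, df2].
combined = pd.concat(frames, ignore_index=True)
combined.to_csv(os.path.join(folder, "final_file.csv"), index=False)
```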

2 Answers


You don't need to do anything complicated here. You know the header line, and you want the final file to contain it once, followed by every data row. Just open the files, skip the first line, and write. This is far more efficient than holding a bunch of DataFrames in memory.

import glob
import os

path = "."  # set this to the folder containing the CSV files

with open("final_file.csv", "w") as outfile:
    for count, filename in enumerate(glob.glob(os.path.join(path, "file_*.csv"))):
        with open(filename) as infile:
            header = next(infile)
            if count == 0:
                outfile.write(header)  # keep the header from the first file only
            line = next(infile)
            if not line.endswith("\n"):
                line = line + "\n"
            outfile.write(line)
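With thousands of files, an empty or truncated file would make the bare `next(infile)` calls raise `StopIteration` and abort the loop. A defensive variant, as a sketch under the same one-header-one-row assumption (`combine` and its arguments are illustrative names, not from the original answer):

```python
import glob
import os


def combine(folder, outpath="final_file.csv"):
    """Concatenate one-row CSVs in `folder`, keeping a single header."""
    wrote_header = False
    with open(outpath, "w") as outfile:
        for filename in glob.glob(os.path.join(folder, "*.csv")):
            with open(filename) as infile:
                lines = infile.readlines()
            if not lines:
                continue  # skip empty files entirely
            if not wrote_header:
                # header comes from the first non-empty file
                outfile.write(lines[0])
                wrote_header = True
            for line in lines[1:]:  # data row(s), if any
                outfile.write(line if line.endswith("\n") else line + "\n")
```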

1 Comment

Is there a way to not manually write/select the header, but just to keep the one from the first file? There might be more columns in the future. Thanks

I'd suggest using pd.concat to combine the DataFrames into one large DataFrame, which you can then save to a different file if you wish.

Before you concatenate the DataFrames, you may have to modify the call to pd.read_csv to ensure that the data is being processed correctly. If the example data in your question matches the contents of the CSV file verbatim, then the code snippet would look something like this:

import pandas as pd
import glob
import os

path = "/my_path"  # set this to the folder containing the CSV files
names = glob.glob(os.path.join(path, "*.csv"))  # names of all CSV files under path

# If your CSV files use commas to separate fields, then the sep
# argument can be omitted or set to ","
file_list = pd.concat([pd.read_csv(filename, sep=" ") for filename in names])

# save the combined DataFrame to a file
file_list.to_csv("combined_data.csv")

Note that each row in the combined DataFrame will still carry the row index from its source file, creating duplicate row indices. To change that, call pd.DataFrame.reset_index():

file_list = file_list.reset_index(drop=True)
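Equivalently, you can pass `ignore_index=True` to `pd.concat` so the combined frame gets a fresh 0..n-1 index up front, with no separate `reset_index` step. A minimal illustration with two one-row DataFrames standing in for the per-file reads:

```python
import pandas as pd

# Stand-ins for two DataFrames read from individual CSV files.
df1 = pd.DataFrame({"Name": ["Michael"], "Age": [22]})
df2 = pd.DataFrame({"Name": ["Mary"], "Age": [34]})

# ignore_index=True renumbers the rows 0..n-1 during concatenation.
combined = pd.concat([df1, df2], ignore_index=True)
```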

4 Comments

You'll need sep=" " on the read_csv.
@tdelaney I originally made the assumption that the example data in OP's examples was actually just a printed DataFrame, instead of lines from the actual CSV. I'll clarify that further in my response. Thank you for pointing that out.
I received a message ValueError: No objects to concatenate in names = glob.glob(path + ".csv") # get names of all CSV files under path. I guess something is missing here (for example something to select all the csv files, regardless the name)
@LdM apologies, it should be path + "*.csv" instead of path + ".csv". I'll edit my response.
