
I have been struggling with this issue for hours now and I can't seem to figure it out. I would really appreciate any input that would help.

Background

I am trying to automate data manipulation for my research lab at school with Python. Each experiment produces a .csv file containing 41 rows of data (excluding the header), as seen below.

[screenshot: one run's .csv, with columns Unnamed: 0, Wavelength, S2c, Wavelength.1, S2 and 41 data rows]

Sometimes multiple runs of the same experiment exist, producing several .csv files with the same header, and taking an average of them is needed for accuracy. Something like this, with the same number of rows and the same headers:

[screenshot: a second run's .csv with the same headers and the same 41 rows]

So far I have been able to group the basenames so that each group contains only the .csv files with the same parameters, and to load them into a data frame. However, my issue is that I don't know how to continue from there to get an average.

My Current Code and Output

Code:

import pandas as pd
import os

dir = "/Users/luke/Desktop/testfolder"

files = os.listdir(dir)
files_of_interests = {}

for filename in files:
    if filename.endswith('.csv'):
        # runs of the same experiment share a basename, e.g. 'ABC1.csv' and
        # 'ABC2.csv' both map to key 'ABC' ([:-5] drops the run digit and '.csv')
        key = filename[:-5]
        files_of_interests.setdefault(key, []).append(filename)

print(files_of_interests)

for key in files_of_interests:
    stack_df = pd.DataFrame()
    print(stack_df)
    for filename in files_of_interests[key]:
        # DataFrame.append stacks rows below the existing frame, which is what
        # yields the 164 rows (4 files x 41 rows) below; note .append was
        # deprecated in pandas 1.4 and removed in 2.0, with pd.concat as the replacement
        stack_df = stack_df.append(pd.read_csv(os.path.join(dir, filename)))
    print(stack_df)

Output:

Empty DataFrame
Columns: []
Index: []
    Unnamed: 0  Wavelength       S2c  Wavelength.1        S2
0            0        1100  0.000342          1100  0.000304
1            1        1110  0.000452          1110  0.000410
2            2        1120  0.000468          1120  0.000430
3            3        1130  0.000330          1130  0.000306
4            4        1140  0.000345          1140  0.000323
..         ...         ...       ...           ...       ...
36          36        1460  0.002120          1460  0.001773
37          37        1470  0.002065          1470  0.001693
38          38        1480  0.002514          1480  0.002019
39          39        1490  0.002505          1490  0.001967
40          40        1500  0.002461          1500  0.001891

[164 rows x 5 columns]

Question Here!

So my question is: how do I get each file's S2c and S2 to append towards the right as new columns instead?

Explanation:

With multiple .csv files sharing the same header names, appending each one keeps stacking it below the previous file, which led to the [164 rows x 5 columns] from the previous section. My original idea was to create a new data frame and append only S2c and S2 from each of those .csv files, so that instead of stacking on top of one another, they keep getting added as new columns towards the right. Afterwards, I can do some form of pandas column manipulation to have them added up and divided by the number of runs (which is just the number of files, i.e. len(files_of_interests[key]) in the second for loop).
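For illustration, here is a minimal sketch of that idea (the run1.csv and run2.csv names are hypothetical, and it assumes every file shares the same 41-row Wavelength grid from 1100 to 1500):

import os
import pandas as pd

dir = "/Users/luke/Desktop/testfolder"
filenames = ["run1.csv", "run2.csv"]  # hypothetical runs of one experiment

# start from the shared Wavelength grid, then add each run's columns to the right
wide = pd.DataFrame({'Wavelength': range(1100, 1510, 10)})
for i, filename in enumerate(filenames, start=1):
    run = pd.read_csv(os.path.join(dir, filename))
    wide[f'S2c_{i}'] = run['S2c']
    wide[f'S2_{i}'] = run['S2']

# average across runs: mean of the per-run columns, i.e. sum divided by run count
wide['S2c_mean'] = wide.filter(like='S2c_').mean(axis=1)
wide['S2_mean'] = wide.filter(like='S2_').mean(axis=1)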

What I have tried

  1. I have tried creating an empty data frame with a column taken from np.arange(1100, 1500, 10) using pd.DataFrame.from_records(), and appending S2c and S2 to it as described in the previous section. The same issue occurred, and in addition it produced a bunch of NaN values that I am not well equipped to deal with, even after searching further. (Worth noting: np.arange(1100, 1500, 10) stops at 1490 and yields 40 values rather than 41, and that length mismatch alone can introduce NaNs when combining with the 41-row data.)

  2. I have read up on multiple other questions posted here; many suggested using pd.concat, but since those answers were tailored to different situations, I couldn't really replicate them, nor was I able to understand the documentation, so I stopped pursuing this path. (A tiny demo of the behaviour in question follows below.)
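For reference, a tiny self-contained demo of that concat behaviour (example data made up): axis=0, the default, stacks frames downward, which is what happened in the code above, while axis=1 appends them rightward:

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

print(pd.concat([a, b]))           # 4 rows x 1 column: stacked downward
print(pd.concat([a, b], axis=1))   # 2 rows x 2 columns: appended rightward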

Thank you in advance for your help!

Additional Info

I am using macOS and Atom for the code.

The csv files can be found here!

github: https://github.com/teoyi/PROJECT-Automate-Research-Process

Trying out @zabop's method

Code:

dflist = []
for key in files_of_interests:
    for filename in files_of_interests[key]:
        dflist.append(pd.read_csv(os.path.join(dir, filename)))

concat = pd.concat(dflist, axis=1)  # note: this pools files from every key into a single frame
concat.to_csv(dir + '/concat.csv')

Output:

[screenshot: the resulting concat.csv]

Trying @SergeBallesta's method

Code:

df = pd.concat([pd.read_csv(os.path.join(dir, filename))
                for key in files_of_interests for filename in files_of_interests[key]])

df = df.groupby(['Unnamed: 0', 'Wavelength', 'Wavelength.1']).mean().reset_index()
df.to_csv(dir + '/try.csv')
print(df)

Output:

[screenshot: the resulting try.csv]

  • I assume that the first column (which is named Unnamed: 0 here) consistently contains the numbers from 1 to 40. Do the Wavelength[1] columns also contain the exact same data across the different files? Commented Aug 1, 2020 at 7:38
  • What do you mean exactly by "append towards the right individually for each S2c and S2"? Commented Aug 1, 2020 at 7:41
  • @SergeBallesta Yes, your assumption would be right, but Unnamed goes from 0 to 40, making that 41 rows instead. Sorry about that! And yes, Wavelength[1] will always go from 1100 to 1500 with increments of 10. Commented Aug 1, 2020 at 7:55
  • @zabop by that I meant as the for loop goes through each file, it will add columns to the data frame as such: S2c_1, S2_1, S2c_2, S2_2, ... Hope that clears up the confusion! Commented Aug 1, 2020 at 7:57
  • Thanks. Added my solution, maybe I misunderstood, let me know if there are issues. Commented Aug 1, 2020 at 8:11

3 Answers


IIUC you have:

  • a bunch of csv files, each containing a result from the same experiment
  • the first relevant column always contains numbers from 0 to 40 (so there are 41 lines per file)
  • the Wavelength and Wavelength.1 columns always contain the same values, from 1100 to 1500 with a 10 increment
  • but additional columns may exist before the first relevant one
  • the first column has no name in the csv file, and the names of columns up to the first relevant one start with 'Unnamed: '

and you would like to get the average values of the S2 and S2c columns for the same Wavelength value.

This can be done simply with groupby and mean, but we first have to filter out all the unnecessary columns. That can be done through the index_col and usecols parameters of read_csv:

...
print(files_of_interests)

# first concat the datasets:
dfs = [pd.read_csv(os.path.join(dir, filename), index_col=1,
                   usecols=lambda x: not x.startswith('Unnamed: '))
       for key in files_of_interests for filename in files_of_interests[key]]
df = pd.concat(dfs).reset_index()

# then take the averages
df = df.groupby(['Wavelength', 'Wavelength.1']).mean().reset_index()

# reorder columns and add 1 to the index to have it run from 1 to 41
df = df.reindex(columns=['Wavelength', 'S2c', 'Wavelength.1', 'S2'])
df.index += 1

If there are still unwanted columns in the resulting df, this snippet will help identify the original files with an unexpected structure:

import pprint

pprint.pprint([df.columns for df in dfs])  # dfs is the list of frames read above

With the files from the github testfolder, it gives:

[Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
 Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
 Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
 Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
 Index(['Unnamed: 0', 'Unnamed: 0.1', 'Wavelength', 'S2c', 'Wavelength.1',
       'S2'],
      dtype='object'),
 Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object')]

This makes clear that the fifth file has an additional column.


5 Comments

Hi! Thank you for getting back to me. I have given your code a try and have since changed key in the last for loop to files_of_interests[key], as leaving it as key failed to use the filenames, resulting in a file-not-found error. I have also edited my post to show the code and the output. Not quite sure what happened or where it went wrong, as the code made sense to me, but the output turned really chaotic: it now spans [239 rows x 104 columns].
I have added the .csv files to my github repo if that helps make it easier to work with my question! The link can be found under Additional Info, and the files are under testfolder if you are still interested in helping me!
The problem was in the input files. Garbage in, garbage out... But here it looks easy to filter out the unwanted columns.
Thank you for pointing that out! That is most definitely a mistake on my part; it seems to have been due to me messing with the files multiple times. Just one question: what exactly does .reset_index() do?
reset_index takes the current index (or multi-index) and makes it a normal column (or columns, if a multi-index). It creates a brand new index (a range from 0 to size-1). I use it here because groupby().mean() stores the grouping columns in the index.
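To make that last comment concrete, a minimal illustration (example data made up):

import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1, 3, 5]})
out = df.groupby('g').mean()   # the grouping column 'g' becomes the index
out = out.reset_index()        # ...and reset_index turns it back into a normal column
print(out)
#    g    v
# 0  a  2.0
# 1  b  5.0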

If you have a list of dataframes, for example:

import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': [3, 1, 2, 0]}
dflist = [pd.DataFrame.from_dict(data) for _ in range(5)]

You can do:

pd.concat(dflist, axis=1)

Which will look like:

[output: the five example frames side by side, col_1 and col_2 repeated five times]

If you want to suffix each column name with a number indicating which df it came from, do this before the concat:

for index, df in enumerate(dflist):
    df.columns = [col+'_'+str(index) for col in df.columns]

Then pd.concat(dflist, axis=1) results in:

[output: the same frames side by side, with columns renamed col_1_0, col_2_0, col_1_1, col_2_1, ...]


While I can't reproduce your file system and confirm that this works, to create the dflist above from your files, something like this should work:

dflist = []
for key in files_of_interests:
    for filename in files_of_interests[key]:
        dflist.append(pd.read_csv(os.path.join(dir, filename)))

5 Comments

My dict currently has values that are lists of .csv files. Would it still work for the dflist you have? (I would think I probably need another for loop for that, would that be correct?) I think I understand what concat does now based on your answer, so I will definitely give that route another try!
Yeah, it should work. I used dicts to make it reproducible, but you can create dflist any way you like.
(Added a way to produce dflist.)
So I fiddled around with it for a bit; I have edited my post to show my code plus the output I got when saved to csv format. That was definitely not what I expected compared to the examples you have given in your post!
I could have also saved it wrong, but I will definitely revisit the code again tomorrow as it is currently 2 AM. Do let me know what you think though!

It turns out both @zabop and @SergeBallesta have provided me with valuable insights on how to work on this issue through pandas.

What I wanted to have:

  1. Merge the respective S2c and S2 columns of each file within the key:value pairs into one .csv file for further manipulation.

  2. Remove redundant columns so that only a single Wavelength column, ranging from 1100 to 1500 with an increment of 10, remains.

This requires pd.concat, which was suggested by both @zabop and @SergeBallesta, as shown below:

for key in files_of_interests:
    dflist = []  # renamed from 'list', which shadowed the Python builtin
    for filename in files_of_interests[key]:
        dflist.append(pd.read_csv(os.path.join(dir, filename)))
    # concatenate and write once per key, after all of the key's files are read
    df = pd.concat(dflist, axis=1)
    df = df.drop(['Unnamed: 0', 'Wavelength.1'], axis=1)
    print(df)
    df.to_csv(os.path.join(dir, f"{key}_master.csv"))  # key equals filename[:-5]

I had to iterate over files_of_interests[key] so that the loop yields actual filenames and pd.read_csv builds the correct path. Other than that, I added axis=1 to pd.concat, which makes the frames concatenate horizontally, together with the for loops that access the filenames correctly. (I have double-checked the values and they do match up with the respective .csv files.)

The output to .csv looks like this:

[screenshot: one of the resulting *_master.csv files]

The only issue now is that groupby, as suggested by @SergeBallesta, did not work: it returns ValueError: Grouper for 'Wavelength' not 1-dimensional. I will be creating a new question for this if I make no progress by the end of the day.
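For reference, that ValueError usually means the grouping label is duplicated: after concatenating several files along axis=1, the frame holds more than one column named Wavelength, so selecting that label yields a two-dimensional result that groupby cannot use. A minimal reproduction (data made up), with one possible workaround:

import pandas as pd

runs = [pd.DataFrame({'Wavelength': [1100, 1110], 'S2': [0.1, 0.2]})
        for _ in range(2)]
df = pd.concat(runs, axis=1)      # columns: Wavelength, S2, Wavelength, S2
print(df['Wavelength'].ndim)      # 2 -> the label now selects a DataFrame

# keep only the first occurrence of each column label
df = df.loc[:, ~df.columns.duplicated()]
print(df.groupby('Wavelength').mean())  # works now

Note that dropping duplicated labels also discards the extra S2 columns, so for averaging across runs, stacking with the default axis=0 and then grouping (as in @SergeBallesta's answer) avoids the problem altogether.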

Once again, a big thank you to @zabop and @SergeBallesta for giving this a try even though my explanation wasn't too clear; their answers have definitely given me much-needed insight into how pandas works.

