
I have been struggling with this issue for hours now and I can't seem to figure it out. I would really appreciate any input that would help.

Background

I am trying to automate data manipulation for my research lab at school with Python. Each experiment produces a .csv file containing 41 rows of data (excluding the header), as seen below.

[screenshot: one run's .csv, with columns Unnamed: 0, Wavelength, S2c, Wavelength.1, S2 and 41 data rows]

Sometimes multiple runs of the same experiment exist, producing several .csv files with the same header, and taking an average of them is needed for accuracy. Something like this, with the same number of rows and the same headers:

[screenshot: a second run's .csv with the same headers and the same 41 rows]

So far I have been able to group the basenames so that each group contains only the .csv files with the same parameters, and to load them into a data frame. However, my issue is that I don't know how to continue from there to get an average.

My Current Code and Output

Code:

import pandas as pd
import os

dir = "/Users/luke/Desktop/testfolder"

files = os.listdir(dir)
files_of_interests = {}

for filename in files:
    if filename.endswith('.csv'):
        # runs of the same experiment share a basename, e.g. 'ABC1.csv' and
        # 'ABC2.csv' both map to key 'ABC' ([:-5] drops the run digit and '.csv')
        key = filename[:-5]
        files_of_interests.setdefault(key, []).append(filename)

print(files_of_interests)

for key in files_of_interests:
    stack_df = pd.DataFrame()
    print(stack_df)
    for filename in files_of_interests[key]:
        # DataFrame.append stacks rows below the existing frame, which is what
        # yields the 164 rows (4 files x 41 rows) below; note .append was
        # deprecated in pandas 1.4 and removed in 2.0, with pd.concat as the replacement
        stack_df = stack_df.append(pd.read_csv(os.path.join(dir, filename)))
    print(stack_df)

Output:

Empty DataFrame
Columns: []
Index: []
    Unnamed: 0  Wavelength       S2c  Wavelength.1        S2
0            0        1100  0.000342          1100  0.000304
1            1        1110  0.000452          1110  0.000410
2            2        1120  0.000468          1120  0.000430
3            3        1130  0.000330          1130  0.000306
4            4        1140  0.000345          1140  0.000323
..         ...         ...       ...           ...       ...
36          36        1460  0.002120          1460  0.001773
37          37        1470  0.002065          1470  0.001693
38          38        1480  0.002514          1480  0.002019
39          39        1490  0.002505          1490  0.001967
40          40        1500  0.002461          1500  0.001891

[164 rows x 5 columns]

Question Here!

So my question is: how do I get each file's S2c and S2 to append towards the right as new columns instead?

Explanation:

With multiple .csv files sharing the same header names, appending each one keeps stacking it below the previous file, which led to the [164 rows x 5 columns] from the previous section. My original idea was to create a new data frame and append only S2c and S2 from each of those .csv files, so that instead of stacking on top of one another, they keep getting added as new columns towards the right. Afterwards, I can do some form of pandas column manipulation to have them added up and divided by the number of runs (which is just the number of files, i.e. len(files_of_interests[key]) in the second for loop).
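For illustration, here is a minimal sketch of that idea (the run1.csv and run2.csv names are hypothetical, and it assumes every file shares the same 41-row Wavelength grid from 1100 to 1500):

import os
import pandas as pd

dir = "/Users/luke/Desktop/testfolder"
filenames = ["run1.csv", "run2.csv"]  # hypothetical runs of one experiment

# start from the shared Wavelength grid, then add each run's columns to the right
wide = pd.DataFrame({'Wavelength': range(1100, 1510, 10)})
for i, filename in enumerate(filenames, start=1):
    run = pd.read_csv(os.path.join(dir, filename))
    wide[f'S2c_{i}'] = run['S2c']
    wide[f'S2_{i}'] = run['S2']

# average across runs: mean of the per-run columns, i.e. sum divided by run count
wide['S2c_mean'] = wide.filter(like='S2c_').mean(axis=1)
wide['S2_mean'] = wide.filter(like='S2_').mean(axis=1)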

What I have tried

  1. I have tried creating an empty data frame with a column taken from np.arange(1100, 1500, 10) using pd.DataFrame.from_records(), and appending S2c and S2 to it as described in the previous section. The same issue occurred, and in addition it produced a bunch of NaN values that I am not well equipped to deal with, even after searching further. (Worth noting: np.arange(1100, 1500, 10) stops at 1490 and yields 40 values rather than 41, and that length mismatch alone can introduce NaNs when combining with the 41-row data.)

  2. I have read up on multiple other questions posted here; many suggested using pd.concat, but since those answers were tailored to different situations, I couldn't really replicate them, nor was I able to understand the documentation, so I stopped pursuing this path. (A tiny demo of the behaviour in question follows below.)
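For reference, a tiny self-contained demo of that concat behaviour (example data made up): axis=0, the default, stacks frames downward, which is what happened in the code above, while axis=1 appends them rightward:

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

print(pd.concat([a, b]))           # 4 rows x 1 column: stacked downward
print(pd.concat([a, b], axis=1))   # 2 rows x 2 columns: appended rightward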

Thank you in advance for your help!

Additional Info

I am using macOS and Atom for the code.

The csv files can be found here!

github: https://github.com/teoyi/PROJECT-Automate-Research-Process

Trying out @zabop's method

Code:

dflist = []
for key in files_of_interests:
    for filename in files_of_interests[key]:
        dflist.append(pd.read_csv(os.path.join(dir, filename)))

concat = pd.concat(dflist, axis=1)  # note: this pools files from every key into a single frame
concat.to_csv(dir + '/concat.csv')

Output:

[screenshot: the resulting concat.csv]

Trying @SergeBallesta's method

Code:

df = pd.concat([pd.read_csv(os.path.join(dir, filename))
                for key in files_of_interests for filename in files_of_interests[key]])

df = df.groupby(['Unnamed: 0', 'Wavelength', 'Wavelength.1']).mean().reset_index()
df.to_csv(dir + '/try.csv')
print(df)

Output:

[screenshot: the resulting try.csv]

  • I assume that the first column (which is named Unnamed: 0 here) consistently contains the numbers from 1 to 40. Do the Wavelength[1] columns also contain the exact same data across the different files? Commented Aug 1, 2020 at 7:38
  • What do you mean exactly by "append towards the right individually for each S2c and S2"? Commented Aug 1, 2020 at 7:41
  • @SergeBallesta Yes, your assumption would be right, but Unnamed goes from 0 to 40, making that 41 rows instead. Sorry about that! And yes, Wavelength[1] will always go from 1100 to 1500 with increments of 10. Commented Aug 1, 2020 at 7:55
  • @zabop by that I meant as the for loop goes through each file, it will add columns to the data frame as such: S2c_1, S2_1, S2c_2, S2_2, ... Hope that clears up the confusion! Commented Aug 1, 2020 at 7:57
  • Thanks. Added my solution, maybe I misunderstood, let me know if there are issues. Commented Aug 1, 2020 at 8:11

3 Answers


IIUC you have:

  • a bunch of csv files, each containing a result from the same experiment
  • the first relevant column always contains numbers from 0 to 40 (so there are 41 lines per file)
  • the Wavelength and Wavelength.1 columns always contain the same values, from 1100 to 1500 with a 10 increment
  • but additional columns may exist before the first relevant one
  • the first column has no name in the csv file, and the names of columns up to the first relevant one start with 'Unnamed: '

and you would like to get the average values of the S2 and S2c columns for the same Wavelength value.

This can be done simply with groupby and mean, but we first have to filter out all the unnecessary columns. That can be done through the index_col and usecols parameters of read_csv:

...
print(files_of_interests)

# first concat the datasets:
dfs = [pd.read_csv(os.path.join(dir, filename), index_col=1,
                   usecols=lambda x: not x.startswith('Unnamed: '))
       for key in files_of_interests for filename in files_of_interests[key]]
df = pd.concat(dfs).reset_index()

# then take the averages
df = df.groupby(['Wavelength', 'Wavelength.1']).mean().reset_index()

# reorder columns and add 1 to the index to have it run from 1 to 41
df = df.reindex(columns=['Wavelength', 'S2c', 'Wavelength.1', 'S2'])
df.index += 1

If there are still unwanted columns in the resulting df, this snippet will help identify the original files with an unexpected structure:

import pprint

pprint.pprint([df.columns for df in dfs])  # dfs is the list of frames read above

With the files from the github testfolder, it gives:

[Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
 Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
 Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
 Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
 Index(['Unnamed: 0', 'Unnamed: 0.1', 'Wavelength', 'S2c', 'Wavelength.1',
       'S2'],
      dtype='object'),
 Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object')]

This makes clear that the fifth file has an additional column.


5 Comments

Hi! Thank you for getting back to me. I have given your code a try and have since changed key in the last for loop to files_of_interests[key], as leaving it as key failed to use the filenames, resulting in a file-not-found error. I have also edited my post to show the code and the output. Not quite sure what happened or where it went wrong, as the code made sense to me, but the output turned really chaotic: it now spans [239 rows x 104 columns].
I have added the .csv files to my github repo if that helps make it easier to work with my question! The link can be found under Additional Info, and the files are under testfolder if you are still interested in helping me!
The problem was in the input files. Garbage in, garbage out... But here it looks easy to filter out the unwanted columns.
Thank you for pointing that out! That is most definitely a mistake on my part; it seems to have been due to me messing with the files multiple times. Just one question: what exactly does .reset_index() do?
reset_index takes the current index (or multi-index) and makes it a normal column (or columns, if a multi-index). It creates a brand new index (a range from 0 to size-1). I use it here because groupby().mean() stores the grouping columns in the index.
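To make that last comment concrete, a minimal illustration (example data made up):

import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1, 3, 5]})
out = df.groupby('g').mean()   # the grouping column 'g' becomes the index
out = out.reset_index()        # ...and reset_index turns it back into a normal column
print(out)
#    g    v
# 0  a  2.0
# 1  b  5.0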

If you have a list of dataframes, for example:

import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': [3, 1, 2, 0]}
dflist = [pd.DataFrame.from_dict(data) for _ in range(5)]

You can do:

pd.concat(dflist, axis=1)

Which will look like:

[output: the five example frames side by side, col_1 and col_2 repeated five times]

If you want to suffix each column name with a number indicating which df it came from, do this before the concat:

for index, df in enumerate(dflist):
    df.columns = [col+'_'+str(index) for col in df.columns]

Then pd.concat(dflist, axis=1) results in:

[output: the same frames side by side, with columns renamed col_1_0, col_2_0, col_1_1, col_2_1, ...]


While I can't reproduce your file system and confirm that this works, to create the dflist above from your files, something like this should work:

dflist = []
for key in files_of_interests:
    for filename in files_of_interests[key]:
        dflist.append(pd.read_csv(os.path.join(dir, filename)))

5 Comments

My dict currently has values that are lists of .csv files. Would it still work for the dflist you have? (I would think I probably need another for loop for that, would that be correct?) I think I understand what concat does now based on your answer, so I will definitely give that route another try!
Yeah, it should work. I used dicts to make it reproducible, but you can create dflist any way you like.
(Added a way to produce dflist.)
So I fiddled around with it for a bit; I have edited my post to show my code plus the output I got when saved to csv format. That was definitely not what I expected compared to the examples you have given in your post!
I could have also saved it wrong, but I will definitely revisit the code again tomorrow as it is currently 2 AM. Do let me know what you think though!

It turns out both @zabop and @SergeBallesta have provided me with valuable insights on how to work on this issue through pandas.

What I wanted to have:

  1. Merge the respective S2c and S2 columns of each file within the key:value pairs into one .csv file for further manipulation.

  2. Remove redundant columns so that only a single Wavelength column, ranging from 1100 to 1500 with an increment of 10, remains.

This requires pd.concat, which was suggested by both @zabop and @SergeBallesta, as shown below:

for key in files_of_interests:
    dflist = []  # renamed from 'list', which shadowed the Python builtin
    for filename in files_of_interests[key]:
        dflist.append(pd.read_csv(os.path.join(dir, filename)))
    # concatenate and write once per key, after all of the key's files are read
    df = pd.concat(dflist, axis=1)
    df = df.drop(['Unnamed: 0', 'Wavelength.1'], axis=1)
    print(df)
    df.to_csv(os.path.join(dir, f"{key}_master.csv"))  # key equals filename[:-5]

I had to iterate over files_of_interests[key] so that the loop yields actual filenames and pd.read_csv builds the correct path. Other than that, I added axis=1 to pd.concat, which makes the frames concatenate horizontally, together with the for loops that access the filenames correctly. (I have double-checked the values and they do match up with the respective .csv files.)

The output to .csv looks like this:

[screenshot: one of the resulting *_master.csv files]

The only issue now is that groupby, as suggested by @SergeBallesta, did not work: it returns ValueError: Grouper for 'Wavelength' not 1-dimensional. I will be creating a new question for this if I make no progress by the end of the day.
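For reference, that ValueError usually means the grouping label is duplicated: after concatenating several files along axis=1, the frame holds more than one column named Wavelength, so selecting that label yields a two-dimensional result that groupby cannot use. A minimal reproduction (data made up), with one possible workaround:

import pandas as pd

runs = [pd.DataFrame({'Wavelength': [1100, 1110], 'S2': [0.1, 0.2]})
        for _ in range(2)]
df = pd.concat(runs, axis=1)      # columns: Wavelength, S2, Wavelength, S2
print(df['Wavelength'].ndim)      # 2 -> the label now selects a DataFrame

# keep only the first occurrence of each column label
df = df.loc[:, ~df.columns.duplicated()]
print(df.groupby('Wavelength').mean())  # works now

Note that dropping duplicated labels also discards the extra S2 columns, so for averaging across runs, stacking with the default axis=0 and then grouping (as in @SergeBallesta's answer) avoids the problem altogether.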

Once again, a big thank you to @zabop and @SergeBallesta for giving this a try even though my explanation wasn't too clear; their answers have definitely given me much-needed insight into how pandas works.

