
I have multiple csv files in a folder. The objective is to append the csv files into a single pandas DataFrame.

The question is: how can we use pandas to concatenate all the files in the folder while associating a specific key with each piece of the resulting DataFrame, via the keys argument of pd.concat?

This means that we can now select out each chunk by key:
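As a minimal sketch of what the keys argument does, using two small in-memory frames in place of the csv files (the frame contents mirror the example files below):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1'], 'C': ['C0', 'C1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3'], 'C': ['C2', 'C3']})

# keys= labels each piece, producing a MultiIndex on the result
combined = pd.concat([df1, df2], keys=['Book1', 'Book2'])

# each chunk can now be selected by its key
print(combined.loc['Book1'])
```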

For example, given two csv files in a folder, each with three columns (A, B, C) and two rows.

CSV File: Book1

A0 B0 C0

A1 B1 C1

and

CSV File: Book2

A2 B2 C2

A3 B3 C3

The expected frame looks like this:

             A   B   C
Book1   0   A0  B0  C0
        1   A1  B1  C1
Book2   0   A2  B2  C2
        1   A3  B3  C3

Notice the index Book1 and Book2 in the left column; each name comes from the corresponding csv file.

So far, I have the following code

import glob

# match the pattern 'csv' in the folder
extension = 'csv'
all_filenames = glob.glob('*.{}'.format(extension))

But what do I need to change in the following line of code to achieve this objective?

combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

The reason for adding these keys is to make access easier later. That would typically be done with:

.loc['Book1']

3 Answers


You can add an extra column to each dataframe using the assign method; this can be done after each file is read and before the frames are concatenated:

combined_csv = pd.concat([pd.read_csv(f).assign(name=f) for f in all_filenames ])

This adds a name column whose values are all equal to the file name f.

Once all the datasets are concatenated, you can set a MultiIndex:

combined_csv.reset_index(drop=True, inplace=True)

combined_csv.set_index(['name', combined_csv.index], inplace=True)
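Putting the pieces of this answer together, a self-contained sketch (the temporary folder and the Book1.csv/Book2.csv file names are for illustration only):

```python
import glob
import os
import tempfile

import pandas as pd

# write two sample csv files into a temporary folder
tmpdir = tempfile.mkdtemp()
pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1'], 'C': ['C0', 'C1']}).to_csv(
    os.path.join(tmpdir, 'Book1.csv'), index=False)
pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3'], 'C': ['C2', 'C3']}).to_csv(
    os.path.join(tmpdir, 'Book2.csv'), index=False)

all_filenames = sorted(glob.glob(os.path.join(tmpdir, '*.csv')))

# tag each frame with its file name (folder and extension stripped), then concat
combined_csv = pd.concat(
    [pd.read_csv(f).assign(name=os.path.splitext(os.path.basename(f))[0])
     for f in all_filenames])

# promote the name column (plus the row position) to a MultiIndex
combined_csv.reset_index(drop=True, inplace=True)
combined_csv.set_index(['name', combined_csv.index], inplace=True)

print(combined_csv.loc['Book1'])
```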

5 Comments

And thereafter: combined_csv.set_index("name")
Thanks for the quick response, but I would prefer to make the key the index rather than creating another column, for easier access using .loc, as shown in this link. I appreciate your time and suggestion, though.
See the comment above, it addresses your particular need
Hi @SIA, may I know if there is another way, instead of creating a new column as you suggested?
I believe your goal is to create a MultiIndex dataframe, so one way or another you need to add a second index level. Adding a column and later setting it as the index is the way I am aware of.

Find the code below:

import pandas as pd

dfs = []
for f in all_filenames:
    df = pd.read_csv(f)
    df['index_name'] = f.split('.')[0]  # file name without extension
    dfs.append(df)

df_combined = pd.concat(dfs)
df_combined.set_index('index_name', inplace=True)
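A runnable sketch of the same approach, with in-memory frames standing in for the files read from all_filenames (the Book1/Book2 names are assumptions for illustration):

```python
import pandas as pd

# stand-ins for the frames that pd.read_csv would return per file
frames = {
    'Book1': pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1'], 'C': ['C0', 'C1']}),
    'Book2': pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3'], 'C': ['C2', 'C3']}),
}

dfs = []
for name, df in frames.items():
    df = df.copy()
    df['index_name'] = name  # mimics f.split('.')[0]
    dfs.append(df)

df_combined = pd.concat(dfs)
df_combined.set_index('index_name', inplace=True)

# rows for one file can now be fetched by label
print(df_combined.loc['Book1'])
```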

3 Comments

Hi, thanks for the quick reply, but I would prefer to make the key the index rather than creating another column, for easier access using .loc, as shown in this link. I appreciate your time and suggestion, though.
With the above code, you can use .loc to fetch the data for a particular index, right?
Yes, you are right, using: df_combined.loc[df_combined['index_name']=='Book1'].

You could create a dataframe for each file, add a column recording which book it came from, and then append it to the combined_csv dataframe.

books = ['book1', 'book2', ..., 'bookn']

i = 1

combined_csv = pd.DataFrame(columns=['Book', 'A', 'B', 'C'])

for book in books:
    data = pd.read_csv('book{}.csv'.format(i))
    data.insert(0, 'Book', 'Book{}'.format(i))
    combined_csv = combined_csv.append(data, ignore_index=True)
    i += 1

combined_csv.set_index('Book', inplace=True)

Let me know if this helps.
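Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same idea is usually written by collecting the frames and concatenating once (in-memory frames stand in for the book{}.csv files here):

```python
import pandas as pd

# stand-ins for pd.read_csv('book1.csv'), pd.read_csv('book2.csv'), ...
frames = [
    pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1'], 'C': ['C0', 'C1']}),
    pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3'], 'C': ['C2', 'C3']}),
]

labelled = []
for i, data in enumerate(frames, start=1):
    data = data.copy()
    data.insert(0, 'Book', 'Book{}'.format(i))  # record which book it came from
    labelled.append(data)

combined_csv = pd.concat(labelled, ignore_index=True)
combined_csv.set_index('Book', inplace=True)
print(combined_csv)
```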

2 Comments

Thanks for the quick response, but your suggestion does not answer the OP's question.
See my edit, if this does not do what you want then feel free to ignore.
