How can I sort and concat a csv file in a dataframe

Question

I'm working on a project where I need to handle a lot of CSV files, which contain data like this:

    CSV1.csv

         A     B     C     D    ...
    1    1980  1     0.9   0.8
    2    2003  0.9   0.8   0.2
    3    1665  0.7   0.2   0.4
    4    1982  0.6   1     0.2
    ...

    CSV2.csv

         A     E     F     G    ...
    1    1665  1     0.4   1
    2    1980  0.4   0.8   0.6
    3    2003  0.1   0.3   0.9
    4    1982  0.3   1     0.6
    ...

All of the CSV files have the same values in the A column, but the rows are in a different order in each file. I am importing all the files like this:

import glob
import pandas as pd

path = r"/Users/.../folder/"
all_files = glob.glob(path + "/*.CSV")
all_csv = (pd.read_csv(f, sep=',') for f in all_files)
df_merged = pd.concat(all_csv, axis=1, ignore_index=False)

The files do get combined, but the rows of the resulting dataframe are misaligned.

Sorting the result afterwards with df_merged.sort_values() doesn't help, because each file's A column is in a different order, so the rows are already mismatched after the concatenation. I know that I could import each CSV file manually and sort it, but there are 394 of them...

I feel like a per-file sort should be possible during a bulk import of CSV files, but I don't know how to run a line of code on each dataframe before they are combined (all_csv is a generator object).

P.S. At the end I run this to drop the repeated A columns:

df_merged = df_merged.loc[:, ~df_merged.columns.duplicated()]

score 1 · Accepted Answer · 2021-12-11 18:13:25Z

1

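Instead of using concat, merge the dataframes one at a time. merge matches rows on the shared A column rather than on their position:

# all_csv must be a list (see the note below)
df = all_csv[0]
for other in all_csv[1:]:
    # join on the shared key column; the row order of each file does not matter
    df = df.merge(other, on='A')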

Output:

>>> df
      A    B    C    D    E    F    G
0  1980  1.0  0.9  0.8  0.4  0.8  0.6
1  2003  0.9  0.8  0.2  0.1  0.3  0.9
2  1665  0.7  0.2  0.4  1.0  0.4  1.0
3  1982  0.6  1.0  0.2  0.3  1.0  0.6

Note: all_csv must be a list rather than a generator, because a generator cannot be indexed or sliced:

all_csv = [pd.read_csv(f, sep=',') for f in all_files]
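
As a hedged illustration of the merge approach, the sketch below uses two small in-memory tables standing in for CSV1.csv and CSV2.csv (the io.StringIO inputs and the variable names are assumptions, not part of the original answer). It folds the list with functools.reduce and passes validate="one_to_one" so that pandas raises an error if A contains duplicate keys in either frame.

```python
import io
from functools import reduce

import pandas as pd

# Hypothetical in-memory stand-ins for CSV1.csv and CSV2.csv from the question.
csv1 = io.StringIO("A,B,C,D\n1980,1,0.9,0.8\n2003,0.9,0.8,0.2\n1665,0.7,0.2,0.4\n1982,0.6,1,0.2\n")
csv2 = io.StringIO("A,E,F,G\n1665,1,0.4,1\n1980,0.4,0.8,0.6\n2003,0.1,0.3,0.9\n1982,0.3,1,0.6\n")

frames = [pd.read_csv(f) for f in (csv1, csv2)]

# Fold the list into a single frame, joining on A at every step.
# validate="one_to_one" makes pandas raise MergeError on duplicate keys.
merged = reduce(
    lambda left, right: left.merge(right, on="A", validate="one_to_one"),
    frames,
)
print(merged)
```

The default inner join keeps only the A values present in every file; if some files might be missing rows, how="outer" would keep them and fill the gaps with NaN.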

edited Dec 11, 2021 at 18:13

answered Dec 11, 2021 at 17:33

user17242583

2 Comments

Juank Over a year ago

Thanks for your answer, I think it worked. To apply what you said, I had to convert all_csv from a generator object to a list, like this: all_csv = list(all_csv)

user17242583 Over a year ago

A better way to do that is to not create a generator at all. Use square brackets instead of parentheses when initializing all_csv, like this: all_csv = [pd.read_csv(f, sep=',') for f in all_files]

hpchavaz · Accepted Answer · 2021-12-12 10:43:30Z

0

Alignment can be obtained by setting A as the index.

Holding every dataframe in a list is unappealing, as it can take a lot of memory.
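
To illustrate the alignment, here is a minimal sketch that assumes two tiny in-memory tables in place of the real files (the sources list is hypothetical). Once each frame is indexed by A, pd.concat with axis=1 matches rows by index value, and it can consume the generator directly. The sort_index() call is there only to give the rows a predictable order.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for two of the CSV files.
sources = [
    "A,B\n1980,1.0\n2003,0.9\n1665,0.7\n",
    "A,E\n1665,1.0\n1980,0.4\n2003,0.1\n",
]

# Each frame is indexed by A before concatenation, so rows are matched
# by their A value rather than by their position in the file.
indexed = (pd.read_csv(io.StringIO(text)).set_index("A") for text in sources)
aligned = pd.concat(indexed, axis=1).sort_index()
print(aligned)
```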

Solution 1: build each additional dataframe inside the loop

import glob
import pandas as pd

path = r"/Users/.../folder/"
all_files = glob.glob(path + "/*.CSV")

# Index every frame by A so that concat aligns rows on A instead of on position
df = pd.read_csv(all_files[0], sep=',').set_index('A')
for f in all_files[1:]:
    dfs = pd.read_csv(f, sep=',').set_index('A')
    df = pd.concat([df, dfs], axis=1)

Solution 2: keep the generator and use functools.reduce

import glob
import pandas as pd
from functools import reduce

path = r"/Users/.../folder/"
all_files = glob.glob(path + "/*.CSV")

# Files are read lazily: only the accumulated frame and the current one are held at once
def_gen = (pd.read_csv(fn, sep=',').set_index('A') for fn in all_files)
df = reduce(lambda df, d: pd.concat([df, d], axis=1), def_gen)

df:

        B    C    D    E    F    G
A                                 
1665  0.7  0.2  0.4  1.0  0.4  1.0
1980  1.0  0.9  0.8  0.4  0.8  0.6
1982  0.6  1.0  0.2  0.3  1.0  0.6
2003  0.9  0.8  0.2  0.1  0.3  0.9

Personally, I would take the easy path ("solution 1") and add some logging to identify which file triggers an import error, because real-world data is rarely clean and well formatted.
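
One way the logging idea might look is sketched below. The helper names load_indexed and combine are hypothetical, and the particular checks (unreadable file, missing key column, duplicate keys) are assumptions about what could go wrong, not something the answer specifies.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)


def load_indexed(path, key="A"):
    """Read one CSV and index it by `key`; return None if the file is unusable."""
    try:
        frame = pd.read_csv(path, sep=",")
    except (OSError, pd.errors.ParserError, pd.errors.EmptyDataError) as exc:
        log.error("could not read %s: %s", path, exc)
        return None
    if key not in frame.columns:
        log.error("%s has no column %r, skipping", path, key)
        return None
    if frame[key].duplicated().any():
        # Duplicate keys can make the later column-wise concat fail.
        log.warning("%s has duplicate %r values", path, key)
    return frame.set_index(key)


def combine(paths, key="A"):
    """Concatenate every readable file column-wise, aligned on `key`."""
    loaded = (load_indexed(p, key) for p in paths)
    frames = [f for f in loaded if f is not None]
    return pd.concat(frames, axis=1)
```

With the paths collected by glob.glob, this could be called as df = combine(all_files). Files that fail a check are reported and left out instead of aborting the whole import.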

edited Dec 12, 2021 at 10:43

answered Dec 11, 2021 at 20:05

hpchavaz

3 Comments

Juank Over a year ago

Thank you, I like what you are doing... but in the first option, how could I make set_index('A') pick the first column if it does not always have the same name?

hpchavaz Over a year ago

@Juank, to concatenate, join, merge, etc. you need at least one key. Either the key is explicit (it has a name) or it is implicit (the first field). Can you provide a subset (or mockup) of your data, as it seems that CSV1.csv and CSV2.csv do not exactly match your data (i.e. they have a common field named 'A'), and confirm that you want to align on the first field?
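
If the key really is always the first field, whatever its header says, one possible approach is read_csv's index_col=0. The sketch below assumes hypothetical headers ("year", "Year_ID") and in-memory data. Renaming the index to "A" is an arbitrary choice so that the result carries a single label.

```python
import io

import pandas as pd

# Hypothetical files whose key column has a different name in each file.
sources = [
    "year,B\n1980,1.0\n1665,0.7\n",
    "Year_ID,E\n1665,1.0\n1980,0.4\n",
]

# index_col=0 uses the first column as the index, regardless of its header.
frames = [pd.read_csv(io.StringIO(text), index_col=0) for text in sources]
for frame in frames:
    frame.index.name = "A"  # assumed common label for the key

combined = pd.concat(frames, axis=1).sort_index()
print(combined)
```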

hpchavaz Over a year ago

@Juank, yes, the code works. But frankly, I would take the easy path ("solution 1") and add some logging to identify which file triggers an import error, because real-world data is rarely clean and well formatted.
