
I'm currently in a project and I need to work with a lot of CSV files, which are filled with data something like this:

    CSV1.csv

          A     B     C     D    ...
    1  1980   1     0.9   0.8
    2  2003   0.9   0.8   0.2
    3  1665   0.7   0.2   0.4
    4  1982   0.6   1     0.2
    ...

    CSV2.csv

          A     E     F     G    ...
    1  1665   1     0.4   1
    2  1980   0.4   0.8   0.6
    3  2003   0.1   0.3   0.9
    4  1982   0.3   1     0.6
    ...
  

All of the CSV files contain the same values in column A, but the rows appear in a different order in each file. I am importing all the files like this:

import glob
import pandas as pd

path = r"/Users/.../folder/"
all_files = glob.glob(path + "/*.CSV")
all_csv = (pd.read_csv(f, sep=',') for f in all_files)
df_merged = pd.concat(all_csv, axis=1, ignore_index=False)

The files get merged, but the rows of the resulting dataframe are misaligned.

Sorting afterwards with df_merged.sort() does not help, because the rows from each file were placed side by side by position, so no single A column matches the order of all the others. I know that I could import each CSV file manually and apply .sort() to it, but there are 394 CSV files...

I feel that a per-file sort should be possible in a bulk import like this, but I don't know how to run a line of code on each dataframe before they are combined (all_csv is a generator object).


P.S. At the end I run this to drop the repeated A columns:

df_merged = df_merged.loc[:, ~df_merged.columns.duplicated()]

2 Answers

1

Instead of using concat, you should merge each dataframe together:

df = all_csv[0]
for other in all_csv[1:]:
    # merge() joins on the columns both frames share (here only A)
    df = df.merge(other)

Output:

>>> df
      A    B    C    D    E    F    G
0  1980  1.0  0.9  0.8  0.4  0.8  0.6
1  2003  0.9  0.8  0.2  0.1  0.3  0.9
2  1665  0.7  0.2  0.4  1.0  0.4  1.0
3  1982  0.6  1.0  0.2  0.3  1.0  0.6

Note: you need to make all_csv a list instead of a generator:

all_csv = [pd.read_csv(f, sep=',') for f in all_files]
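
As a minimal sketch of the same idea, the loop can also be written with functools.reduce and an explicit key. The in-memory strings below stand in for CSV1.csv and CSV2.csv from the question, and the names csv1, csv2, frames and merged are only illustrative. Passing on="A" and validate="one_to_one" is an assumption about the data (A is unique within each file); pandas raises a MergeError if that does not hold.

```python
import io
from functools import reduce

import pandas as pd

# In-memory stand-ins for CSV1.csv and CSV2.csv from the question.
csv1 = io.StringIO(
    "A,B,C,D\n1980,1,0.9,0.8\n2003,0.9,0.8,0.2\n1665,0.7,0.2,0.4\n1982,0.6,1,0.2\n"
)
csv2 = io.StringIO(
    "A,E,F,G\n1665,1,0.4,1\n1980,0.4,0.8,0.6\n2003,0.1,0.3,0.9\n1982,0.3,1,0.6\n"
)

frames = [pd.read_csv(f) for f in (csv1, csv2)]

# Merge pairwise on A; validate="one_to_one" fails loudly if A has
# duplicate values in either frame instead of silently multiplying rows.
merged = reduce(
    lambda left, right: left.merge(right, on="A", how="inner", validate="one_to_one"),
    frames,
)
print(merged)
```

With how="inner", rows whose A value is missing from any file are dropped; how="outer" would keep them and fill the gaps with NaN.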

2 Comments

Thanks for your answer, I think it worked. To apply what you said I had to convert all_csv from a generator object to a list, like this: all_csv = list(all_csv)
A better way to do that is to not create a generator at all. Use square brackets instead of parentheses when initializing all_csv, like this: all_csv = [pd.read_csv(f, sep=',') for f in all_files]
0
  1. Alignment can be obtained by setting A as the index.

  2. Using a list of dataframes is not appealing as this can take a lot of memory.

    • solution 1: Build the other dataframes in the loop
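    import glob
    import pandas as pd

    path = r"/Users/.../folder/"
    all_files = glob.glob(path + "/*.CSV")

    # Start from the first file, then attach the others one at a time
    df = pd.read_csv(all_files[0], sep=',').set_index('A')
    for f in all_files[1:]:
        dfs = pd.read_csv(f, sep=',').set_index('A')
        # axis=1 aligns rows on the A index rather than on row position
        df = pd.concat([df, dfs], axis=1)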
    
    • solution 2: Keep the generator and use functools.reduce
    import glob
    from functools import reduce

    import pandas as pd

    path = r"/Users/.../folder/"
    all_files = glob.glob(path + "/*.CSV")

    # Lazily read each file and index it on A
    df_gen = (pd.read_csv(fn, sep=',').set_index('A') for fn in all_files)
    df = reduce(lambda df, d: pd.concat([df, d], axis=1), df_gen)
    

    df:

            B    C    D    E    F    G
    A                                 
    1665  0.7  0.2  0.4  1.0  0.4  1.0
    1980  1.0  0.9  0.8  0.4  0.8  0.6
    1982  0.6  1.0  0.2  0.3  1.0  0.6
    2003  0.9  0.8  0.2  0.1  0.3  0.9
    

Personally, I would take the easy path ("solution 1") and add some logging to identify which files fail to import, because real-world data is rarely clean and well formatted.
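
A rough sketch of what "solution 1 plus logging" could look like follows. The function name load_aligned, the logger name csv_import, and the policy of skipping a bad file rather than aborting are all assumptions made for illustration, not part of the original answer.

```python
import glob
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("csv_import")  # hypothetical logger name


def load_aligned(pattern, key="A"):
    """Read every CSV matching `pattern`, index each on `key`, join column-wise."""
    combined = None
    for path in sorted(glob.glob(pattern)):
        try:
            frame = pd.read_csv(path, sep=",").set_index(key)
        except (KeyError, ValueError, OSError) as exc:
            # KeyError: the key column is missing; ValueError covers pandas
            # parser errors and empty files; OSError: unreadable file.
            logger.warning("Skipping %s: %s", path, exc)
            continue
        if not frame.index.is_unique:
            # Duplicate keys would make the column-wise alignment ambiguous.
            logger.warning("Skipping %s: duplicate values in column %r", path, key)
            continue
        logger.info("Loaded %s with shape %s", path, frame.shape)
        combined = frame if combined is None else pd.concat([combined, frame], axis=1)
    if combined is None:
        raise ValueError(f"no usable CSV files matched {pattern!r}")
    return combined
```

It would be called with a glob pattern, for example load_aligned(path + "/*.CSV") using the path variable from the snippets above. The warnings then name each file that was left out, which is easier to act on than a single traceback partway through 394 files.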

3 Comments

Thank you, I like what you are doing... but in the first option, how could I select the first column for set_index('A') if it does not always have the same name?
@Juank, to concatenate, join, merge, etc. you need at least one key. Either the key is explicit (it has a name) or it is implicit (the first field). Can you provide a subset (or mockup) of your data? It seems that CSV1.csv and CSV2.csv do not exactly match your data (i.e. they have a common field named 'A'). Please also confirm that you want to align on the first field.
@Juank, yes, the code works. But frankly, I would take the easy path ("solution 1") and add some logging to identify which files fail to import, because real-world data is rarely clean and well formatted.
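
On the question in the comments about a key column whose name varies: if the key is always the first column, one possible approach is to select it by position with index_col=0 in read_csv rather than by name. The sketch below assumes exactly that layout; the headers year and Year_ID and the index name key are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical inputs: the key is always the first column, but its header varies.
csv1 = io.StringIO("year,B,C\n1980,1,0.9\n2003,0.9,0.8\n1665,0.7,0.2\n")
csv2 = io.StringIO("Year_ID,E,F\n1665,1,0.4\n1980,0.4,0.8\n2003,0.1,0.3\n")

frames = []
for source in (csv1, csv2):
    frame = pd.read_csv(source, index_col=0)  # first column becomes the index
    frame.index.name = "key"                  # normalise the differing header names
    frames.append(frame)

# Rows are matched on the index values, not on their position in each file.
aligned = pd.concat(frames, axis=1)
print(aligned)
```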
