I'm reading a number of CSV files into Python using glob matching and would like to add the filename as a column in each of the dataframes. I'm currently matching on a pattern and then using a generator to read in the files, like so:

base_list_of_files = glob.glob(matching_pattern)

loaded_csv_data_frames = (pd.read_csv(csv, encoding= 'latin-1') for csv in base_list_of_files)    

for idx, df in enumerate(loaded_csv_data_frames):

    df['file_origin'] = base_list_of_files[idx]

combined_data = pd.concat(loaded_csv_data_frames)

However, I get the error ValueError: No objects to concatenate when I do the concatenation. Why does adding the column iteratively break the list of dataframes?

  • concat needs a list of DataFrames to concatenate, and here you are passing only one. Secondly, the second line is not accumulating, so it will only have the last CSV in it. Commented Oct 21, 2022 at 12:36

1 Answer

Generators can only be iterated once. When they run out of items they raise a StopIteration exception, which the for loop handles automatically. If you try to consume them again, they just raise StopIteration immediately, as demonstrated here:

def consume(gen):
    while True:
        try:
            print(next(gen))
        except StopIteration:
            # The generator is exhausted; report it and stop looping
            print("Stop iteration")
            break
>>> gen = (i for i in range(2))
>>> consume(gen)
0
1
Stop iteration
>>> consume(gen)
Stop iteration

That's why you get the ValueError: the for loop has already exhausted loaded_csv_data_frames, so pd.concat finds no objects left when you use the generator a second time.

I cannot replicate your example, but here is something that should be similar enough:

import pandas as pd

df1 = pd.DataFrame(0, columns=["a", "b"], index=[0, 1])
df2 = pd.DataFrame(1, columns=["a", "b"], index=[0, 1])
loaded_csv_data_frames = iter((df1, df2))  # Pretend that these are read from a csv file
base_list_of_files = iter(("df1.csv", "df2.csv"))  # Pretend these file names come from glob

You can add the file of origin as a key when you concatenate. Add names too to give titles to your index levels.

>>> df = pd.concat(loaded_csv_data_frames, keys=base_list_of_files, names=["file_origin", "index"])
>>> df
                  a   b
file_origin index       
df1.csv     0     0   0
            1     0   0
df2.csv     0     1   1
            1     1   1

If you want file_origin to be one of your columns, just reset the first level of the index.

>>> df.reset_index("file_origin")
    file_origin a   b
index           
0   df1.csv     0   0
1   df1.csv     0   0
0   df2.csv     1   1
1   df2.csv     1   1
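
Going back to the code in the question, another option is to attach the file name while each file is being read, so that the generator is consumed only once, by pd.concat itself. The following is a minimal sketch under assumptions: the temporary directory, the file names first.csv and second.csv, and their contents are made up so that the snippet runs on its own, and matching_pattern stands in for your real pattern.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical setup: write two small CSV files so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
for name, value in (("first.csv", 0), ("second.csv", 1)):
    pd.DataFrame({"a": [value, value], "b": [value, value]}).to_csv(
        os.path.join(tmp_dir, name), index=False
    )

matching_pattern = os.path.join(tmp_dir, "*.csv")  # stand-in for the real pattern
base_list_of_files = sorted(glob.glob(matching_pattern))

# Add the column inside the generator expression, so nothing iterates over
# the generator before pd.concat does.
loaded_csv_data_frames = (
    pd.read_csv(path, encoding="latin-1").assign(file_origin=path)
    for path in base_list_of_files
)
combined_data = pd.concat(loaded_csv_data_frames, ignore_index=True)
```

If you would rather keep the explicit for loop, building a list with square brackets instead of a generator with parentheses should also work, because a list can be iterated more than once.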