adding columns to dataframe based on file name in python

Question

I have several txt files with different file names. I would like to do two things:

1.) Load the data all at once 2.) using partial parts from the file name and add it for the dedicated dataframe as additional column 3.) adding the files together

I have below a really really manual example but want to automize it somehow. How is that possible?

The code looks like the following

import pandas as pd 

#load data files
data1 = pd.read_csv('C:/file1_USA_Car_1d.txt')
data2 = pd.read_csv('C:/file2_USA_Car_2d.txt')
data3 = pd.read_csv('C:/file3_USA_Car_1m.txt')
data4 = pd.read_csv('C:/file3_USA_Car_6m.txt')
data5 = pd.read_csv('C:file3_USA_Car_1Y.txt')

df = pd.DataFrame()

print(df)

df = data1

#--> The input for the column below should be taken from the name of the file 
df['country'] = 'USA'
df['Type'] = 'Car'
df['duration'] = '1d'

print(df)

MCM · Accepted Answer · 2018-08-15 08:08:13Z

1

Iterate over your files with glob and do some simple splitting on the filenames.

import glob
import pandas as pd

df_list = []
for file in glob.glob('C:/file1_*_*_*.txt'):
    # Tweak this to work for your actual filepaths, if needed.
    country, typ, dur = file.split('.')[0].split('_')[1:]  
    df = (pd.read_csv(file)
            .assign(Country=country, Type=typ, duration=dur))
    df_list.append(df)

df = pd.concat(df_list)

edited Aug 15, 2018 at 8:08

MCM

1,5113 gold badges19 silver badges22 bronze badges

answered Aug 14, 2018 at 16:14

cs95

406k106 gold badges744 silver badges797 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

MCM Over a year ago

thanks this is the way i was looking for. but somehow i receive the folloiwng error message: AttributeError: 'DataFrame' object has no attribute 'df'

MCM Over a year ago

thanks a lot. I updated the code including import pandas.

Steven · Accepted Answer · 2018-08-14 16:09:53Z

0

One way of doing it would be to do this:

all_res = pd.DataFrame()
file_list = ['C:/file1_USA_Car_1d.txt', 'C:/file3_USA_Car_1m.txt', 'etc']
for file_name in file_list:
    tmp = pd.read_csv(file_name)
    tmp['file_name'] = file_name
    all_res = all_res.append(tmp)

all_res = all_res.reset_index()

answered Aug 14, 2018 at 16:09

Steven

8541 gold badge8 silver badges23 bronze badges

1 Comment

MCM Over a year ago

thanks a lot for the answer. This uses the full string, but I just want partial parts of it. Is that also possible?

PMende · Accepted Answer · 2018-08-14 16:27:49Z

0

I'd do something like the following:

from pathlib import Path
from operator import itemgetter

import pandas as pd

file_paths = [
    Path(path_str)
    for path_str in (
        'C:/file1_USA_Car_1d.txt', 'C:/file2_USA_Car_2d.txt',
        'C:/file3_USA_Car_1m.txt', 'C:/file3_USA_Car_6m.txt',
        'C:file3_USA_Car_1Y.txt')
]

def import_csv(csv_path):
    df = pd.read_csv(csv_path)
    df['country'], df['Type'], df['duration'] = itemgetter(1, 2, 3)(csv_path.stem.split('_'))
    return df

dfs = [import_csv(csv_path) for csv_path in file_paths]

This helps encapsulate your desired behavior in a helper function and reduces the things you need to think about.

answered Aug 14, 2018 at 16:27

PMende

5,4903 gold badges21 silver badges26 bronze badges

Collectives™ on Stack Overflow

adding columns to dataframe based on file name in python

3 Answers 3

2 Comments

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related