
I have 703 tab-separated text files of shape (X, 4), where X can be any positive number, the largest value being 217632347. For example, three of the files look like:

###File ID_739.txt
region   latitude    department      product
  NY        71           HR             -

###File ID_618.txt
region   latitude    department      product
  LA        91           R&D            -

###File ID_917.txt
region   latitude    department      product
  NY        71           HR

I want a dataframe (maybe pandas or numpy) which looks like:


region    latitude      ID_739     ID_618        ID_917
  NY         71            1           0            1
  LA         91            0           1            0

So in a way I am looking for one-hot encoding, whereby I put a 1 under the columns for which region and latitude are the same. For example, ID_739 and ID_917 have the same region and latitude, so they get a 1, and ID_618 gets a 0. I have 703 files, which means my final dataframe will be of shape (X, 705). It's 705 because each file becomes a column, plus region and latitude. How can I do that efficiently, considering each text file has lots of lines? Insights will be appreciated.


2 Answers


First create one big DataFrame with a column New holding each file name, then aggregate by joining the names per (region, latitude) group, so one-hot encoding is possible with Series.str.get_dummies:

import glob
import os

import pandas as pd

files = glob.glob('files/*.txt')
# read each tab-separated file and tag its rows with the file name (without extension)
dfs = [pd.read_csv(fp, sep='\t').assign(New=os.path.basename(fp).split('.')[0])
       for fp in files]

df = (pd.concat(dfs, ignore_index=True)
        .groupby(['region','latitude'])['New']  # group identical (region, latitude) pairs
        .agg('|'.join)                          # join file names per group, e.g. 'ID_739|ID_917'
        .str.get_dummies()                      # one indicator column per file name
        .reset_index())
print(df)
  region  latitude  ID_618  ID_739  ID_917
0     LA        91       1       0       0
1     NY        71       0       1       1
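
To expand on the comment below asking about the last lines: here is a minimal sketch of the intermediate result, using a small hand-built frame that stands in for the concatenated files (the values mirror the three sample files in the question). The groupby/agg step joins the file names per (region, latitude) pair into one string, and Series.str.get_dummies then splits that string on '|' into one indicator column per file name:

import pandas as pd

# hypothetical stand-in for pd.concat(dfs, ignore_index=True) above
df = pd.DataFrame({'region': ['NY', 'LA', 'NY'],
                   'latitude': [71, 91, 71],
                   'New': ['ID_739', 'ID_618', 'ID_917']})

joined = df.groupby(['region', 'latitude'])['New'].agg('|'.join)
print(joined)
# region  latitude
# LA      91                  ID_618
# NY      71           ID_739|ID_917
# Name: New, dtype: object

# splitting on '|' yields the one-hot columns ID_618, ID_739, ID_917
print(joined.str.get_dummies())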

2 Comments

Can you briefly explain what you are doing in the last line?
@John - added some explanation

Assuming your files are in the current directory, you can use glob to read them and merge with the dummies obtained from the "latitude" column:

from glob import glob

import pandas as pd

files = glob('ID_*.txt')
# read each file, keyed by its name without the .txt extension
df = pd.concat({f[:-4]: pd.read_csv(f, sep=r'\s+') for f in files}).droplevel(1)

# dummies of "latitude" are indexed by file name; transposing gives one column per file
(df[['region', 'latitude']].merge(pd.get_dummies(df['latitude']).T,
                                  left_on='latitude',
                                  right_index=True,
                                  )
                           .drop_duplicates('latitude')
)

output:

       region  latitude  ID_739  ID_917  ID_618
ID_739     NY        71       1       1       0
ID_618     LA        91       0       0       1
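
In case it helps, a small sketch of what pd.get_dummies(df['latitude']).T looks like before the merge, again with hand-built stand-in data mirroring the three sample files: the dummies are computed per row (one row per file, indexed by file name), and transposing puts the latitudes on the index and the file names on the columns, which is what the merge on latitude expects.

import pandas as pd

# hypothetical stand-in for df after the pd.concat(...).droplevel(1) step above
df = pd.DataFrame({'region': ['NY', 'LA', 'NY'],
                   'latitude': [71, 91, 71]},
                  index=['ID_739', 'ID_618', 'ID_917'])

print(pd.get_dummies(df['latitude']).T)
#     ID_739  ID_618  ID_917
# 71       1       0       1
# 91       0       1       0
# (newer pandas versions may print these as True/False booleans)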

4 Comments

crosstab doesn't create one-hot encoding
@jezrael I don't understand your comment, this works for me
Yes, because of the sample data. crosstab is used for counts, not for one-hot encoding
@jezrael I had missed a point in the question, thanks. Should be fixed now
