
I have 703 tab-separated text files of shape (X, 4), where X can be any positive number, the largest value being 217632347. For example, three of the files look like:

###File ID_739.txt
region   latitude    department      product
  NY        71           HR             -

###File ID_618.txt
region   latitude    department      product
  LA        91           R&D            -

###File ID_917.txt
region   latitude    department      product
  NY        71           HR

I want a dataframe (maybe pandas or numpy) which looks like:


region    latitude      ID_739     ID_618        ID_917
  NY         71            1           0            1
  LA         91            0           1            0

So in a way I am looking for one-hot encoding, whereby I put a 1 under the columns for which region and latitude are the same. For example, ID_739 and ID_917 have the same region and latitude, so they get a 1, and ID_618 gets a 0. I have 703 files, which means my final dataframe will be of shape (X, 705). It's 705 because each file becomes a column, plus region and latitude. How can I do that efficiently, considering each text file has lots of lines? Insights will be appreciated.


2 Answers


First create one big DataFrame with a column New holding each file name, then aggregate by joining the names per (region, latitude) group, so one-hot encoding is possible with Series.str.get_dummies:

import glob
import os

import pandas as pd

files = glob.glob('files/*.txt')
# read each tab-separated file and tag its rows with the file name (without extension)
dfs = [pd.read_csv(fp, sep='\t').assign(New=os.path.basename(fp).split('.')[0])
       for fp in files]

df = (pd.concat(dfs, ignore_index=True)
        .groupby(['region','latitude'])['New']  # group identical (region, latitude) pairs
        .agg('|'.join)                          # join file names per group, e.g. 'ID_739|ID_917'
        .str.get_dummies()                      # one indicator column per file name
        .reset_index())
print(df)
  region  latitude  ID_618  ID_739  ID_917
0     LA        91       1       0       0
1     NY        71       0       1       1
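
To expand on the comment below asking about the last lines: here is a minimal sketch of the intermediate result, using a small hand-built frame that stands in for the concatenated files (the values mirror the three sample files in the question). The groupby/agg step joins the file names per (region, latitude) pair into one string, and Series.str.get_dummies then splits that string on '|' into one indicator column per file name:

import pandas as pd

# hypothetical stand-in for pd.concat(dfs, ignore_index=True) above
df = pd.DataFrame({'region': ['NY', 'LA', 'NY'],
                   'latitude': [71, 91, 71],
                   'New': ['ID_739', 'ID_618', 'ID_917']})

joined = df.groupby(['region', 'latitude'])['New'].agg('|'.join)
print(joined)
# region  latitude
# LA      91                  ID_618
# NY      71           ID_739|ID_917
# Name: New, dtype: object

# splitting on '|' yields the one-hot columns ID_618, ID_739, ID_917
print(joined.str.get_dummies())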

2 Comments

Can you briefly explain what you are doing in the last line?
@John - added some explanation

Assuming your files are in the current directory, you can use glob to read them and merge with the dummies obtained from the "latitude" column:

from glob import glob

import pandas as pd

files = glob('ID_*.txt')
# read each file, keyed by its name without the .txt extension
df = pd.concat({f[:-4]: pd.read_csv(f, sep=r'\s+') for f in files}).droplevel(1)

# dummies of "latitude" are indexed by file name; transposing gives one column per file
(df[['region', 'latitude']].merge(pd.get_dummies(df['latitude']).T,
                                  left_on='latitude',
                                  right_index=True,
                                  )
                           .drop_duplicates('latitude')
)

output:

       region  latitude  ID_739  ID_917  ID_618
ID_739     NY        71       1       1       0
ID_618     LA        91       0       0       1
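
In case it helps, a small sketch of what pd.get_dummies(df['latitude']).T looks like before the merge, again with hand-built stand-in data mirroring the three sample files: the dummies are computed per row (one row per file, indexed by file name), and transposing puts the latitudes on the index and the file names on the columns, which is what the merge on latitude expects.

import pandas as pd

# hypothetical stand-in for df after the pd.concat(...).droplevel(1) step above
df = pd.DataFrame({'region': ['NY', 'LA', 'NY'],
                   'latitude': [71, 91, 71]},
                  index=['ID_739', 'ID_618', 'ID_917'])

print(pd.get_dummies(df['latitude']).T)
#     ID_739  ID_618  ID_917
# 71       1       0       1
# 91       0       1       0
# (newer pandas versions may print these as True/False booleans)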

4 Comments

crosstab doesn't create one-hot encoding
@jezrael I don't understand your comment, this works for me
Yes, because of the sample data. crosstab is used for counts, not for one-hot encoding
@jezrael I had missed a point in the question, thanks. Should be fixed now
