I have 703 tab-separated text files of shape (X, 4), where X can be any positive number, the largest being 217632347. For example, three of the files look like:
###File ID_739.txt
region latitude department product
NY 71 HR -
###File ID_618.txt
region latitude department product
LA 91 R&D -
###File ID_917.txt
region latitude department product
NY 71 HR
I want a dataframe (maybe pandas or numpy) which looks like:
region latitude ID_739 ID_618 ID_917
NY 71 1 0 1
LA 91 0 1 0
So in a way I am looking for a one-hot encoding: for each (region, latitude) row, I put a 1 under the column of every file that contains that region and latitude, and a 0 otherwise. For example, ID_739 and ID_917 have the same region and latitude (NY, 71), so they both get a 1 in that row, and ID_618 gets a 0. I have 703 files, which means my final dataframe will be of shape (X, 705): one column per file, plus region and latitude. How can I do this efficiently, considering that each text file has a lot of lines? Any insights will be appreciated.
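To make the target concrete, here is a minimal sketch of the transformation on the three sample files. It is only one possible approach, not a tuned solution: the helper name `build_indicator_table` and the `sources` dict are made up for illustration, the files are assumed to have a header row with `region` and `latitude` columns, and in-memory strings stand in for the real files. It reads only the two key columns, drops duplicate keys per file, and cross-tabulates keys against file IDs.

```python
import io

import pandas as pd


def build_indicator_table(sources):
    """Build a 0/1 table of (region, latitude) keys versus file IDs.

    `sources` is a hypothetical dict mapping a file ID (e.g. "ID_739")
    to a path or file-like object holding tab-separated data.
    """
    frames = []
    for file_id, src in sources.items():
        # Read only the two key columns and keep each key once per file.
        keys = pd.read_csv(src, sep="\t", usecols=["region", "latitude"])
        keys = keys.drop_duplicates()
        keys["file_id"] = file_id
        frames.append(keys)

    long = pd.concat(frames, ignore_index=True)

    # Count occurrences of each key per file; after de-duplication the
    # counts can only be 0 or 1.
    table = pd.crosstab([long["region"], long["latitude"]], long["file_id"])

    # Keep the file columns in the order the files were supplied.
    table = table.reindex(columns=list(sources)).reset_index()
    table.columns.name = None
    return table


# In-memory stand-ins for the three sample files shown above.
header = "region\tlatitude\tdepartment\tproduct\n"
sample = {
    "ID_739": io.StringIO(header + "NY\t71\tHR\t-\n"),
    "ID_618": io.StringIO(header + "LA\t91\tR&D\t-\n"),
    "ID_917": io.StringIO(header + "NY\t71\tHR\n"),
}

result = build_indicator_table(sample)
print(result)
```

With real files, `sources` would map each ID to its path instead of a `StringIO`. For the largest files, reading everything in one call may not fit in memory; the `chunksize` parameter of `pd.read_csv` could be used to de-duplicate keys chunk by chunk, though I have not measured whether that is fast enough here.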