I've got a large .csv file (5 GB) from the UK Land Registry. I need to find all real estate that has been bought or sold two or more times.
Each row of the table looks like this:
{F887F88E-7D15-4415-804E-52EAC2F10958},"70000","1995-07-07 00:00","MK15 9HP","D","N","F","31","","ALDRICH DRIVE","WILLEN","MILTON KEYNES","MILTON KEYNES","MILTON KEYNES","A","A"
I've never used pandas or any data science library. So far I've come up with this plan:
- Load the .csv file and add headers and column names
- Drop unnecessary columns
- Create a hashmap of the edited df and find duplicates
- Export the duplicates to a new .csv file
From my research I found that pandas is bad with very big files, so I used Dask:
import dask.dataframe as dd

df = dd.read_csv('pp-complete.csv', header=None, dtype={7: 'object', 8: 'object'}).astype(str)
df.columns = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
df.head()
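Since I only need a few columns anyway, I also wondered whether I could avoid loading the unnecessary ones in the first place. This is just a sketch of what I had in mind; the set of columns to keep is my guess, and I haven't verified it:

import dask.dataframe as dd

all_columns = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration',
               'Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County',
               'PPDType', 'Rec_Stat']
# Read only the address-related columns; names/usecols are passed through to pandas.read_csv
address_columns = ['Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County', 'ZIP']
df = dd.read_csv('pp-complete.csv', header=None, names=all_columns,
                 usecols=address_columns, dtype=str)
df.head()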
- After that, I tried to delete the unnecessary columns:
df.drop('ID', axis=1).head()
I also tried:
indexes_to_remove = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16]
for index in indexes_to_remove:
    df.drop(df.index[index], axis=1)
Nothing worked.
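My understanding now is that drop doesn't modify the dataframe in place (and Dask is lazy), so I probably need to assign the result back. This is roughly what I mean; the list of columns to drop is just my guess at the unnecessary ones:

# Drop by column name and keep the returned dataframe (drop is not in-place)
unneeded = ['ID', 'Price', 'Date', 'PropType', 'Old/new', 'Duration', 'PPDType', 'Rec_Stat']
df = df.drop(columns=unneeded)
df.head()   # head() computes a small sample, so I can check the result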
The task is to show every property that has been bought or sold two or more times. I decided to use only the address columns, because the data in the other columns isn't consistent across sales of the same property (the ID is a unique transaction code, and the date, type of offer, etc. differ between sales).
I need to do this task with minimal memory and CPU usage, which is why I went with a hashmap.
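To be concrete, what I mean by the hashmap step is: group the rows by the address columns and keep only the addresses that appear two or more times. In Dask I think that looks roughly like this (a sketch using the column names I assigned above; I'm not sure it's the most efficient way):

# Count transactions per address and keep addresses that occur at least twice
address_columns = ['Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County', 'ZIP']
counts = df.groupby(address_columns).size().reset_index()
counts = counts.rename(columns={0: 'Sales'})
repeated = counts[counts['Sales'] >= 2]
# Dask writes one CSV per partition when the name contains '*'
repeated.to_csv('repeated-sales-*.csv', index=False)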
I don't know if there's another method that would be easier or more efficient.