I've got a large .csv file (5 GB) from the UK Land Registry. I need to find all real estate that has been bought or sold two or more times.
Each row of the table looks like this:
{F887F88E-7D15-4415-804E-52EAC2F10958},"70000","1995-07-07 00:00","MK15 9HP","D","N","F","31","","ALDRICH DRIVE","WILLEN","MILTON KEYNES","MILTON KEYNES","MILTON KEYNES","A","A"
I've never used pandas or any data science library. So far I've come up with this plan:
- Load the .csv file and add headers and column names
- Drop unnecessary columns
- Create a hashmap of the edited df and find duplicates
- Export the duplicates to a new .csv file
From my research I found that pandas is bad with very big files, so I used Dask:
import dask.dataframe as dd

df = dd.read_csv('pp-complete.csv', header=None, dtype={7: 'object', 8: 'object'}).astype(str)
df.columns = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
df.head()
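Since I only need a few columns anyway, I also wondered whether I could avoid loading the unnecessary ones in the first place. This is just a sketch of what I had in mind; the set of columns to keep is my guess, and I haven't verified it:

import dask.dataframe as dd

all_columns = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration',
               'Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County',
               'PPDType', 'Rec_Stat']
# Read only the address-related columns; names/usecols are passed through to pandas.read_csv
address_columns = ['Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County', 'ZIP']
df = dd.read_csv('pp-complete.csv', header=None, names=all_columns,
                 usecols=address_columns, dtype=str)
df.head()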
- After that, I tried to delete the unnecessary columns:
df.drop('ID', axis=1).head()
I also tried:
indexes_to_remove = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16]
for index in indexes_to_remove:
    df.drop(df.index[index], axis=1)
Nothing worked.
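My understanding now is that drop doesn't modify the dataframe in place (and Dask is lazy), so I probably need to assign the result back. This is roughly what I mean; the list of columns to drop is just my guess at the unnecessary ones:

# Drop by column name and keep the returned dataframe (drop is not in-place)
unneeded = ['ID', 'Price', 'Date', 'PropType', 'Old/new', 'Duration', 'PPDType', 'Rec_Stat']
df = df.drop(columns=unneeded)
df.head()   # head() computes a small sample, so I can check the result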
The task is to show every property that has been bought or sold two or more times. I decided to use only the address columns, because the data in the other columns isn't consistent across sales of the same property (the ID is a unique transaction code, and the date, type of offer, etc. differ between sales).
I need to do this task with minimal memory and CPU usage, which is why I went with a hashmap.
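To be concrete, what I mean by the hashmap step is: group the rows by the address columns and keep only the addresses that appear two or more times. In Dask I think that looks roughly like this (a sketch using the column names I assigned above; I'm not sure it's the most efficient way):

# Count transactions per address and keep addresses that occur at least twice
address_columns = ['Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County', 'ZIP']
counts = df.groupby(address_columns).size().reset_index()
counts = counts.rename(columns={0: 'Sales'})
repeated = counts[counts['Sales'] >= 2]
# Dask writes one CSV per partition when the name contains '*'
repeated.to_csv('repeated-sales-*.csv', index=False)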
I don't know if there's another method that would be easier or more efficient.