
I have a large CSV file; suppose it looks like this:

ID,PostCode,Value
H1A0A1-00,H1A0A1,0
H1A0A1-01,H1A0A1,0
H1A0A1-02,H1A0A1,0
H1A0A1-03,H1A0A1,0
H1A0A1-04,H1A0A1,1
H1A0A1-05,H1A0A1,0
H1A1G7-0,H1A1G7,0
H1A1G7-1,H1A1G7,0
H1A1G7-2,H1A1G7,0
H1A1N6-00,H1A1N6,0
H1A1N6-01,H1A1N6,0
H1A1N6-02,H1A1N6,0
H1A1N6-03,H1A1N6,0
H1A1N6-04,H1A1N6,0
H1A1N6-05,H1A1N6,0
...

I want to split it up by PostCode and save all rows with the same postal code to a separate CSV. I have tried:

# data is the full pandas DataFrame read from the large CSV
postals = data['PostCode'].unique()
for p in postals:
    df = data[data['PostCode'] == p]
    df.to_csv(directory + '/output/demographics/' + p + '.csv', header=False, index=False)

Is there a way to do this using Dask to leverage multiprocessing? Thanks


1 Answer


In case you want to save to Parquet, it's pretty easy:

Parquet

import dask.dataframe as dd
import pandas as pd
import os 

fldr = 'data_pq'
# data is the dask DataFrame holding the full dataset
data.to_parquet(fldr, partition_on="PostCode")

This saves the data for every postcode inside a folder called PostCode=xxxxxxx, which contains as many files as there are partitions in your dask.dataframe.
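For completeness, here is a minimal sketch of the full round trip; the file name large.csv, the dtypes, and the example postcode are assumptions, and reading a single partition back relies on the filters argument of dd.read_parquet:

import dask.dataframe as dd

# read the large CSV lazily as a dask DataFrame (assumed file name)
data = dd.read_csv("large.csv", dtype={"ID": str, "PostCode": str, "Value": "int64"})

# write one folder per postcode: data_pq/PostCode=H1A0A1/..., data_pq/PostCode=H1A1G7/..., etc.
data.to_parquet("data_pq", partition_on="PostCode")

# a single postcode can later be read back without scanning the rest
one_pc = dd.read_parquet("data_pq", filters=[("PostCode", "==", "H1A0A1")])
print(one_pc.compute())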

CSV

Here I suggest you use a custom function, write_file.

import dask.dataframe as dd
import pandas as pd
import os 

fldr = "data_csv"
os.makedirs(fldr, exist_ok=True)


def write_file(grp):
    # each group holds all rows that share one postcode
    pc = grp["PostCode"].unique()[0]
    grp.to_csv(f"{fldr}/{pc}.csv",
               header=False,
               index=False)
    return None


data.groupby("PostCode")\
    .apply(write_file, meta=('x', 'f8'))\
    .compute()

# the same function works for pandas df too
# data.groupby("PostCode").apply(write_file)

You should check how it performs and possibly experiment with the scheduler.
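For example, a rough sketch of switching schedulers (the names below are the standard dask scheduler options, not something specific to this answer; "threads" is the default for dask DataFrames, "processes" avoids the GIL at the cost of serializing each group, and "synchronous" is useful for debugging):

data.groupby("PostCode")\
    .apply(write_file, meta=('x', 'f8'))\
    .compute(scheduler="processes")   # or "threads" / "synchronous"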
