
I have hundreds of text files with thousands of records that look like this (6 GB of weather data from NOAA): '0175690150931212019010100567+34300-116166FM-15+06...' Each position in the string has unique info and I separate it into columns with the function below:

import pandas as pd

def convert_to_df(path):
    df = pd.read_csv(path, low_memory=False, header=None, names=['Data'])
    df_clean = pd.DataFrame(columns=['TVC', 'USAF', 'WBAN', 'DATE', 'TIME', 'SOURCE', 'LAT', 'LONG',
                                     'TYPE', 'ELEV', 'FWSID', 'MPOQC', 'WIND_ANGLE', ...])
    for record, data in enumerate(df['Data']):
        TVC = df['Data'][record][0:4]
        USAF = df['Data'][record][4:10]
        WBAN = df['Data'][record][10:15]
        DATE = df['Data'][record][15:23]
        TIME = df['Data'][record][23:27]
        ...
        ATM_PRESSURE = df['Data'][record][99:104]
        ATM_PRESSURE_QC = df['Data'][record][104:105]

        clean_dict = {'TVC': TVC, 'USAF': USAF, 'WBAN': WBAN, 'DATE': DATE,
                      'TIME': TIME, 'SOURCE': SOURCE, 'LAT': LAT, 'LONG': LONG,
                      'TYPE': TYPE, 'ELEV': ELEV, ...}
        df_clean = df_clean.append(clean_dict, ignore_index=True)

It ends up looking something like this, though with many more rows and 31 columns:

TVC USAF WBAN DATE TIME
0175 690150 93121 20190101 0056
0175 690150 93121 20190101 0156

I am using multiprocessing, but it takes quite some time to get through everything (20 hours or so). Just curious if there is a more efficient/faster way to separate out the positional data in each string into columns.

1 Answer

You can perform string slicing on the entire initial DataFrame at once:

TVC = df["Data"].str[0:4]
USAF = df["Data"].str[4:10]
WBAN = df["Data"].str[10:15]
...

This is tedious, so you could take your column names and a list of slice boundaries:

columns = ['TVC', 'USAF', 'WBAN', 'DATE', 'TIME', 'SOURCE', ...]
slices = [0, 4, 10, 15, 23, 27, ...]
slices = zip(slices, slices[1:])  # (0, 4), (4, 10), (10, 15), ...

clean_dict = {}
for key, (start, end) in zip(columns, slices):
    clean_dict[key] = df["Data"].str[start:end]
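The loop above leaves you with a dict of Series, one per column, which pandas can assemble into a DataFrame in a single step, with no per-row `append`. A minimal end-to-end sketch, using shortened made-up sample records and only the first five (illustrative) slice boundaries:

```python
import pandas as pd

# Hypothetical shortened records; real NOAA ISD strings are much longer
df = pd.DataFrame({"Data": ["0175690150931212019010100561",
                            "0175690150931212019010101561"]})

columns = ["TVC", "USAF", "WBAN", "DATE", "TIME"]  # truncated for illustration
bounds = [0, 4, 10, 15, 23, 27]

# One vectorized .str[] slice per column instead of one slice per record
clean_dict = {
    col: df["Data"].str[start:end]
    for col, (start, end) in zip(columns, zip(bounds, bounds[1:]))
}
df_clean = pd.DataFrame(clean_dict)
```

Because each `.str[start:end]` call operates on the whole column at once, the per-record Python loop (and the cost of repeatedly growing `df_clean` with `append`) disappears entirely.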

3 Comments

@Thomas, I'd love to hear about what kind of performance increase this ends up giving you. 20 hours is atrocious.
It ended up taking about 30 seconds to a minute.
That's great to hear. Glad to help.
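As an aside, pandas also ships a purpose-built reader for this kind of file: `pd.read_fwf` slices fixed-width fields at parse time, so the intermediate single-column DataFrame isn't needed at all. A sketch with the same illustrative boundaries and shortened sample records (the real record has 31 columns):

```python
import io
import pandas as pd

# Illustrative boundaries and names only; not the full NOAA layout
colspecs = [(0, 4), (4, 10), (10, 15), (15, 23), (23, 27)]
names = ["TVC", "USAF", "WBAN", "DATE", "TIME"]

raw = "0175690150931212019010100561\n0175690150931212019010101561\n"

# read_fwf splits each line into fixed-width columns while reading
df_clean = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, names=names,
                       header=None, dtype=str)
```

Passing `dtype=str` keeps fields like `WBAN` and `TIME` from being coerced to integers and losing their leading zeros.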
