
I have hundreds of text files with thousands of records that look like this (6 GB of weather data from NOAA): '0175690150931212019010100567+34300-116166FM-15+06...' Each position in the string has unique info and I separate it into columns with the function below:

import pandas as pd

def convert_to_df(path):
    df = pd.read_csv(path, low_memory=False, header=None, names=['Data'])
    df_clean = pd.DataFrame(columns=['TVC', 'USAF', 'WBAN', 'DATE', 'TIME', 'SOURCE', 'LAT', 'LONG',
                                     'TYPE', 'ELEV', 'FWSID', 'MPOQC', 'WIND_ANGLE', ...])
    for record, data in enumerate(df['Data']):
        TVC = df['Data'][record][0:4]
        USAF = df['Data'][record][4:10]
        WBAN = df['Data'][record][10:15]
        DATE = df['Data'][record][15:23]
        TIME = df['Data'][record][23:27]
        ...
        ATM_PRESSURE = df['Data'][record][99:104]
        ATM_PRESSURE_QC = df['Data'][record][104:105]

        clean_dict = {'TVC': TVC, 'USAF': USAF, 'WBAN': WBAN, 'DATE': DATE,
                      'TIME': TIME, 'SOURCE': SOURCE, 'LAT': LAT, 'LONG': LONG,
                      'TYPE': TYPE, 'ELEV': ELEV, ...}
        df_clean = df_clean.append(clean_dict, ignore_index=True)

It ends up looking something like this, though with many more rows and 31 columns:

TVC USAF WBAN DATE TIME
0175 690150 93121 20190101 0056
0175 690150 93121 20190101 0156

I am using multiprocessing, but it takes quite some time to get through everything (20 hours or so). Just curious if there is a more efficient/faster way to separate out the positional data in each string into columns.

1 Answer

You can perform string slicing on the entire initial DataFrame at once:

TVC = df["Data"].str[0:4]
USAF = df["Data"].str[4:10]
WBAN = df["Data"].str[10:15]
...

This is tedious, so you could take your column names and a list of slice boundaries:

columns = ['TVC', 'USAF', 'WBAN', 'DATE', 'TIME', 'SOURCE', ...]
slices = [0, 4, 10, 15, 23, 27, ...]
slices = zip(slices, slices[1:])  # (0, 4), (4, 10), (10, 15), ...

clean_dict = {}
for key, (start, end) in zip(columns, slices):
    clean_dict[key] = df["Data"].str[start:end]
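The loop above leaves you with a dict of Series, one per column, which pandas can assemble into a DataFrame in a single step, with no per-row `append`. A minimal end-to-end sketch, using shortened made-up sample records and only the first five (illustrative) slice boundaries:

```python
import pandas as pd

# Hypothetical shortened records; real NOAA ISD strings are much longer
df = pd.DataFrame({"Data": ["0175690150931212019010100561",
                            "0175690150931212019010101561"]})

columns = ["TVC", "USAF", "WBAN", "DATE", "TIME"]  # truncated for illustration
bounds = [0, 4, 10, 15, 23, 27]

# One vectorized .str[] slice per column instead of one slice per record
clean_dict = {
    col: df["Data"].str[start:end]
    for col, (start, end) in zip(columns, zip(bounds, bounds[1:]))
}
df_clean = pd.DataFrame(clean_dict)
```

Because each `.str[start:end]` call operates on the whole column at once, the per-record Python loop (and the cost of repeatedly growing `df_clean` with `append`) disappears entirely.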

3 Comments

@Thomas, I'd love to hear about what kind of performance increase this ends up giving you. 20 hours is atrocious.
It ended up taking about 30 seconds to a minute.
That's great to hear. Glad to help.
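As an aside, pandas also ships a purpose-built reader for this kind of file: `pd.read_fwf` slices fixed-width fields at parse time, so the intermediate single-column DataFrame isn't needed at all. A sketch with the same illustrative boundaries and shortened sample records (the real record has 31 columns):

```python
import io
import pandas as pd

# Illustrative boundaries and names only; not the full NOAA layout
colspecs = [(0, 4), (4, 10), (10, 15), (15, 23), (23, 27)]
names = ["TVC", "USAF", "WBAN", "DATE", "TIME"]

raw = "0175690150931212019010100561\n0175690150931212019010101561\n"

# read_fwf splits each line into fixed-width columns while reading
df_clean = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, names=names,
                       header=None, dtype=str)
```

Passing `dtype=str` keeps fields like `WBAN` and `TIME` from being coerced to integers and losing their leading zeros.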
