
I have a dataset containing historical pricing data for a number of client IDs. Essentially, it is a spreadsheet with two columns as primary keys (id, p_date). Please find below a snippet (first 5 rows) of the dataset:

{'id': {0: '038af19179925da21a25619c5a24b745',
1: '038af19179925da21a25619c5a24b745',
2: '038af19179925da21a25619c5a24b745',
3: '038af19179925da21a25619c5a24b745',
4: '038af19179925da21a25619c5a24b745'},
'p_date': {0: '2015-01-01',
1: '2015-02-01',
2: '2015-03-01',
3: '2015-04-01',
4: '2015-05-01'},
'p_p1_fix': {0: 44.26693116,
1: 44.26693116,
2: 44.26693116,
3: 44.26693116,
4: 44.26693116},
'p_p1_var': {0: 0.151367,
1: 0.151367,
2: 0.151367,
3: 0.149626,
4: 0.149626},
'p_p2_fix': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'p_p2_var': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'p_p3_fix': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'p_p3_var': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}}

There are up to 12 data points per client id, corresponding to the 12 months of a year. I would like to use the individual pricing data points as separate features in a classification model, so I need to turn them into separate columns. What is the most efficient way to do this? The spreadsheet contains data for over 20K users. I have written the script below, which has been running for more than an hour and still hasn't completed:

import pandas as pd

def transform_hist(hist):
    # The 12 month-start dates for 2015.
    dates = ['2015-%02d-01' % m for m in range(1, 13)]
    ids = hist['id'].unique()
    rows_list = []
    for curr_id in ids:
        for curr_date in dates:
            # Boolean scan of the whole frame for every (id, date) pair;
            # this repeated filtering is what makes the script so slow.
            temp = hist[(hist.id == curr_id) & (hist.p_date == curr_date)]
            if len(temp) > 0:  # fixed from the original typo len(temp > 0)
                # .iloc[0] takes the scalar value rather than a one-row Series
                rows_list.append({'id': curr_id,
                                  'p1_var_' + curr_date[5:7]: temp['p_p1_var'].iloc[0],
                                  'p2_var_' + curr_date[5:7]: temp['p_p2_var'].iloc[0],
                                  'p3_var_' + curr_date[5:7]: temp['p_p3_var'].iloc[0],
                                  'p1_fix_' + curr_date[5:7]: temp['p_p1_fix'].iloc[0],
                                  'p2_fix_' + curr_date[5:7]: temp['p_p2_fix'].iloc[0],
                                  'p3_fix_' + curr_date[5:7]: temp['p_p3_fix'].iloc[0]})
    df = pd.DataFrame(rows_list)
    # Collapse the up-to-12 partial rows per id into a single row.
    flat_df = df.groupby(['id']).sum()
    return flat_df.reset_index()

The expected output would be a dataframe with 73 columns (id plus 12*6 columns, one for each of the 6 prices in each of the 12 months) and one row per id (currently there are up to 12 rows per id). Below is an example of the expected output for a single id:

{'id': {0: '0002203ffbb812588b632b9e628cc38d'},
'p_p1_fix_01': {0: 40.56596939999999},
'p_p1_fix_02': {0: 40.56596939999999},
'p_p1_fix_03': {0: 40.728885},
'p_p1_fix_04': {0: 40.728885},
'p_p1_fix_05': {0: nan},
'p_p1_fix_06': {0: nan},
'p_p1_fix_07': {0: nan},
'p_p1_fix_08': {0: nan},
'p_p1_fix_09': {0: nan},
'p_p1_fix_10': {0: nan},
'p_p1_fix_11': {0: nan},
'p_p1_fix_12': {0: nan},
'p_p1_var_01': {0: 0.12609800000000002},
'p_p1_var_02': {0: 0.12609800000000002},
'p_p1_var_03': {0: 0.12806700000000001},
'p_p1_var_04': {0: 0.12806700000000001},
'p_p1_var_05': {0: nan},
'p_p1_var_06': {0: nan},
'p_p1_var_07': {0: nan},
'p_p1_var_08': {0: nan},
'p_p1_var_09': {0: nan},
'p_p1_var_10': {0: nan},
'p_p1_var_11': {0: nan},
'p_p1_var_12': {0: nan},
'p_p2_fix_01': {0: 24.33958068},
'p_p2_fix_02': {0: 24.33958068},
'p_p2_fix_03': {0: 24.43733004},
'p_p2_fix_04': {0: 24.43733004},
'p_p2_fix_05': {0: nan},
'p_p2_fix_06': {0: nan},
'p_p2_fix_07': {0: nan},
'p_p2_fix_08': {0: nan},
'p_p2_fix_09': {0: nan},
'p_p2_fix_10': {0: nan},
'p_p2_fix_11': {0: nan},
'p_p2_fix_12': {0: nan},
'p_p2_var_01': {0: 0.103975},
'p_p2_var_02': {0: 0.103975},
'p_p2_var_03': {0: 0.105842},
'p_p2_var_04': {0: 0.105842},
'p_p2_var_05': {0: nan},
'p_p2_var_06': {0: nan},
'p_p2_var_07': {0: nan},
'p_p2_var_08': {0: nan},
'p_p2_var_09': {0: nan},
'p_p2_var_10': {0: nan},
'p_p2_var_11': {0: nan},
'p_p2_var_12': {0: nan},
'p_p3_fix_01': {0: 16.22638872},
'p_p3_fix_02': {0: 16.22638872},
'p_p3_fix_03': {0: 16.29155496},
'p_p3_fix_04': {0: 16.29155496},
'p_p3_fix_05': {0: nan},
'p_p3_fix_06': {0: nan},
'p_p3_fix_07': {0: nan},
'p_p3_fix_08': {0: nan},
'p_p3_fix_09': {0: nan},
'p_p3_fix_10': {0: nan},
'p_p3_fix_11': {0: nan},
'p_p3_fix_12': {0: nan},
'p_p3_var_01': {0: 0.070232},
'p_p3_var_02': {0: 0.070232},
'p_p3_var_03': {0: 0.073773},
'p_p3_var_04': {0: 0.073773},
'p_p3_var_05': {0: nan},
'p_p3_var_06': {0: nan},
'p_p3_var_07': {0: nan},
'p_p3_var_08': {0: nan},
'p_p3_var_09': {0: nan},
'p_p3_var_10': {0: nan},
'p_p3_var_11': {0: nan},
'p_p3_var_12': {0: nan}}

How can I rewrite the script so that it does the job faster?

Comments:

  • What does your expected output look like? Commented Apr 1, 2018 at 20:20
  • Welcome to SO. Care to provide a minimal reproducible example? In other words, no images, no links, just text that we can copy/paste and use in a working example ourselves? [Hint: try df.head().to_dict() and fix your code indentation.] Commented Apr 1, 2018 at 21:34
  • @jpp Thanks - I added example input and expected output. Commented Apr 1, 2018 at 22:13

2 Answers


Luckily this can be done very simply:

df = df.pivot(index='id', columns='p_date')

This will reshape the df using hierarchical (MultiIndex) columns. To flatten the hierarchy into a single level, you can join the level names of each column:

df.columns = [' '.join(col).strip() for col in df.columns.values]

This sets each column name to the join of its level names from the hierarchical index built in the previous step. Now you have a column for each combination of date and value column.
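For illustration, below is a minimal runnable sketch of both steps on a tiny frame shaped like the question's data (the ids and values are made up, and the month-suffix renaming mimics the question's expected column names instead of the plain space-join above):

import pandas as pd

# Toy long-format frame: two months of prices for two clients.
hist = pd.DataFrame({
    'id':       ['a', 'a', 'b', 'b'],
    'p_date':   ['2015-01-01', '2015-02-01', '2015-01-01', '2015-02-01'],
    'p_p1_fix': [44.27, 44.27, 40.57, 40.73],
    'p_p1_var': [0.1514, 0.1496, 0.1261, 0.1281],
})

# One pivot call replaces the whole per-id/per-date filtering loop.
wide = hist.pivot(index='id', columns='p_date')

# Flatten the (value column, date) MultiIndex; keeping only the month
# part of each date reproduces names like p_p1_fix_01.
wide.columns = ['{}_{}'.format(val, date[5:7]) for val, date in wide.columns]
wide = wide.reset_index()
# Columns are now: id, p_p1_fix_01, p_p1_fix_02, p_p1_var_01, p_p1_var_02;
# months missing for a client simply come out as NaN.

Missing months become NaN automatically, matching the NaNs in the expected output, and since pivot raises on duplicate (id, p_date) pairs it also doubles as a check that those two columns really are a primary key.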




If I understand correctly, you are looking for filter + groupby:

# dates: the 12 month-start strings defined in the question
hist['p_date'] = pd.to_datetime(hist['p_date'])
s = hist.loc[hist['p_date'].isin(dates), :]
s.groupby([s['id'], s['p_date'].dt.month]).sum().reset_index()
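For reference, here is a runnable sketch of what this snippet produces, assuming dates is the list of twelve month-start strings from the question (ids and values below are made up). Note that the result still has one row per (id, month) pair rather than the wide one-row-per-id layout the question asks for:

import pandas as pd

# The 12 month-start strings from the question, converted to datetimes
# so the isin() comparison is between like types.
dates = pd.to_datetime(['2015-%02d-01' % m for m in range(1, 13)])

hist = pd.DataFrame({
    'id':       ['a', 'a'],
    'p_date':   ['2015-01-01', '2015-02-01'],
    'p_p1_fix': [44.27, 44.27],
    'p_p1_var': [0.1514, 0.1496],
})

hist['p_date'] = pd.to_datetime(hist['p_date'])
s = hist.loc[hist['p_date'].isin(dates), :]

# Selecting the value columns explicitly keeps newer pandas versions
# from trying to sum the non-numeric id and date columns.
out = (s.groupby([s['id'], s['p_date'].dt.month])[['p_p1_fix', 'p_p1_var']]
        .sum()
        .reset_index())
# out has columns: id, p_date (month number), p_p1_fix, p_p1_var,
# i.e. still one row per (id, month) pair.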

1 Comment

Thanks, but I'm not sure that this is what I need. I edited the post to include info on the expected output.
