I have the following pandas DataFrame:

import pandas as pd

dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'], 'amount': [3, 4, 5, 1, 2, 1], 'front':[21889611, 36357723, 196312, 11, 42, 1992], 'back':[21973805, 36403870, 277500, 19, 120, 3210]}

df1 = pd.DataFrame(dict1)
print(df1)

        file  amount     front      back
0  filename2       3  21889611  21973805
1  filename2       4  36357723  36403870
2  filename3       5    196312    277500
3  filename4       1        11        19
4  filename4       2        42       120
5  filename3       1      1992      3210

My task is to take N random draws between front and back (inclusive), where N is the value in amount, and collect the results in a dictionary.

Doing this row by row is easy for me to understand:

e.g. row 1

import numpy as np
random_draws = np.random.choice(np.arange(21889611, 21973805+1), 3)

e.g. row 2

random_draws = np.random.choice(np.arange(36357723, 36403870+1), 4)

Normally with pandas, users could define this as a function and use something like

def func(front, back, amount):
    return np.random.choice(np.arange(front, back+1), amount)

df["new_column"].apply(func)

but the result of my function is an array of varying size.

My second problem is that I would like the output to be a dictionary, of the format

{file: [random_draw_results], file: [random_draw_results], file: [random_draw_results], ...}

For the above example df1, the function should output this dictionary (given the draws):

final_dict = {"filename2": [21927457, 21966814, 21898538, 36392840, 36375560, 36384078, 36366833],
    "filename3": [212143, 239725, 240959, 197359, 276948, 3199],
    "filename4": [100, 83, 15]}
  • Try to use axis=1 Commented Dec 13, 2017 at 20:04

1 Answer

Passing axis=1 to apply makes it operate row by row. The function receives each row, reads the columns it needs, and returns the draws as a list.

The per-row lists can then be combined per file, either with a groupby or with a defaultdict, as shown below:

dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'], 'amount': [3, 4, 5, 1, 2, 1], 'front':[21889611, 36357723, 196312, 11, 42, 1992], 'back':[21973805, 36403870, 277500, 19, 120, 3210]}

import numpy as np
import pandas as pd

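def func(x):
    # x is one row; draw x.amount values from the inclusive range [x.front, x.back]
    return np.random.choice(np.arange(x.front, x.back+1), x.amount).tolist()

df1 = pd.DataFrame(dict1)
df1["new_column"] = df1.apply(func, axis=1)

# Summing lists concatenates them, giving one flat list per file
df1.groupby('file')['new_column'].apply(sum).to_dict()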

Returns:

{'filename2': [21891765,
  21904680,
  21914414,
  36398355,
  36358161,
  36387670,
  36369443],
 'filename3': [240766, 217580, 217581, 274396, 241413, 2488],
 'filename4': [18, 96, 107]}
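
How a builtin such as sum is handled when passed to a groupby apply may differ between pandas versions, so it can help to spell out the concatenation explicitly. The following is a minimal sketch of that idea, not a replacement for the code above. The names rng, draw and result, the fixed seed, and the use of Generator.integers instead of np.random.choice over np.arange are assumptions made for illustration. Like np.random.choice with its defaults, this samples with replacement.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Same data as in the question.
df1 = pd.DataFrame({
    'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'],
    'amount': [3, 4, 5, 1, 2, 1],
    'front': [21889611, 36357723, 196312, 11, 42, 1992],
    'back': [21973805, 36403870, 277500, 19, 120, 3210],
})

# The seed is an assumption, used only to make repeated runs comparable.
rng = np.random.default_rng(0)

def draw(row):
    # endpoint=True makes `back` inclusive, like np.arange(front, back + 1),
    # without building the full range in memory.
    return rng.integers(row['front'], row['back'],
                        size=row['amount'], endpoint=True).tolist()

df1['new_column'] = df1.apply(draw, axis=1)

# Iterate over the groups and flatten each group's lists explicitly.
result = {
    name: list(chain.from_iterable(lists))
    for name, lists in df1.groupby('file')['new_column']
}
print(result)
```

Each value in result is a flat list whose length equals the summed amount for that file, for example 7 entries for filename2.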

A second alternative (Alt2), which ran about as fast in some small timings I did, is:

from collections import defaultdict

d = defaultdict(list)

for k,v in df1.set_index('file')['new_column'].items():
    d[k].extend(v)
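
The loop above relies on new_column already existing in df1. A self-contained variant that skips the intermediate column could look like the sketch below. The names rng, draws and final_dict are illustrative, and Generator.integers is swapped in for np.random.choice over np.arange as an assumption. It still samples with replacement from the inclusive range.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Same data as in the question.
df1 = pd.DataFrame({
    'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'],
    'amount': [3, 4, 5, 1, 2, 1],
    'front': [21889611, 36357723, 196312, 11, 42, 1992],
    'back': [21973805, 36403870, 277500, 19, 120, 3210],
})

rng = np.random.default_rng()
draws = defaultdict(list)

# itertuples yields one namedtuple per row, with fields named after the columns.
for row in df1.itertuples(index=False):
    # endpoint=True includes `back` in the range that is sampled.
    values = rng.integers(row.front, row.back, size=row.amount, endpoint=True)
    draws[row.file].extend(values.tolist())

final_dict = dict(draws)  # plain dict of {file: [draws...]}
print(final_dict)
```

Because the draws are appended row by row, each file's list keeps the row order of df1.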

2 Comments

Thanks for the help. For Alt2, I get an odd error: KeyError: 'new_column'
You need to add that column first. But the first approach is shorter, so why not use it. Consider marking this as accepted if it answered the question.
