Let's say I have a pandas dataframe from which I will need to repeatedly query a subset of rows. I'd like to wrap this in a function. The queries will vary, interrogating any number of columns. The operator for each column will always be the same. I'm considering something like this:
df = pd.DataFrame({'A': list('aabbccddeeff'), 'B': list('aaaabbbbcccc'),
'C': np.random.randint(5, size=12),
'D': np.random.randint(9, size=12)})
def query_df(df, **kwds):
a_val = kwds.get('a', None)
b_val = kwds.get('b', None)
c_val = kwds.get('c', None)
d_val = kwds.get('d', None)
query = 'A in {0} and B == {1} and C > {2} and D < {3}'.format(a_val, b_val, c_val, d_val)
return df.query(query)
query_dict = {'a':['a', 'b', 'c', 'd'], 'b':'a', 'c':0, 'd':8}
print(query_df(df, **query_dict))
A B C D
1 a a 1 6
Although this works, it doesn't allow queries directed to e.g. just columns A and C. All columns are hard-coded into the query string! How can I make this more flexible so that e.g. the following would also work:
query_df(df, {'a':['a', 'b', 'c', 'd'], 'b':'a'})
query_df(df, {'b':'a', 'c':6})
query_df(df, {'d':4})
Thanks in advance!