How to pass through a list of queries to a pandas dataframe, and output the list of results?

Question

When selecting rows whose column value column_name equals a scalar, some_value, we use ==:

df.loc[df['column_name'] == some_value]

or use .query()

df.query('column_name == some_value')

In a concrete example:

import pandas as pd
import numpy as np
df = pd.DataFrame({'Col1': 'what are men to rocks and mountains'.split(),
                   'Col2': 'the curves of your lips rewrite history.'.split(),
                   'Col3': np.arange(7),
                   'Col4': np.arange(7) * 8})

print(df)

         Col1      Col2  Col3  Col4
0       what       the     0     0
1        are    curves     1     8
2        men        of     2    16
3         to      your     3    24
4      rocks      lips     4    32
5        and   rewrite     5    40
6  mountains  history      6    48

A query could be

rocks_row = df.loc[df['Col1'] == "rocks"]

which outputs

print(rocks_row)
    Col1  Col2  Col3  Col4
4  rocks  lips     4    32

I would like to pass through a list of values to query against a dataframe, which outputs a list of "correct queries".

The queries to execute would be in a list, e.g.

list_match = ['men', 'curves', 'history']

which would output all rows which meet this condition, i.e.

matches = pd.concat([df1, df2, df3])

where

df1 = df.loc[df['Col1'] == "men"]

df2 = df.loc[df['Col1'] == "curves"]

df3 = df.loc[df['Col1'] == "history"]

My idea would be to create a function that takes in a

output = []
def find_queries(dataframe, column, value, output):
    for scalar in value: 
        query = dataframe.loc[dataframe[column] == scalar]]
        output.append(query)    # append all query results to a list
    return pd.concat(output)    # return concatenated list of dataframes

However, this appears to be exceptionally slow, and doesn't actually take advantage of the pandas data structure. What is the "standard" way to pass through a list of queries through a pandas dataframe?

EDIT: How does this translate into "more complex" queries in pandas? e.g. where with an HDF5 document?

df.to_hdf('test.h5','df',mode='w',format='table',data_columns=['A','B'])

pd.read_hdf('test.h5','df')

pd.read_hdf('test.h5','df',where='A=["foo","bar"] & B=1')

It's not quite clear what do you want to achieve... Do you want to get a list of DFs satisfying different conditions or to get one DF satisfying all of them? — MaxU - stand with Ukraine
– MaxU - stand with Ukraine, Commented Oct 12, 2016 at 8:57
@MaxU This is unclear---apologies. One DF satisfying all conditions. — ShanZhengYang
– ShanZhengYang, Commented Oct 12, 2016 at 16:13

Community · Accepted Answer · 2017-05-23 12:09:02Z

2

If I understood your question correctly you can do it either using boolean indexing as @uhjish has already shown in his answer or using query() method:

In [30]: search_list = ['rocks','mountains']

In [31]: df
Out[31]:
        Col1      Col2  Col3  Col4
0       what       the     0     0
1        are    curves     1     8
2        men        of     2    16
3         to      your     3    24
4      rocks      lips     4    32
5        and   rewrite     5    40
6  mountains  history.     6    48

.query() method:

In [32]: df.query('Col1 in @search_list and Col4 > 40')
Out[32]:
        Col1      Col2  Col3  Col4
6  mountains  history.     6    48

In [33]: df.query('Col1 in @search_list')
Out[33]:
        Col1      Col2  Col3  Col4
4      rocks      lips     4    32
6  mountains  history.     6    48

using boolean indexing:

In [34]: df.ix[df.Col1.isin(search_list) & (df.Col4 > 40)]
Out[34]:
        Col1      Col2  Col3  Col4
6  mountains  history.     6    48

In [35]: df.ix[df.Col1.isin(search_list)]
Out[35]:
        Col1      Col2  Col3  Col4
4      rocks      lips     4    32
6  mountains  history.     6    48

UPDATE: using function:

def find_queries(df, qry, debug=0, **parms):
    if debug:
        print('[DEBUG]: Query:\t' + qry.format(**parms))
    return df.query(qry.format(**parms))

In [31]: find_queries(df, 'Col1 in {Col1} and Col4 > {Col4}', Col1='@search_list', Col4=40)
    ...:
Out[31]:
        Col1      Col2  Col3  Col4
6  mountains  history.     6    48

In [32]: find_queries(df, 'Col1 in {Col1} and Col4 > {Col4}', Col1='@search_list', Col4=10)
Out[32]:
        Col1      Col2  Col3  Col4
4      rocks      lips     4    32
6  mountains  history.     6    48

including debugging info (print query):

In [40]: find_queries(df, 'Col1 in {Col1} and Col4 > {Col4}', Col1='@search_list', Col4=10, debug=1)
[DEBUG]: Query: Col1 in @search_list and Col4 > 10
Out[40]:
        Col1      Col2  Col3  Col4
4      rocks      lips     4    32
6  mountains  history.     6    48

edited May 23, 2017 at 12:09

CommunityBot

11 silver badge

answered Oct 12, 2016 at 17:03

MaxU - stand with Ukraine

212k37 gold badges402 silver badges436 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

ShanZhengYang Over a year ago

And the function here would simply be find_queries = lambda df, col, values: df.("col in values)

ShanZhengYang Over a year ago

Perfect! You are several steps ahead of me in implementing debugging functionality.

uhjish · Accepted Answer · 2016-10-12 01:09:00Z

1

The best way to deal with this is by indexing into the rows using a Boolean series as you would in R.

Using your df as an example,

In [5]: df.Col1 == "what"
Out[5]:
0     True
1    False
2    False
3    False
4    False
5    False
6    False
Name: Col1, dtype: bool

In [6]: df[df.Col1 == "what"]
Out[6]:
   Col1 Col2  Col3  Col4
0  what  the     0     0

Now we combine this with the pandas isin function.

In [8]: df[df.Col1.isin(["men","rocks","mountains"])]
Out[8]:
        Col1      Col2  Col3  Col4
2        men        of     2    16
4      rocks      lips     4    32
6  mountains  history.     6    48

To filter on multiple columns we can chain them together with & and | operators like so.

In [10]: df[df.Col1.isin(["men","rocks","mountains"]) | df.Col2.isin(["lips","your"])]
Out[10]:
        Col1      Col2  Col3  Col4
2        men        of     2    16
3         to      your     3    24
4      rocks      lips     4    32
6  mountains  history.     6    48

In [11]: df[df.Col1.isin(["men","rocks","mountains"]) & df.Col2.isin(["lips","your"])]
Out[11]:
    Col1  Col2  Col3  Col4
4  rocks  lips     4    32

answered Oct 12, 2016 at 1:09

uhjish

1041 silver badge2 bronze badges

2 Comments

ShanZhengYang Over a year ago

The question above may have been unclear---I'm looking for a function that does this. Users input a list of values, a list of query hits are output.

uhjish Over a year ago

Not sure I understand the problem here. You can use the isin function to achieve what you need. If I were to rewrite your find_queries function, I'd do it like this: find_queries = lambda df, col, values: df[ df[col].isin(values) ]

Collectives™ on Stack Overflow

How to pass through a list of queries to a pandas dataframe, and output the list of results?

2 Answers 2

2 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related