
When selecting rows where a column column_name equals a scalar some_value, we use ==:

df.loc[df['column_name'] == some_value]

or use .query()

df.query('column_name == some_value')
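
(If some_value is a string literal or a Python variable rather than another column, it has to be quoted or referenced with @ inside the query string, e.g.)

df.query('column_name == "some_string"')

some_value = "some_string"
df.query('column_name == @some_value')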

In a concrete example:

import pandas as pd
import numpy as np
df = pd.DataFrame({'Col1': 'what are men to rocks and mountains'.split(),
                   'Col2': 'the curves of your lips rewrite history.'.split(),
                   'Col3': np.arange(7),
                   'Col4': np.arange(7) * 8})

print(df)

        Col1      Col2  Col3  Col4
0       what       the     0     0
1        are    curves     1     8
2        men        of     2    16
3         to      your     3    24
4      rocks      lips     4    32
5        and   rewrite     5    40
6  mountains  history.     6    48

A query could be

rocks_row = df.loc[df['Col1'] == "rocks"]

which outputs

print(rocks_row)
    Col1  Col2  Col3  Col4
4  rocks  lips     4    32

I would like to pass a list of values to query against a dataframe, and get back all rows that match (the "correct queries").

The queries to execute would be in a list, e.g.

list_match = ['men', 'curves', 'history']

which would output all rows which meet this condition, i.e.

matches = pd.concat([df1, df2, df3]) 

where

df1 = df.loc[df['Col1'] == "men"]

df2 = df.loc[df['Col1'] == "curves"]

df3 = df.loc[df['Col1'] == "history"]

My idea would be to create a function that takes in a dataframe, a column name, and a list of values:

output = []
def find_queries(dataframe, column, values, output):
    for scalar in values:
        query = dataframe.loc[dataframe[column] == scalar]   # filter rows for one value
        output.append(query)    # append each query result to a list
    return pd.concat(output)    # return the concatenated list of dataframes

However, this appears to be exceptionally slow, and it doesn't actually take advantage of the pandas data structures. What is the "standard" way to run a list of queries against a pandas dataframe?

EDIT: How does this translate to "more complex" queries in pandas, e.g. a where clause with an HDF5 file?

df.to_hdf('test.h5','df',mode='w',format='table',data_columns=['A','B'])

pd.read_hdf('test.h5','df')

pd.read_hdf('test.h5','df',where='A=["foo","bar"] & B=1')
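
For the example df above, I would presumably write something like the following (assuming the where syntax accepts a list of values for a data column, as in the snippet from the docs; Col1 and Col4 have to be listed in data_columns to be queryable):

df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=['Col1', 'Col4'])

pd.read_hdf('test.h5', 'df', where='Col1=["rocks","mountains"] & Col4>10')
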
2 Comments

  • It's not quite clear what you want to achieve... Do you want to get a list of DFs satisfying different conditions, or one DF satisfying all of them? Commented Oct 12, 2016 at 8:57
  • @MaxU This is unclear---apologies. One DF satisfying all conditions. Commented Oct 12, 2016 at 16:13

2 Answers


If I understood your question correctly, you can do it either using boolean indexing, as @uhjish has already shown in his answer, or using the query() method:

In [30]: search_list = ['rocks','mountains']

In [31]: df
Out[31]:
        Col1      Col2  Col3  Col4
0       what       the     0     0
1        are    curves     1     8
2        men        of     2    16
3         to      your     3    24
4      rocks      lips     4    32
5        and   rewrite     5    40
6  mountains  history.     6    48

.query() method:

In [32]: df.query('Col1 in @search_list and Col4 > 40')
Out[32]:
        Col1      Col2  Col3  Col4
6  mountains  history.     6    48

In [33]: df.query('Col1 in @search_list')
Out[33]:
        Col1      Col2  Col3  Col4
4      rocks      lips     4    32
6  mountains  history.     6    48

using boolean indexing:

In [34]: df.loc[df.Col1.isin(search_list) & (df.Col4 > 40)]
Out[34]:
        Col1      Col2  Col3  Col4
6  mountains  history.     6    48

In [35]: df.loc[df.Col1.isin(search_list)]
Out[35]:
        Col1      Col2  Col3  Col4
4      rocks      lips     4    32
6  mountains  history.     6    48

UPDATE: using function:

def find_queries(df, qry, debug=0, **parms):
    if debug:
        print('[DEBUG]: Query:\t' + qry.format(**parms))
    return df.query(qry.format(**parms))

In [31]: find_queries(df, 'Col1 in {Col1} and Col4 > {Col4}', Col1='@search_list', Col4=40)
    ...:
Out[31]:
        Col1      Col2  Col3  Col4
6  mountains  history.     6    48

In [32]: find_queries(df, 'Col1 in {Col1} and Col4 > {Col4}', Col1='@search_list', Col4=10)
Out[32]:
        Col1      Col2  Col3  Col4
4      rocks      lips     4    32
6  mountains  history.     6    48

including debugging info (print query):

In [40]: find_queries(df, 'Col1 in {Col1} and Col4 > {Col4}', Col1='@search_list', Col4=10, debug=1)
[DEBUG]: Query: Col1 in @search_list and Col4 > 10
Out[40]:
        Col1      Col2  Col3  Col4
4      rocks      lips     4    32
6  mountains  history.     6    48

2 Comments

And the function here would simply be find_queries = lambda df, col, values: df.query('{} in @values'.format(col))
Perfect! You are several steps ahead of me in implementing debugging functionality.

The best way to deal with this is by indexing into the rows using a Boolean series as you would in R.

Using your df as an example,

In [5]: df.Col1 == "what"
Out[5]:
0     True
1    False
2    False
3    False
4    False
5    False
6    False
Name: Col1, dtype: bool

In [6]: df[df.Col1 == "what"]
Out[6]:
   Col1 Col2  Col3  Col4
0  what  the     0     0

Now we combine this with the pandas isin function.

In [8]: df[df.Col1.isin(["men","rocks","mountains"])]
Out[8]:
        Col1      Col2  Col3  Col4
2        men        of     2    16
4      rocks      lips     4    32
6  mountains  history.     6    48

To filter on multiple columns, we can chain the conditions together with the & and | operators, like so.

In [10]: df[df.Col1.isin(["men","rocks","mountains"]) | df.Col2.isin(["lips","your"])]
Out[10]:
        Col1      Col2  Col3  Col4
2        men        of     2    16
3         to      your     3    24
4      rocks      lips     4    32
6  mountains  history.     6    48

In [11]: df[df.Col1.isin(["men","rocks","mountains"]) & df.Col2.isin(["lips","your"])]
Out[11]:
    Col1  Col2  Col3  Col4
4  rocks  lips     4    32

2 Comments

The question above may have been unclear---I'm looking for a function that does this: users input a list of values, and a list of query hits is output.
Not sure I understand the problem here. You can use the isin function to achieve what you need. If I were to rewrite your find_queries function, I'd do it like this: find_queries = lambda df, col, values: df[ df[col].isin(values) ]
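
For reference, a runnable sketch of that suggestion applied to the example df from the question (find_queries here is just the name reused from the question, not a pandas function):

def find_queries(df, col, values):
    # boolean indexing with isin: keep the rows where df[col] matches any of the given values
    return df[df[col].isin(values)]

find_queries(df, 'Col1', ['men', 'rocks', 'mountains'])
#         Col1      Col2  Col3  Col4
# 2        men        of     2    16
# 4      rocks      lips     4    32
# 6  mountains  history.     6    48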
