I have a large DataFrame with thousands of columns; a shorter version here:

largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'], 'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'], 
                   'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})
   arow     bread  fruit tea   water
0  row1     b      c     b     b
1  row2     b      b     a     c
2  row3     b      b     b     b
3  row4     a      a     a     c

I want to keep rows that have exactly one category without any 'b', where the categories are defined by lists like these (again, there are actually many more than two lists):

food = ['bread', 'fruit']
drink = ['tea', 'water']

row2 is the only row that would be kept in this case: row1 doesn't have a category without 'b', row3 is all 'b', and row4 is all non-b.

The preferred output would have a column naming the single non-b category and the percentage of non-b values in that row:

   arow     bread  fruit tea   water category perc
1  row2     b      b     a     c     drink    0.5

3 Answers


Take a count of the boolean 'b' locations based on the lists you provided:

largedf['drink'] = (largedf[drink] == 'b').sum(axis=1)
largedf['food'] = (largedf[food] == 'b').sum(axis=1)

Now filter on your conditions. In this toy example, the product of the counts must equal zero and the sum must be greater than zero:

largedf[(largedf.drink * largedf.food == 0) & 
        (largedf.drink + largedf.food != 0)]

   arow bread fruit tea water  drink  food
1  row2     b     b   a     c      0     2

2 Comments

nice concise solution
@DJK The counts are switched - I need to know the count for the nonb category - so drink would be 2 and food would be 0. Is there a way to fix that?
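Following up on the comment: a hedged sketch (the `categories` dict name is mine, not from the question) that flips the test to count non-'b' cells instead, and also emits the requested category and perc columns. It scales to any number of lists:

```python
import pandas as pd

largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'],
                        'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'],
                        'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})
categories = {'food': ['bread', 'fruit'], 'drink': ['tea', 'water']}

# True where a category contains no 'b' at all in that row
b_free = pd.DataFrame({name: (largedf[cols] != 'b').all(axis=1)
                       for name, cols in categories.items()})

# keep rows where exactly one category is b-free
mask = b_free.sum(axis=1) == 1
out = largedf[mask].copy()
out['category'] = b_free[mask].idxmax(axis=1)   # name of the single b-free category

# perc = fraction of non-'b' cells across all category columns
all_cols = [c for cols in categories.values() for c in cols]
out['perc'] = (out[all_cols] != 'b').sum(axis=1) / len(all_cols)
print(out)
```

On the toy data this keeps only row2, with category 'drink' and perc 0.5, matching the desired output.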

I present a solution here to show that your DataFrame would benefit from a MultiIndex.

largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'], 'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'],
                   'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})

largedf.set_index('arow',inplace=True)

food = ['bread', 'fruit']
drink = ['tea', 'water']
categories = {'food': food, 'drink': drink}

l = []
for k, v in categories.items():
    for y in v:
        l.append((k, y))

largedf.columns = pd.MultiIndex.from_tuples(l)
print(largedf)

      food       drink      
     bread fruit   tea water
arow                        
row1     b     c     b     b
row2     b     b     a     c
row3     b     b     b     b
row4     a     a     a     c

idx = pd.IndexSlice
cond1 = (largedf.loc[:, idx['food']] == 'b').any(axis=1) * 1
cond2 = (largedf.loc[:, idx['drink']] == 'b').any(axis=1) * 1

# you want rows where (cond1 + cond2) == 1
largedf[('perc', 'perc')] = largedf.apply(lambda x: (x == 'b').sum() / 4., axis=1)
print(largedf.join(pd.DataFrame((cond1 + cond2) == 1, columns=[('match', 'match')])))

      food       drink         perc  match
     bread fruit   tea water   perc  match
arow                                      
row1     b     c     b     b 0.7500  False
row2     b     b     a     c 0.5000   True
row3     b     b     b     b 1.0000  False
row4     a     a     a     c 0.0000  False

2 Comments

Is there a way to generate those tuples from my lists instead of typing them out manually?
Yes, I'll edit the answer to use the lists you gave: food = ['bread', 'fruit'] and drink = ['tea', 'water']
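As asked in the comment, the tuples can be generated from the lists rather than typed out. A sketch, assuming the lists are first gathered into a `categories` dict (that name is mine):

```python
import pandas as pd

largedf = pd.DataFrame({'bread': ['b', 'a'], 'fruit': ['c', 'a'],
                        'tea': ['b', 'a'], 'water': ['b', 'c']})
categories = {'food': ['bread', 'fruit'], 'drink': ['tea', 'water']}

# one (category, column) tuple per column, generated from the lists
tuples = [(cat, col) for cat, cols in categories.items() for col in cols]
largedf = largedf[[col for _, col in tuples]]       # align column order with tuples
largedf.columns = pd.MultiIndex.from_tuples(tuples)
```

With hundreds of lists, only the `categories` dict needs to be maintained; the MultiIndex follows from it.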

Maybe this is something, but you need to add your own logic in the filter:

def fltr(df):
    # empty result frame, same index as df
    dfR = pd.DataFrame(index=df.index)
    # Insert your logic here
    for i, row in df.iterrows():
        if row['bread'] == 'b' and row['fruit'] == 'b':
            # just copy the row in this case 
            for k, v in row.items():
                dfR.loc[i, k] = v
            # add single col. items
            #dfR.loc[i, 'bread'] = "b"
            #dfR.loc[i, 'fruit'] = "f"
    # etc

    return dfR

food = ['bread', 'fruit']
drink = ['tea', 'water']
largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'],
                'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'], 
               'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})
print(largedf)
resultDF = largedf.pipe(fltr)
print(resultDF)


   arow bread fruit tea water
0  row1     b     c   b     b
1  row2     b     b   a     c
2  row3     b     b   b     b
3  row4     a     a   a     c

   arow bread fruit  tea water
0   NaN   NaN   NaN  NaN   NaN
1  row2     b     b    a     c
2  row3     b     b    b     b
3   NaN   NaN   NaN  NaN   NaN

3 Comments

The data I gave is toy data, in actuality I have a df with thousands of columns and hundreds of lists. A more automated way than defining each scenario is needed.
maybe that is the 'your own logic' part. But I leave that up to you
@Liquidity I think you should add this information in your question, because your toy data could be solved several ways but might not be adaptable to your case. Just an idea while I'm commenting, np.select could be a solution :)
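Picking up the np.select idea from the last comment, a hedged sketch for the two-list toy case (the `*_free` names are mine): each condition tests whether exactly one category is b-free, and np.select labels the row with that category's name.

```python
import numpy as np
import pandas as pd

largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'],
                        'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'],
                        'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})

# a category "without b" has no 'b' in any of its columns
food_free = (largedf[['bread', 'fruit']] != 'b').all(axis=1)
drink_free = (largedf[['tea', 'water']] != 'b').all(axis=1)

# label rows where exactly one category is b-free; '' otherwise
largedf['category'] = np.select(
    [food_free & ~drink_free, drink_free & ~food_free],
    ['food', 'drink'], default='')
```

Only row2 gets a label ('drink'); with many lists, the conditions would have to be built in a loop over the category lists rather than written out.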
