I have a large DataFrame with thousands of columns; a shorter version here:

largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'], 'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'], 
                   'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})
   arow     bread  fruit tea   water
0  row1     b      c     b     b
1  row2     b      b     a     c
2  row3     b      b     b     b
3  row4     a      a     a     c

I want to keep rows that have exactly one category without any 'b', where the categories are defined by lists like these (again, there are actually many more than two lists):

food = ['bread', 'fruit']
drink = ['tea', 'water']

row2 is the only row that would be kept in this case: row1 doesn't have a category without 'b', row3 is all 'b', and row4 is all non-b.

The preferred output would have a column naming the single non-b category and the percentage of non-b values in that row:

   arow     bread  fruit tea   water category perc
1  row2     b      b     a     c     drink    0.5

3 Answers


Take a count of the boolean 'b' locations based on the lists you provided:

largedf['drink'] = (largedf[drink] == 'b').sum(axis=1)
largedf['food'] = (largedf[food] == 'b').sum(axis=1)

Now filter on your conditions. In this toy example, the product of the counts must equal zero and the sum must be greater than zero:

largedf[(largedf.drink * largedf.food == 0) & 
        (largedf.drink + largedf.food != 0)]

   arow bread fruit tea water  drink  food
1  row2     b     b   a     c      0     2

2 Comments

nice concise solution
@DJK The counts are switched - I need to know the count for the nonb category - so drink would be 2 and food would be 0. Is there a way to fix that?
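Following up on the comment: a hedged sketch (the `categories` dict name is mine, not from the question) that flips the test to count non-'b' cells instead, and also emits the requested category and perc columns. It scales to any number of lists:

```python
import pandas as pd

largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'],
                        'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'],
                        'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})
categories = {'food': ['bread', 'fruit'], 'drink': ['tea', 'water']}

# True where a category contains no 'b' at all in that row
b_free = pd.DataFrame({name: (largedf[cols] != 'b').all(axis=1)
                       for name, cols in categories.items()})

# keep rows where exactly one category is b-free
mask = b_free.sum(axis=1) == 1
out = largedf[mask].copy()
out['category'] = b_free[mask].idxmax(axis=1)   # name of the single b-free category

# perc = fraction of non-'b' cells across all category columns
all_cols = [c for cols in categories.values() for c in cols]
out['perc'] = (out[all_cols] != 'b').sum(axis=1) / len(all_cols)
print(out)
```

On the toy data this keeps only row2, with category 'drink' and perc 0.5, matching the desired output.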

I present a solution here to show that your DataFrame would benefit from a MultiIndex.

largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'], 'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'],
                   'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})

largedf.set_index('arow',inplace=True)

food = ['bread', 'fruit']
drink = ['tea', 'water']
categories = {'food': food, 'drink': drink}

l = []
for k, v in categories.items():
    for y in v:
        l.append((k, y))

largedf.columns = pd.MultiIndex.from_tuples(l)
print(largedf)

      food       drink      
     bread fruit   tea water
arow                        
row1     b     c     b     b
row2     b     b     a     c
row3     b     b     b     b
row4     a     a     a     c

idx = pd.IndexSlice
cond1 = (largedf.loc[:, idx['food']] == 'b').any(axis=1) * 1
cond2 = (largedf.loc[:, idx['drink']] == 'b').any(axis=1) * 1

# you want rows where (cond1 + cond2) == 1
largedf[('perc', 'perc')] = largedf.apply(lambda x: (x == 'b').sum() / 4., axis=1)
print(largedf.join(pd.DataFrame((cond1 + cond2) == 1, columns=[('match', 'match')])))

      food       drink         perc  match
     bread fruit   tea water   perc  match
arow                                      
row1     b     c     b     b 0.7500  False
row2     b     b     a     c 0.5000   True
row3     b     b     b     b 1.0000  False
row4     a     a     a     c 0.0000  False

2 Comments

Is there a way to generate those tuples from my lists instead of typing them out manually?
Yes, I'll edit the answer to use the lists you gave: food = ['bread', 'fruit'] and drink = ['tea', 'water']
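As asked in the comment, the tuples can be generated from the lists rather than typed out. A sketch, assuming the lists are first gathered into a `categories` dict (that name is mine):

```python
import pandas as pd

largedf = pd.DataFrame({'bread': ['b', 'a'], 'fruit': ['c', 'a'],
                        'tea': ['b', 'a'], 'water': ['b', 'c']})
categories = {'food': ['bread', 'fruit'], 'drink': ['tea', 'water']}

# one (category, column) tuple per column, generated from the lists
tuples = [(cat, col) for cat, cols in categories.items() for col in cols]
largedf = largedf[[col for _, col in tuples]]       # align column order with tuples
largedf.columns = pd.MultiIndex.from_tuples(tuples)
```

With hundreds of lists, only the `categories` dict needs to be maintained; the MultiIndex follows from it.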

Maybe this is something, but you need to add your own logic in the filter:

def fltr(df):
    # empty result frame, same index as df
    dfR = pd.DataFrame(index=df.index)
    # Insert your logic here
    for i, row in df.iterrows():
        if row['bread'] == 'b' and row['fruit'] == 'b':
            # just copy the row in this case 
            for k, v in row.items():
                dfR.loc[i, k] = v
            # add single col. items
            #dfR.loc[i, 'bread'] = "b"
            #dfR.loc[i, 'fruit'] = "f"
    # etc

    return dfR

food = ['bread', 'fruit']
drink = ['tea', 'water']
largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'],
                'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'], 
               'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})
print(largedf)
resultDF = largedf.pipe(fltr)
print(resultDF)


   arow bread fruit tea water
0  row1     b     c   b     b
1  row2     b     b   a     c
2  row3     b     b   b     b
3  row4     a     a   a     c

   arow bread fruit  tea water
0   NaN   NaN   NaN  NaN   NaN
1  row2     b     b    a     c
2  row3     b     b    b     b
3   NaN   NaN   NaN  NaN   NaN

3 Comments

The data I gave is toy data, in actuality I have a df with thousands of columns and hundreds of lists. A more automated way than defining each scenario is needed.
maybe that is the 'your own logic' part. But I leave that up to you
@Liquidity I think you should add this information in your question, because your toy data could be solved several ways but might not be adaptable to your case. Just an idea while I'm commenting, np.select could be a solution :)
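Picking up the np.select idea from the last comment, a hedged sketch for the two-list toy case (the `*_free` names are mine): each condition tests whether exactly one category is b-free, and np.select labels the row with that category's name.

```python
import numpy as np
import pandas as pd

largedf = pd.DataFrame({'arow': ['row1', 'row2', 'row3', 'row4'],
                        'bread': ['b', 'b', 'b', 'a'], 'fruit': ['c', 'b', 'b', 'a'],
                        'tea': ['b', 'a', 'b', 'a'], 'water': ['b', 'c', 'b', 'c']})

# a category "without b" has no 'b' in any of its columns
food_free = (largedf[['bread', 'fruit']] != 'b').all(axis=1)
drink_free = (largedf[['tea', 'water']] != 'b').all(axis=1)

# label rows where exactly one category is b-free; '' otherwise
largedf['category'] = np.select(
    [food_free & ~drink_free, drink_free & ~food_free],
    ['food', 'drink'], default='')
```

Only row2 gets a label ('drink'); with many lists, the conditions would have to be built in a loop over the category lists rather than written out.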
