
I'd be grateful for any help anyone can offer on this, as I have been tearing my hair out trying to solve it.

I have two python pandas dataframes, in simplified form they look like this:

df1

+-----+-----+-----+
| a_1 | a_2 | a_3 |
+-----+-----+-----+
|   0 |   2 |   5 |
|   1 |   3 |   4 |
|   0 |   0 |   0 |
+-----+-----+-----+

df2

+-----+-----+-----+
| b_1 | b_2 | b_3 |
+-----+-----+-----+
|   0 |   0 |   1 |
|   1 |   0 |   1 |
|   0 |   0 |   0 |
+-----+-----+-----+

I want to add a column to df1 containing, for each row, a count of the cells that are non-null in df1 where the equivalent cell in df2 is also non-null. The column titles differ between the two dataframes, but they match once the initial a_ and b_ prefixes are stripped.

So in this example the code would count just the third value in the first row, and the first and third values in the second row. The new df1 dataframe would therefore look like this:

new_df1

+-----+-----+-----+----------------------+
| a_1 | a_2 | a_3 | count_if_nonnull_df2 |
+-----+-----+-----+----------------------+
|   0 |   2 |   5 |                    1 |
|   1 |   3 |   4 |                    2 |
|   0 |   0 |   0 |                    0 |
+-----+-----+-----+----------------------+
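
To make the logic concrete, here is a naive row-by-row loop that gives the count I'm after on the simplified frames above (treating "non-null" as non-zero, and relying on the columns being in matching order). I'm hoping for a cleaner, vectorised way to do this on the real data:

counts = []
for i in range(len(df1)):
    # pair up cells positionally and count positions where both df1 and df2 are non-zero
    counts.append(sum(1 for a, b in zip(df1.iloc[i], df2.iloc[i]) if a != 0 and b != 0))
df1['count_if_nonnull_df2'] = counts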

Would anyone be able to help?! Thanks in advance.

  • By "notnull" do you mean "not zero", right? Commented Mar 26, 2018 at 21:07
  • Yes! There are a whole lot of NaNs in the actual dataset, but I specifically mean where the value in df2 is greater than 0. Commented Mar 26, 2018 at 21:11

3 Answers


Assuming by "non-null" you mean "non-zero" (per your example), try this...

Problem setup:

>>> df1 = pd.DataFrame.from_dict({'a_1':[0,1,0], 'a_2':[2,3,0], 'a_3':[5,4,0]})
>>> df2 = pd.DataFrame.from_dict({'b_1':[0,1,0], 'b_2':[0,0,0], 'b_3':[1,1,0]})

Using a boolean mask cast to ints, we can compute the row-wise sums:

>>> df1['count_if_nonnull_df2'] = (df2 > 0).astype(int).sum(axis=1)
>>> df1
   a_1  a_2  a_3  count_if_nonnull_df2
0    0    2    5                     1
1    1    3    4                     2
2    0    0    0                     0
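
Side note: if your real df2 also contains NaNs (as mentioned in the comments), the same mask should behave sensibly, because element-wise comparisons against NaN evaluate to False, so NaN cells simply aren't counted. A quick sketch with a hypothetical NaN-containing frame:

>>> import numpy as np
>>> df2_nan = pd.DataFrame({'b_1': [0, 1, 0], 'b_2': [np.nan, 0, 0], 'b_3': [1, 1, np.nan]})
>>> (df2_nan > 0).astype(int).sum(axis=1)
0    1
1    2
2    0
dtype: int64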

7 Comments

If df2 = pd.DataFrame.from_dict({'b_1':[1,1,0], 'b_2':[0,0,0], 'b_3':[1,1,0]}), the output may differ.
@Wen If you're talking about column ordering, yes, since I constructed using from_dict. If you're referring to the fact that there's an additional value in the 'b_1' column, yes, re-read the question: "...if the equivalent cell is non-null in df2...". OP is not interested in the nulls in df1 & df2, just df2.
I think I mean: non-null values (per row) in df1, where the equivalent cell is non-null in df2. If the df1 cell is itself null, shouldn't we consider both df1 and df2 in that case?
I don't think that's what OP asked for, but he'll have to clarify if I'm wrong.
As I was implementing the proposed solution, I realised that for my problem I actually wanted to count values of 0 in df1 (rather than values > 0) where df2 > 0, but I think I was able to reverse engineer that from this solution.
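
For reference, that reversed count (cells that are 0 in df1 where the matching df2 cell is > 0) can be written analogously; the column name below is just illustrative:

df1['count_zero_where_df2_positive'] = (df1.eq(0).values & df2.gt(0).values).sum(axis=1)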

Well, if the a_ df and the b_ df were matrices, you could element-wise multiply the two together. Note this assumes the columns of each df are in the correct order (easy to accomplish, if not). For your example, this would result in a matrix like

0 0 5
1 0 4
0 0 0

You can then count how many of these are nonzero in each row.

You can convert each dataframe to a NumPy array with df.to_numpy() (or df.values; the older df.as_matrix() has been removed in recent pandas versions), multiply the two together simply with result = first_mtx * second_mtx, then count the non-zero entries in each row with np.count_nonzero and axis=1.

import numpy as np

first_array = a_df.to_numpy()
second_array = b_df.to_numpy()
# the product is non-zero only where both cells are non-zero
count_if_nonnull_df2 = np.count_nonzero(first_array * second_array, axis=1)
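
Putting it together with the example frames from the question (a self-contained sketch; the new column name is taken from the expected output in the question):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a_1': [0, 1, 0], 'a_2': [2, 3, 0], 'a_3': [5, 4, 0]})
df2 = pd.DataFrame({'b_1': [0, 1, 0], 'b_2': [0, 0, 0], 'b_3': [1, 1, 0]})

# the element-wise product is non-zero only where both cells are non-zero
df1['count_if_nonnull_df2'] = np.count_nonzero(df1.to_numpy() * df2.to_numpy(), axis=1)
print(df1)
#    a_1  a_2  a_3  count_if_nonnull_df2
# 0    0    2    5                     1
# 1    1    3    4                     2
# 2    0    0    0                     0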

1 Comment

Thank you for this n3utrino! I used the solution provided by Tgsmith above, but I really like this matrix-based idea and must bear it in mind for future problems.

I think it can be done with:

import numpy as np

# combine the non-zero masks positionally (via .values) and sum per row
df1['countif'] = np.sum(df1.ne(0).values & df2.ne(0).values, axis=1)
df1
Out[703]: 
   a_1  a_2  a_3  countif
0    0    2    5        1
1    1    3    4        2
2    0    0    0        0
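
If you would rather stay label-based instead of dropping to .values, one option (just a sketch; the positional rename assumes the columns of both frames are in matching order) is to give df2 the same column labels as df1 before combining the masks:

# map the b_* columns onto the matching a_* labels, then align the masks by label
df2_aligned = df2.rename(columns=dict(zip(df2.columns, df1.columns)))
df1['countif'] = (df1.ne(0) & df2_aligned.ne(0)).sum(axis=1)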

1 Comment

That's really helpful, thank you Wen! I am going to try this tomorrow to see if it gives similar results, as I am working with a very large dataset that's hard to visualise. Triangulating the approach by trying this as well to check accuracy would be really useful.
