
I'd be grateful for any help anyone can offer on this, as I have been tearing my hair out trying to solve it.

I have two python pandas dataframes, in simplified form they look like this:

df1

+-----+-----+-----+
| a_1 | a_2 | a_3 |
+-----+-----+-----+
|   0 |   2 |   5 |
|   1 |   3 |   4 |
|   0 |   0 |   0 |
+-----+-----+-----+

df2

+-----+-----+-----+
| b_1 | b_2 | b_3 |
+-----+-----+-----+
|   0 |   0 |   1 |
|   1 |   0 |   1 |
|   0 |   0 |   0 |
+-----+-----+-----+

I want to add a column to df1 containing, for each row, a count of the cells that are non-null in df1 where the equivalent cell in df2 is also non-null. The column titles differ between the two dataframes, but they match once the initial a_ and b_ prefixes are stripped.

So in this example the code would count just the third value in the first row, and the first and third values in the second row. The new df1 dataframe would therefore look like this:

new_df1

+-----+-----+-----+----------------------+
| a_1 | a_2 | a_3 | count_if_nonnull_df2 |
+-----+-----+-----+----------------------+
|   0 |   2 |   5 |                    1 |
|   1 |   3 |   4 |                    2 |
|   0 |   0 |   0 |                    0 |
+-----+-----+-----+----------------------+
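
To make the logic concrete, here is a naive row-by-row loop that gives the count I'm after on the simplified frames above (treating "non-null" as non-zero, and relying on the columns being in matching order). I'm hoping for a cleaner, vectorised way to do this on the real data:

counts = []
for i in range(len(df1)):
    # pair up cells positionally and count positions where both df1 and df2 are non-zero
    counts.append(sum(1 for a, b in zip(df1.iloc[i], df2.iloc[i]) if a != 0 and b != 0))
df1['count_if_nonnull_df2'] = counts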

Would anyone be able to help?! Thanks in advance.

  • By "notnull" do you mean "not zero", right? Commented Mar 26, 2018 at 21:07
  • Yes! There are a whole lot of NaNs in the actual dataset, but I specifically mean where the value in df2 is greater than 0. Commented Mar 26, 2018 at 21:11

3 Answers


Assuming by "non-null" you mean "non-zero" (per your example), try this...

Problem setup:

>>> df1 = pd.DataFrame.from_dict({'a_1':[0,1,0], 'a_2':[2,3,0], 'a_3':[5,4,0]})
>>> df2 = pd.DataFrame.from_dict({'b_1':[0,1,0], 'b_2':[0,0,0], 'b_3':[1,1,0]})

Using a boolean mask cast to ints, we can compute the row-wise sums:

>>> df1['count_if_nonnull_df2'] = (df2 > 0).astype(int).sum(axis=1)
>>> df1
   a_1  a_2  a_3  count_if_nonnull_df2
0    0    2    5                     1
1    1    3    4                     2
2    0    0    0                     0
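
Side note: if your real df2 also contains NaNs (as mentioned in the comments), the same mask should behave sensibly, because element-wise comparisons against NaN evaluate to False, so NaN cells simply aren't counted. A quick sketch with a hypothetical NaN-containing frame:

>>> import numpy as np
>>> df2_nan = pd.DataFrame({'b_1': [0, 1, 0], 'b_2': [np.nan, 0, 0], 'b_3': [1, 1, np.nan]})
>>> (df2_nan > 0).astype(int).sum(axis=1)
0    1
1    2
2    0
dtype: int64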

7 Comments

If df2 = pd.DataFrame.from_dict({'b_1':[1,1,0], 'b_2':[0,0,0], 'b_3':[1,1,0]}), the output may differ.
@Wen If you're talking about column ordering, yes, since I constructed using from_dict. If you're referring to the fact that there's an additional value in the 'b_1' column, yes, re-read the question: "...if the equivalent cell is non-null in df2...". OP is not interested in the nulls in df1 & df2, just df2.
I think I mean: non-null values (per row) in df1, where the equivalent cell is non-null in df2. If the df1 cell is itself null, shouldn't we consider both df1 and df2 in that case?
I don't think that's what OP asked for, but he'll have to clarify if I'm wrong.
As I was implementing the proposed solution, I realised that for my problem I actually wanted to count values of 0 in df1 (rather than values > 0) where df2 > 0, but I think I was able to reverse engineer that from this solution.
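
For reference, that reversed count (cells that are 0 in df1 where the matching df2 cell is > 0) can be written analogously; the column name below is just illustrative:

df1['count_zero_where_df2_positive'] = (df1.eq(0).values & df2.gt(0).values).sum(axis=1)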

Well, if the a_ df and the b_ df were matrices, you could element-wise multiply the two together. Note this assumes the columns of each df are in the correct order (easy to accomplish, if not). For your example, this would result in a matrix like

0 0 5
1 0 4
0 0 0

You can then count how many of these are nonzero in each row.

You can convert each dataframe to a NumPy array with df.to_numpy() (or df.values; the older df.as_matrix() has been removed in recent pandas versions), multiply the two together simply with result = first_mtx * second_mtx, then count the non-zero entries in each row with np.count_nonzero and axis=1.

import numpy as np

first_array = a_df.to_numpy()
second_array = b_df.to_numpy()
# the product is non-zero only where both cells are non-zero
count_if_nonnull_df2 = np.count_nonzero(first_array * second_array, axis=1)
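
Putting it together with the example frames from the question (a self-contained sketch; the new column name is taken from the expected output in the question):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a_1': [0, 1, 0], 'a_2': [2, 3, 0], 'a_3': [5, 4, 0]})
df2 = pd.DataFrame({'b_1': [0, 1, 0], 'b_2': [0, 0, 0], 'b_3': [1, 1, 0]})

# the element-wise product is non-zero only where both cells are non-zero
df1['count_if_nonnull_df2'] = np.count_nonzero(df1.to_numpy() * df2.to_numpy(), axis=1)
print(df1)
#    a_1  a_2  a_3  count_if_nonnull_df2
# 0    0    2    5                     1
# 1    1    3    4                     2
# 2    0    0    0                     0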

1 Comment

Thank you for this n3utrino! I used the solution provided by Tgsmith above, but I really like this matrix-based idea and must bear it in mind for future problems.

I think it can be done with:

import numpy as np

# combine the non-zero masks positionally (via .values) and sum per row
df1['countif'] = np.sum(df1.ne(0).values & df2.ne(0).values, axis=1)
df1
Out[703]: 
   a_1  a_2  a_3  countif
0    0    2    5        1
1    1    3    4        2
2    0    0    0        0
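
If you would rather stay label-based instead of dropping to .values, one option (just a sketch; the positional rename assumes the columns of both frames are in matching order) is to give df2 the same column labels as df1 before combining the masks:

# map the b_* columns onto the matching a_* labels, then align the masks by label
df2_aligned = df2.rename(columns=dict(zip(df2.columns, df1.columns)))
df1['countif'] = (df1.ne(0) & df2_aligned.ne(0)).sum(axis=1)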

1 Comment

That's really helpful, thank you Wen! I am going to try this tomorrow to see if it gives similar results, as I am working with a very large dataset that's hard to visualise. Triangulating the approach by trying this as well to check accuracy would be really useful.
