2

Using the following test data:

import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.randn(12, 3), columns=['A', 'B', 'C'])
thresh = .3
df2['matches'] = np.where(df2.A - df2.B < thresh, 1, 0)

I created the df2['matches'] column showing a value of 1 when df2.A - df2.B < thresh.

           A         B         C  matches
0   0.501554 -0.589855 -0.751568        0
1  -0.295198  0.512442  0.466915        1
2   0.074863  0.343388 -1.700998        1
3   0.115432 -0.507847 -0.825545        0
4   1.013837 -0.007333 -0.292192        0
5  -0.930738  1.235501 -0.652071        1
6  -1.026615  1.389294  0.035041        1
7   0.969147 -0.397276  1.272235        0
8   0.120461 -0.634686 -1.123046        0
9   0.956896 -0.345948 -0.620748        0
10 -0.552476  1.376459  0.447807        1
11  0.882275  0.490049  0.713033        0

However, I would actually like to compare all three columns with each other and have df2['matches'] hold the number of matches, i.e. how many of the comparisons are within thresh.

So for example, if A = 1, B = 2 and C = 1.5 and thresh were .5, the function would return 3 in the ['matches'] column.

Is there a function that already does something similar or can anyone help with this?

3 Answers

2

You can use the threshold for each pair of your columns, then sum up the resulting boolean columns to obtain the number you need. Note, however, that this number depends on the order in which you compare columns. This ambiguity would be gone if you used abs(df['A']-df['B']) etc, and this might very well be your intention. Below I'll assume this is what you need.
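To make the order dependence concrete, here is a minimal sketch. The two-row frame `demo` and its values are made up for illustration and are not the asker's data. It compares the signed difference in both column orders with the absolute difference.

```python
import pandas as pd

# Hypothetical two-row frame, chosen so the differences are easy to check by hand.
demo = pd.DataFrame({'A': [1.0, 2.0], 'B': [2.0, 1.0]})
thresh = 0.5

# Signed difference: the result flips when the two columns are swapped.
a_minus_b = (demo['A'] - demo['B'] < thresh).tolist()  # differences are -1.0 and 1.0
b_minus_a = (demo['B'] - demo['A'] < thresh).tolist()  # differences are 1.0 and -1.0

# Absolute difference: the same result in either order.
abs_ab = ((demo['A'] - demo['B']).abs() < thresh).tolist()
abs_ba = ((demo['B'] - demo['A']).abs() < thresh).tolist()

print(a_minus_b, b_minus_a, abs_ab, abs_ba)
```

With the signed difference, a large negative gap still counts as "below thresh", which is probably not what "within thresh" is meant to express.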

Generally, you can use itertools.combinations to produce each pair of columns once:

from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(12, 3), columns=['A', 'B', 'C'])
thresh = .3
df['matches'] = sum(abs(df[k1] - df[k2]) < thresh for k1, k2 in combinations(df.keys(), 2))

The generator expression inside sum() loops over every column pair and builds the corresponding boolean vector. These vectors are summed across the column pairs, and the resulting column is added to the dataframe as 'matches'.
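To see the intermediate boolean vectors, the same computation can be spelled out on a small deterministic frame. This is only a sketch: the frame `small`, its values, and the name `pair_flags` are assumptions for illustration.

```python
from itertools import combinations

import pandas as pd

# Hypothetical data with hand-checkable differences.
small = pd.DataFrame({'A': [1.0, 0.0, 5.0],
                      'B': [1.25, 0.125, 1.0],
                      'C': [1.125, 3.0, 9.0]})
thresh = 0.3

# One boolean column per pair of columns: True where the two values are within thresh.
pair_flags = pd.DataFrame({
    f'{k1}-{k2}': (small[k1] - small[k2]).abs() < thresh
    for k1, k2 in combinations(['A', 'B', 'C'], 2)
})

# Summing the booleans row by row gives the number of matching pairs per row.
matches = pair_flags.sum(axis=1)
print(pair_flags)
print(matches.tolist())
```

In the first row all three pairs are within 0.3 of each other, in the second only A and B are, and in the third none are, so the counts are 3, 1 and 0.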

Example output for thresh = 0.3:

           A         B         C  matches
0   0.146360 -0.099707  0.633632        1
1   1.462810 -0.186317 -1.411988        0
2   0.358827 -0.758619  0.038329        0
3   0.077122 -0.213856 -0.619768        1
4   0.215555  1.930888 -0.488517        0
5  -0.946557 -0.904743 -0.004738        1
6  -0.080209 -0.850830 -0.866865        1
7  -0.997710 -0.580679 -2.231168        0
8   1.762313 -0.356464 -1.813028        0
9   1.151338  0.347636 -1.323791        0
10  0.248432  1.265484  0.048484        1
11  0.559934 -0.401059  0.863616        0

Using itertools.combinations, the columns are compared as

>>> list(combinations(['A', 'B', 'C'], 2))
[('A', 'B'), ('A', 'C'), ('B', 'C')]

but this really doesn't matter if you're using the absolute value (since then the difference is symmetric with respect to columns).


4 Comments

Thanks @Andras Deak. Maybe there is a problem with my question though as line 10 shows 3 matches and the difference between 1.260968 and 0.690971 is > thresh. I am looking for matches when the difference between the numbers is < thresh.
@adele see my last code block: the order we have is ('B','C'), so we compute the 'B' column minus the 'C' column, which is negative in this case. You might want it the other way around, by swapping k1 and k2 in the list comprehension (edited; now a generator expression inside sum()); but most likely you need the absolute value of the difference rather than the difference itself. Do you see what I mean?
Can you pls show me how I get the absolute value of the difference and I will see if this gives the result I was expecting, thanks
@adele I'm glad I could help:)
1

Try this guy:

df2['matches'] = df2.apply(lambda x: sum(abs(x.iloc[i] - x.iloc[j]) < thresh for i, j in [(0, 1), (0, 2), (1, 2)]), axis=1)

It could be generalized to any number of columns if necessary.
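One possible generalization, as a sketch only: instead of hard-coding the index pairs, let itertools.combinations produce every pair of values in the row. The helper name `count_close_pairs` and the four-column frame `wide` are hypothetical, and the sketch assumes the absolute difference is what is wanted.

```python
from itertools import combinations

import pandas as pd


def count_close_pairs(row, thresh):
    """Count the pairs of values in `row` that differ by less than `thresh`."""
    values = row.tolist()
    return sum(abs(a - b) < thresh for a, b in combinations(values, 2))


# Hypothetical four-column frame; any number of numeric columns works the same way.
wide = pd.DataFrame({'A': [1.0, 0.0],
                     'B': [1.25, 2.0],
                     'C': [1.125, 4.0],
                     'D': [1.0, 4.25]})
thresh = 0.3

# Select the value columns explicitly so an existing 'matches' column is never compared.
value_cols = ['A', 'B', 'C', 'D']
wide['matches'] = wide[value_cols].apply(lambda row: count_close_pairs(row, thresh), axis=1)
print(wide)
```

Row-wise apply is convenient but slow on large frames; the column-wise sum over pairs from the other answer scales better.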

-2

Here's a way to do it:

df2 = pd.DataFrame(np.random.randn(12, 3), columns=['A', 'B', 'C'])
thresh = 0.3

newcol = []
for _, row in df2.iterrows():
    newcol.append(sum(v > thresh for v in row))
df2['matches'] = newcol

1 Comment

"How many columns are >thresh" would be answerable with much less work; that is not the question.
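For reference, the shorter form the comment alludes to might look like the sketch below. The frame `demo` is made up for illustration. Like the answer above, it counts values greater than thresh per row and does not compare columns with each other.

```python
import pandas as pd

# Hypothetical data, not the asker's frame.
demo = pd.DataFrame({'A': [0.5, 0.1], 'B': [0.2, 0.4], 'C': [1.0, -1.0]})
thresh = 0.3

# Element-wise comparison, then a row-wise sum of the resulting booleans.
above = (demo[['A', 'B', 'C']] > thresh).sum(axis=1)
print(above.tolist())
```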
