Function for matching values in multiple columns

Question

Using the following test data:

import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.randn(12, 3), columns=['A', 'B', 'C'])
thresh = .3
df2['matches'] = np.where(df2.A - df2.B < thresh, 1, 0)

I created the df2['matches'] column showing a value of 1 when df2.A - df2.B < thresh.

        A           B            C      matches
0   0.501554    -0.589855   -0.751568   0
1   -0.295198   0.512442    0.466915    1
2   0.074863    0.343388    -1.700998   1
3   0.115432    -0.507847   -0.825545   0
4   1.013837    -0.007333   -0.292192   0
5   -0.930738   1.235501    -0.652071   1
6   -1.026615   1.389294    0.035041    1
7   0.969147    -0.397276   1.272235    0
8   0.120461    -0.634686   -1.123046   0
9   0.956896    -0.345948   -0.620748   0
10  -0.552476   1.376459    0.447807    1
11  0.882275    0.490049    0.713033    0

However, I would actually like to compare all three columns with each other and, wherever the values are within thresh, store the number of such matches in df2['matches'].

So for example if Col A = 1, B = 2 and C = 1.5 and thresh was .5 the function would return 3 in the ['matches'] column.

Is there a function that already does something similar or can anyone help with this?

Andras Deak -- Слава Україні · Accepted Answer · 2016-12-03 00:31:19Z

2

You can apply the threshold to each pair of your columns, then sum the resulting boolean columns to obtain the number you need. Note, however, that with a signed difference this number depends on the order in which you compare the columns. This ambiguity disappears if you use abs(df['A']-df['B']) etc., and this might very well be your intention. Below I'll assume this is what you need.

Generally, you can use itertools.combinations to produce each pair of columns once:

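from itertools import combinations
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(12, 3), columns=['A', 'B', 'C'])
thresh = .3
# per row, count the column pairs whose absolute difference is below thresh
df['matches'] = sum(abs(df[k1]-df[k2])<thresh for k1,k2 in combinations(df.keys(),2))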

The generator expression inside sum() loops over every column pair and constructs the corresponding boolean vector. These vectors are added together across all column pairs, and the resulting column of counts is appended to the dataframe.
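To make the summation concrete, here is a minimal sketch on three hand-picked rows (the values are illustrative assumptions, not data from the question), so the result can be checked by hand. The name pair_flags is only introduced here for illustration.

```python
from itertools import combinations

import pandas as pd

# Hand-picked illustrative values (an assumption, not from the original post).
df = pd.DataFrame({'A': [1.0, 0.0, 0.0],
                   'B': [1.1, 0.1, 1.0],
                   'C': [1.2, 1.0, 2.0]})
thresh = .3

# One boolean Series per column pair: True where the pair is within thresh.
pair_flags = [abs(df[k1] - df[k2]) < thresh
              for k1, k2 in combinations(['A', 'B', 'C'], 2)]

# Adding the boolean Series counts the True values row by row.
matches = sum(pair_flags)
print(matches.tolist())
```

The first row has all three pairs within thresh, the second only the ('A', 'B') pair, and the third none.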

Example output for thresh = 0.3:

           A         B         C  matches
0   0.146360 -0.099707  0.633632        1
1   1.462810 -0.186317 -1.411988        0
2   0.358827 -0.758619  0.038329        0
3   0.077122 -0.213856 -0.619768        1
4   0.215555  1.930888 -0.488517        0
5  -0.946557 -0.904743 -0.004738        1
6  -0.080209 -0.850830 -0.866865        1
7  -0.997710 -0.580679 -2.231168        0
8   1.762313 -0.356464 -1.813028        0
9   1.151338  0.347636 -1.323791        0
10  0.248432  1.265484  0.048484        1
11  0.559934 -0.401059  0.863616        0

Using itertools.combinations, the columns are compared as

>>> list(combinations(df.keys(), 2))
[('A', 'B'), ('A', 'C'), ('B', 'C')]

but this really doesn't matter if you're using the absolute value (since then the difference is symmetric with respect to columns).
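A small sketch of why the order matters for a signed difference but not for the absolute value. The single row is a hand-picked assumption, and the variable names are only for illustration.

```python
import pandas as pd

# Hand-picked illustrative row (an assumption, not from the original post).
df = pd.DataFrame({'A': [0.0], 'B': [1.0]})
thresh = .3

a_minus_b = (df['A'] - df['B'] < thresh).tolist()       # -1.0 < 0.3
b_minus_a = (df['B'] - df['A'] < thresh).tolist()       #  1.0 < 0.3
symmetric = (abs(df['A'] - df['B']) < thresh).tolist()  #  1.0 < 0.3
print(a_minus_b, b_minus_a, symmetric)
```

The signed test reports a match for A minus B only because the difference is negative, even though the two values are a full unit apart; the absolute-value test gives the same answer whichever column comes first.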

edited Dec 3, 2016 at 0:31

answered Dec 2, 2016 at 23:43

Andras Deak -- Слава Україні


4 Comments

nipy Over a year ago

Thanks @Andras Deak. Maybe there is a problem with my question though as line 10 shows 3 matches and the difference between 1.260968 and 0.690971 is > thresh. I am looking for matches when the difference between the numbers is < thresh.

Andras Deak -- Слава Україні Over a year ago

@adele see my last code block: the order we have is ('B','C'), so we compute 'B' column minus 'C' column, which is negative for this case. You might want it the other way around, by swapping k1 and k2 in the list comprehension (edited; now a generator expression, inside sum()); but the most likely case is that you need the absolute value of the difference, rather than the difference itself. Do you see what I mean?

nipy Over a year ago

Can you pls show me how I get the absolute value of the difference and I will see if this gives the result I was expecting, thanks

Andras Deak -- Слава Україні Over a year ago

@adele I'm glad I could help:)

Alex · Accepted Answer · 2016-12-02 23:46:16Z

1

Try this guy:

df2['matches'] = df2.apply(lambda x: sum(x.iloc[i] - x.iloc[j] <= thresh for i, j in [(0, 1), (0, 2), (1, 2)]), axis=1)

It could be generalized to any number of columns if necessary.
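One possible generalization, sketched under the assumption that absolute differences and a strict < comparison are what is wanted. The helper name count_close_pairs and the sample values are hypothetical, not part of the original answer.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def count_close_pairs(frame, columns, thresh):
    """Count, per row, the column pairs whose absolute difference is < thresh."""
    values = frame[list(columns)].to_numpy()
    counts = np.zeros(len(frame), dtype=int)
    # Visit every unordered pair of column positions exactly once.
    for i, j in combinations(range(values.shape[1]), 2):
        counts += np.abs(values[:, i] - values[:, j]) < thresh
    return pd.Series(counts, index=frame.index)


# Hand-picked illustrative values with four columns (an assumption).
df4 = pd.DataFrame({'A': [1.0, 0.0], 'B': [1.1, 1.0],
                    'C': [1.2, 2.0], 'D': [1.25, 3.0]})
result = count_close_pairs(df4, ['A', 'B', 'C', 'D'], 0.3)
print(result.tolist())
```

Passing the column names explicitly keeps an existing 'matches' column out of the comparison, and working on the NumPy array avoids a Python-level call per row.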

answered Dec 2, 2016 at 23:46

Alex


calico_ · Accepted Answer · 2016-12-03 00:04:53Z

-2

Here's a way to do it:

df2 = pd.DataFrame(np.random.randn(12, 3), columns=['A', 'B', 'C'])
thresh = 0.3

newcol = []
for row in df2.iterrows():
     newcol.append(sum([v > thresh for v in list(row[1])]))
df2['matches'] = newcol

answered Dec 3, 2016 at 0:04

calico_


1 Comment

Andras Deak -- Слава Україні Over a year ago

"How many columns are >thresh" would be answerable with much less work; that is not the question.
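For reference, the simpler count the comment alludes to (how many columns exceed thresh in each row) could be sketched as below. The sample values are hand-picked assumptions, and this still does not answer the pairwise question.

```python
import pandas as pd

# Hand-picked illustrative values (an assumption, not from the original post).
df2 = pd.DataFrame({'A': [0.5, 0.1], 'B': [0.4, -0.2], 'C': [-1.0, 0.2]})
thresh = 0.3

# Compare every cell with thresh, then count the True values along each row.
above = (df2[['A', 'B', 'C']] > thresh).sum(axis=1)
print(above.tolist())
```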
