I have a DataFrame (df) that resembles the following:

A    B   
1    2
1    3
1    4
2    5
4    6
4    7
8    9
9    8

I would like to add a column that essentially determines a related cluster based upon the values in columns A and B:

A    B    C   
1    2    a
1    3    a
1    4    a
2    5    a
4    6    a
4    7    a
8    9    b
9    8    b

Note that since 1 (in A) is related to 2 (in B), and 2 (in A) is related to 5 (in B), these are all placed in the same cluster. 8 (in A) is only related to 9 (in B), so these are placed in another cluster.

To sum up, how do I define clusters based upon pairwise connections where pairs are defined by two columns in a DataFrame?

2 Answers

4

You can view this as a set consolidation problem (with each row describing a set) or a connected component problem (with each row describing an edge between two nodes). AFAIK there's no native support for this, although I've considered submitting a PR adding it to the utility tools.

Anyway, you could do something like:

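def consolidate(sets):
    # Merge any sets that share an element (iterative set consolidation).
    # http://rosettacode.org/wiki/Set_consolidation#Python:_Iterative
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                if not s1.isdisjoint(s2):
                    # Fold s1 into s2, then keep merging from s2.
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]

def group_ids(pairs):
    # Map every value to the index of the group that contains it.
    groups = consolidate(map(set, pairs))
    d = {}
    # Sets have no total order, so sort the groups by their smallest element.
    for i, group in enumerate(sorted(groups, key=min)):
        for elem in group:
            d[elem] = i
    return d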

after which we have

>>> df["C"] = df["A"].replace(group_ids(zip(df.A, df.B)))
>>> df
   A  B  C
0  1  2  0
1  1  3  0
2  1  4  0
3  2  5  0
4  4  6  0
5  4  7  0
6  8  9  1
7  9  8  1

and you can replace the 0s and 1s by whatever you want.
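If the pairwise set comparison above gets slow on larger inputs, the connected-component view can be sketched with a union-find (disjoint-set) structure instead. This is a different technique from the consolidation code above, not a drop-in copy of it, and the name `cluster_labels` is hypothetical:

```python
def cluster_labels(pairs):
    """Return {value: cluster_id} using a union-find (disjoint-set) structure."""
    parent = {}

    def find(x):
        # Register unseen values as their own root.
        parent.setdefault(x, x)
        # Walk up to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each pair is an edge: join the two clusters it touches.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Number the clusters in order of first appearance.
    ids = {}
    return {value: ids.setdefault(find(value), len(ids)) for value in parent}


pairs = [(1, 2), (1, 3), (1, 4), (2, 5), (4, 6), (4, 7), (8, 9), (9, 8)]
labels = cluster_labels(pairs)
```

With the question's data this should put 1 through 7 in one cluster and 8 and 9 in another. Something like `df["C"] = df["A"].map(cluster_labels(zip(df.A, df.B)))` would then attach the ids to the frame.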

0

Here is a start (I'm not sure I understood the criteria for grouping into clusters, but you should be able to add the exact criteria):

import pandas as pd

x = pd.DataFrame({'A': [1,1,1,2,4,4,8,9],
              'B': [2,3,4,5,6,7,9,8]})

## calculate difference between A and B columns
## (substitute any distance/association function)
x['Diff'] = abs(x['A'] - x['B'])

## assign whether row is in a cluster or not.
x['Incluster'] = x['Diff'] <= 1

2 Comments

Clusters are defined by whether or not there is a pairwise connection between the two values. As in my example, (1,2) + (2,5) implies (1,5). In addition, there are likely several hundred clusters in my data, so a binary determination of cluster-hood will not be sufficient.
OK, if I understand correctly, the dataframe is an edgelist representing connections in a graph. If so, you can use clustering in graphs: igraph.org/python/doc/igraph.clustering-module.html or networkx.github.io/documentation/latest/reference/generated/…
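To illustrate the edge-list reading without any extra dependency, here is a minimal sketch that finds connected components with a breadth-first search over plain Python containers. The function name `edge_components` is hypothetical, and this is only a stand-in for what a graph library would do. With networkx, for example, `nx.from_pandas_edgelist(df, "A", "B")` followed by `nx.connected_components(G)` should give the same grouping.

```python
from collections import defaultdict, deque


def edge_components(edges):
    """Yield each connected component of an undirected edge list as a set."""
    # Build an adjacency map; every endpoint becomes a key.
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects its component.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node] - component:
                component.add(neighbour)
                queue.append(neighbour)
        seen |= component
        yield component


edges = [(1, 2), (1, 3), (1, 4), (2, 5), (4, 6), (4, 7), (8, 9), (9, 8)]
components = list(edge_components(edges))
```

Each row of the DataFrame is treated as an undirected edge, so a value in A and the same value in B are the same node. Numbering the yielded sets gives one label per cluster, however many clusters there are.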
