Grouping Data into Clusters Based on DataFrame Columns

Question

I have a DataFrame (df) with columns A and B. I would like to add a column C that identifies a related cluster based on the values in columns A and B, as in the following:

A    B    C   
1    2    a
1    3    a
1    4    a
2    5    a
3    1    a
3    2    a
4    6    a
4    7    a
8    9    b
9    8    b

Note that since 1 (in A) is related to 2 (in B), and 2 (in A) is related to 5 (in B), these are all placed in the same cluster. 8 (in A) is related only to 9 (in B), so these two are placed in another cluster.

To sum up, how do I define clusters based upon pairwise connections where pairs are defined by two columns in a DataFrame?

DSM · Accepted Answer · 2015-07-21 19:41:28Z

You can view this as a set consolidation problem (with each row describing a set) or a connected component problem (with each row describing an edge between two nodes). As far as I know, pandas has no native support for this, although I've considered submitting a PR adding it to the utility tools.

Anyway, you could do something like:

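def consolidate(sets):
    # http://rosettacode.org/wiki/Set_consolidation#Python:_Iterative
    # Work on the non-empty sets only; they are merged in place.
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    # Fold s1 into the later overlapping set, empty s1,
                    # and keep scanning with the merged set.
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    # Sets emptied by a merge are dropped.
    return [s for s in setlist if s]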

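def group_ids(pairs):
    # Merge the pairs into disjoint groups, then number the groups.
    groups = consolidate(map(set, pairs))
    d = {}
    # Sort by smallest member: sets compare by subset only, so a bare
    # sorted(groups) would not give a meaningful order.
    for i, group in enumerate(sorted(groups, key=min)):
        for elem in group:
            d[elem] = i
    return d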

after which we have

>>> df["C"] = df["A"].replace(group_ids(zip(df.A, df.B)))
>>> df
   A  B  C
0  1  2  0
1  1  3  0
2  1  4  0
3  2  5  0
4  4  6  0
5  4  7  0
6  8  9  1
7  9  8  1

and you can replace the 0s and 1s by whatever you want.
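
For larger inputs, the connected-component view can also be sketched with a disjoint-set (union-find) structure, which avoids the repeated set comparisons in `consolidate`. This is a swapped-in technique, not the code from this answer, and the names `component_ids` and `find` are made up for illustration.

```python
def component_ids(pairs):
    """Map every value appearing in `pairs` to an integer component id."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root.
        parent.setdefault(x, x)
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        # Merge the components that contain a and b.
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the components in order of first appearance.
    ids = {}
    return {node: ids.setdefault(find(node), len(ids)) for node in list(parent)}


# Pairs taken from the A and B columns of the output shown above.
pairs = [(1, 2), (1, 3), (1, 4), (2, 5), (4, 6), (4, 7), (8, 9), (9, 8)]
result = component_ids(pairs)
```

Assuming the same `df` as above, `df["A"].map(component_ids(zip(df.A, df.B)))` should yield the same grouping as the `replace` call.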

ajerneck · Accepted Answer · 2015-07-21 19:20:31Z

Here is a start (I'm not sure I understood the criteria for grouping into clusters, but you should be able to add the exact criteria):

import pandas as pd

x = pd.DataFrame({'A': [1, 1, 1, 2, 4, 4, 8, 9],
                  'B': [2, 3, 4, 5, 6, 7, 9, 8]})

## calculate the difference between the A and B columns
## (substitute any distance/association function)
x['Diff'] = abs(x['A'] - x['B'])

## assign whether row is in a cluster or not.
x['Incluster'] = x['Diff'] <= 1

2 Comments

DrTRD Over a year ago

Clusters are defined by whether or not there is a pairwise connection between the two values. As in my example, (1,2) + (2,5) means (1,5). In addition, there are likely several hundred clusters in my data, so binary determinations of cluster-hood will not be sufficient.

ajerneck Over a year ago

OK, if I understand correctly, the dataframe is an edgelist representing connections in a graph. If so, you can use clustering in graphs: igraph.org/python/doc/igraph.clustering-module.html or networkx.github.io/documentation/latest/reference/generated/…
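
If installing a graph library is not an option, the edge-list reading can be sketched in plain Python: build an adjacency map from the (A, B) pairs and flood-fill from each unvisited node. The function name `edge_list_components` is hypothetical, and the sample edges are copied from the question's table. Graph libraries such as networkx offer a comparable `connected_components` routine.

```python
from collections import defaultdict


def edge_list_components(edges):
    """Return the connected components of an undirected edge list as sets."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in neighbours:
        if start in seen:
            continue
        # Iterative depth-first search from a node not yet assigned.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


# Edge list taken from columns A and B of the question's table.
edges = [(1, 2), (1, 3), (1, 4), (2, 5), (3, 1), (3, 2),
         (4, 6), (4, 7), (8, 9), (9, 8)]
components = edge_list_components(edges)
```

Each set in `components` is one cluster; enumerating the list gives an integer label per cluster, which scales to several hundred clusters without any binary in/out decision.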
