clustering strings - what algorithm is suitable?

Question

I have some strings and characters will not be repeated in a single string. for example: "AABC" is not possible.

I want to cluster them into sets by their common sub-strings. for example: "ABC, CDF, GHP" will be cluster into two sets {ABC,CDF},{GHP}.

several strings with one or more common sub-strings will be in one set. a string which has no common sub-string with any other strings will be a set itself. so keep the number of sets smallest.

for example: 1. "ABC, AHD,AKJ,LAN,WER" will be two sets {ABC, AHD,AKJ,LAN},{WER}. 2. "ABC,BDF, HLK, YHT,PX" will be 3 sets {ABC,BDF}.{HLK, YHT},{PX}.

Finding a string which has nothing common with others is easy I think;

for(i=0; i< strings.num; i++)
{  str1 = strings[i];
     bool m_com=false;
     for(j=0;j < strings.num; j++ )
     {
      str2=strings[j];
      if(hascommon(str1,str2))
         m_com=true;
     }
   if(!m_com)
   {
     str1 has no common substring with any string,
   }
}

now I am thinking about others, how to classify them, is there any algorithm suitable for this?

Input: strings (characters are not be repeated)

output: sets (keep number of sets as small as possible)

I know this involves with finding common sub-string problem and clustering. but I am not familiar with clustering techniques, so I am hoping some one could recommend me such algorithm.

while I am looking for good ways to do this, I also appreciate suggestions from others.

Tip: actually these strings are simple paths between two points in a graph. I want to find the edge whose removal cuts all these paths. the number of such edges should be minimum. so, for AB,BC,CD, it means a single path ABCD exist. and I write down a algorithm to find common substrings in my case(my case much simpler). I think I might use this algorithm during the clustering to measure similarities.

I might have two paths, {ABC, ADC}, both removing A or removing B could split the paths. or I could have {ABC, ADC,HG}, so removing {A,H}, or {CH}, or {CG},or {AG} all works.

I thought I could solve this by finding common subs-strings, then I decide where to remove edges.

If a string S is put into cluster A, does that mean that there EXISTS another string S' in cluster A which has a common substring with S or does that mean that ALL other strings A[0], A[1] ... have common substrings with S. Concrete case: AB, BC, CD. AB and BC are put into one cluster. What happens with CD? — David Frank
– David Frank, Commented Feb 4, 2016 at 11:12
all strings in one cluster share one or more common sub-strings. actually these strings are simple paths between two points. I want to find the edge whose removal cuts all these paths. such edge should be smallest. so, for AB,BC,CD, it means a single path ABCD :). I thought I could solve it by finding common parts of paths. — arslan
– arslan, Commented Feb 4, 2016 at 13:05
So if I understand correctly, you have vertices A, B, C etc. and AB is an edge between A and B. You want to find a set of edges, by removing which, your graph is no longer connected. So you basically want to find out the edge-connectivity of a graph, which can be done by finding a maximum flow. See Computational aspects on this page: en.wikipedia.org/wiki/K-edge-connected_graph — David Frank
– David Frank, Commented Feb 4, 2016 at 13:46
In this case, build a graph, add non-removable edges with capacity infinity(you can use the total number of edges to represent infinity) and removable edges with capacity 1. Find a min-cut. If the value of the min-cut is less than infinity(ie. total number of edges), remove the cut edges, otherwise the non-removable edges connect the graph, so there is no solution. — David Frank
– David Frank, Commented Feb 5, 2016 at 20:29
I suggest you taking a look at Boost Graph Library. It implements the function stoer_wagner_min_cut which finds the solution without much hassle. However, you'll have to understand how BGL operates + you will need a good understanding of (maybe not-so) basic graph theory concepts. — David Frank
– David Frank, Commented Feb 5, 2016 at 20:30

WhatsUp · Accepted Answer · 2016-02-04 11:29:40Z

One thing should be pointed out first:

For any two strings, "having common substring" is really equivalent to "having common letter". Thus we can replace the condition by "having common letter".

Consider the graph G whose vertices are the strings, and two strings are connected by an edge if and only if they have a common letter. Then you are really asking for separate the graph G into connected components. This can be done easily, using standard graph operation algorithms, c.f. the wiki page here.

What remains is the task of establishing the graph. This is also easy: first, create 26 boxes, labelled A to Z, and read each string once. If the string contains letter A, then put it (or its index) into box A, etc. Finally, those strings inside one box have edges connecting to each other.

There can be further optimizations, but I guess it will depend on the nature of your input data.

Thanga · Accepted Answer · 2016-02-04 11:14:11Z

0

You have to use Heap's algorithm for your job to create permutations https://en.wikipedia.org/wiki/Heap's_algorithm

answered Feb 4, 2016 at 11:14

Thanga

8,2003 gold badges22 silver badges39 bronze badges

Comments

David Frank · Accepted Answer · 2016-02-04 11:58:03Z

0

As opposed to WhatsUp, I assume you want any two strings in a subset to have a common substring. This means that for AB, BC, CD, {AB, BC, CD} is not a valid solution, because AB and CD do not have a common substring.

As Whatsup already pointed out, you can represent your strings as a graph, where vertices are the strings and and edge goes from one to the other if they have a common character.

If we are not accepting chains (as described at the beginning), the problem becomes finding a minimum clique cover, which is unfortunately NP-complete.

answered Feb 4, 2016 at 11:58

David Frank

6,0928 gold badges30 silver badges43 bronze badges

2 Comments

arslan Over a year ago

actually these strings are paths between two points in a graph. I want to classify these paths into sets, all paths in one set have some shared edges. so by classifying them into several sets (I don't know how many sets in advance), then the shared edges in each set is the edges removing them will cut all these paths in this set. same for other sets, so I think i can cut all paths in way.

arslan Over a year ago

I think my problem is somewhat different than minimum clique over problem. Also I don't care for time complexity right now. so any idea, how to classify these paths into several sets by judging if two paths have shared edges. at last, all paths in one set will have a common sharing edges, such as in the set {AB,AD,AC} has 'A' is shared by all paths.

Collectives™ on Stack Overflow

clustering strings - what algorithm is suitable?

3 Answers 3

Comments

Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related