0

I have some strings and characters will not be repeated in a single string. for example: "AABC" is not possible.

I want to cluster them into sets by their common sub-strings. for example: "ABC, CDF, GHP" will be cluster into two sets {ABC,CDF},{GHP}.

several strings with one or more common sub-strings will be in one set. a string which has no common sub-string with any other strings will be a set itself. so keep the number of sets smallest.

for example: 1. "ABC, AHD,AKJ,LAN,WER" will be two sets {ABC, AHD,AKJ,LAN},{WER}. 2. "ABC,BDF, HLK, YHT,PX" will be 3 sets {ABC,BDF}.{HLK, YHT},{PX}.

Finding a string which has nothing common with others is easy I think;

for(i=0; i< strings.num; i++)
{  str1 = strings[i];
     bool m_com=false;
     for(j=0;j < strings.num; j++ )
     {
      str2=strings[j];
      if(hascommon(str1,str2))
         m_com=true;
     }
   if(!m_com)
   {
     str1 has no common substring with any string,
   }
}

now I am thinking about others, how to classify them, is there any algorithm suitable for this?

Input: strings (characters are not be repeated)

output: sets (keep number of sets as small as possible)

I know this involves with finding common sub-string problem and clustering. but I am not familiar with clustering techniques, so I am hoping some one could recommend me such algorithm.

while I am looking for good ways to do this, I also appreciate suggestions from others.

Tip: actually these strings are simple paths between two points in a graph. I want to find the edge whose removal cuts all these paths. the number of such edges should be minimum. so, for AB,BC,CD, it means a single path ABCD exist. and I write down a algorithm to find common substrings in my case(my case much simpler). I think I might use this algorithm during the clustering to measure similarities.

I might have two paths, {ABC, ADC}, both removing A or removing B could split the paths. or I could have {ABC, ADC,HG}, so removing {A,H}, or {CH}, or {CG},or {AG} all works.

I thought I could solve this by finding common subs-strings, then I decide where to remove edges.

10
  • 1
    If a string S is put into cluster A, does that mean that there EXISTS another string S' in cluster A which has a common substring with S or does that mean that ALL other strings A[0], A[1] ... have common substrings with S. Concrete case: AB, BC, CD. AB and BC are put into one cluster. What happens with CD? Commented Feb 4, 2016 at 11:12
  • all strings in one cluster share one or more common sub-strings. actually these strings are simple paths between two points. I want to find the edge whose removal cuts all these paths. such edge should be smallest. so, for AB,BC,CD, it means a single path ABCD :). I thought I could solve it by finding common parts of paths. Commented Feb 4, 2016 at 13:05
  • So if I understand correctly, you have vertices A, B, C etc. and AB is an edge between A and B. You want to find a set of edges, by removing which, your graph is no longer connected. So you basically want to find out the edge-connectivity of a graph, which can be done by finding a maximum flow. See Computational aspects on this page: en.wikipedia.org/wiki/K-edge-connected_graph Commented Feb 4, 2016 at 13:46
  • 1
    In this case, build a graph, add non-removable edges with capacity infinity(you can use the total number of edges to represent infinity) and removable edges with capacity 1. Find a min-cut. If the value of the min-cut is less than infinity(ie. total number of edges), remove the cut edges, otherwise the non-removable edges connect the graph, so there is no solution. Commented Feb 5, 2016 at 20:29
  • 1
    I suggest you taking a look at Boost Graph Library. It implements the function stoer_wagner_min_cut which finds the solution without much hassle. However, you'll have to understand how BGL operates + you will need a good understanding of (maybe not-so) basic graph theory concepts. Commented Feb 5, 2016 at 20:30

3 Answers 3

1

One thing should be pointed out first:

For any two strings, "having common substring" is really equivalent to "having common letter". Thus we can replace the condition by "having common letter".

Consider the graph G whose vertices are the strings, and two strings are connected by an edge if and only if they have a common letter. Then you are really asking for separate the graph G into connected components. This can be done easily, using standard graph operation algorithms, c.f. the wiki page here.

What remains is the task of establishing the graph. This is also easy: first, create 26 boxes, labelled A to Z, and read each string once. If the string contains letter A, then put it (or its index) into box A, etc. Finally, those strings inside one box have edges connecting to each other.

There can be further optimizations, but I guess it will depend on the nature of your input data.

Sign up to request clarification or add additional context in comments.

Comments

0

You have to use Heap's algorithm for your job to create permutations https://en.wikipedia.org/wiki/Heap's_algorithm

Comments

0

As opposed to WhatsUp, I assume you want any two strings in a subset to have a common substring. This means that for AB, BC, CD, {AB, BC, CD} is not a valid solution, because AB and CD do not have a common substring.

As Whatsup already pointed out, you can represent your strings as a graph, where vertices are the strings and and edge goes from one to the other if they have a common character.

If we are not accepting chains (as described at the beginning), the problem becomes finding a minimum clique cover, which is unfortunately NP-complete.

2 Comments

actually these strings are paths between two points in a graph. I want to classify these paths into sets, all paths in one set have some shared edges. so by classifying them into several sets (I don't know how many sets in advance), then the shared edges in each set is the edges removing them will cut all these paths in this set. same for other sets, so I think i can cut all paths in way.
I think my problem is somewhat different than minimum clique over problem. Also I don't care for time complexity right now. so any idea, how to classify these paths into several sets by judging if two paths have shared edges. at last, all paths in one set will have a common sharing edges, such as in the set {AB,AD,AC} has 'A' is shared by all paths.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.