I have some strings and characters will not be repeated in a single string. for example: "AABC" is not possible.
I want to cluster them into sets by their common sub-strings. for example: "ABC, CDF, GHP" will be cluster into two sets {ABC,CDF},{GHP}.
several strings with one or more common sub-strings will be in one set. a string which has no common sub-string with any other strings will be a set itself. so keep the number of sets smallest.
for example: 1. "ABC, AHD,AKJ,LAN,WER" will be two sets {ABC, AHD,AKJ,LAN},{WER}. 2. "ABC,BDF, HLK, YHT,PX" will be 3 sets {ABC,BDF}.{HLK, YHT},{PX}.
Finding a string which has nothing common with others is easy I think;
for(i=0; i< strings.num; i++)
{ str1 = strings[i];
bool m_com=false;
for(j=0;j < strings.num; j++ )
{
str2=strings[j];
if(hascommon(str1,str2))
m_com=true;
}
if(!m_com)
{
str1 has no common substring with any string,
}
}
now I am thinking about others, how to classify them, is there any algorithm suitable for this?
Input: strings (characters are not be repeated)
output: sets (keep number of sets as small as possible)
I know this involves with finding common sub-string problem and clustering. but I am not familiar with clustering techniques, so I am hoping some one could recommend me such algorithm.
while I am looking for good ways to do this, I also appreciate suggestions from others.
Tip: actually these strings are simple paths between two points in a graph. I want to find the edge whose removal cuts all these paths. the number of such edges should be minimum. so, for AB,BC,CD, it means a single path ABCD exist. and I write down a algorithm to find common substrings in my case(my case much simpler). I think I might use this algorithm during the clustering to measure similarities.
I might have two paths, {ABC, ADC}, both removing A or removing B could split the paths. or I could have {ABC, ADC,HG}, so removing {A,H}, or {CH}, or {CG},or {AG} all works.
I thought I could solve this by finding common subs-strings, then I decide where to remove edges.