3

I have 5 very long strings (the number of strings may change). There is no fixed format for these strings. I will provide a number that indicates the length of the substring, and I want to find every substring of that length that occurs in more than one string, together with its positions. For example, the strings are:

      1.     abcabcabc
      2.     abcasdfklop
     

Substring length: 3

Given these values the output will be something like this:

Match #1:

 Matched string:                 "abc"

 Matches in first string:        3

 Match positions:                0,3,6

 Matches in second string:       1

 Match positions:                0

Match #2:

 Matched string:                 "bca"

 Matches in first string:        2

 Match positions:                1,4

 Matches in second string:       1

 Match positions:                1

I managed to do it with 4 foreach statements, but that seems too inefficient to me, especially if the inputs are very large. Is there a shorter or more efficient way to do this in C#?

5
  • Try using Regular Expressions. Commented Apr 22, 2013 at 6:23
  • can you please explain a bit more Commented Apr 22, 2013 at 6:27
  • 4 loops seem to be reasonable. Since this is likely to apply to a very narrow field (DNA sequencing?), don't expect too much from generic forums... Commented Apr 22, 2013 at 6:30
  • Yes, it's exactly DNA sequencing, actually. It will be part of a scientific study, so I am trying to give the best result :) Thanks for your comment. Commented Apr 22, 2013 at 6:32
  • You can show us your current code so that we can comment on whether there is any room for optimization. Commented Apr 22, 2013 at 6:38

4 Answers

3

You can do this with a suffix array. (Suffix trees will work fine too, but they require a bit more space, time, and care in implementation.)

Concatenate your two strings, separating them with a character that occurs in neither one. Then build a suffix array. Then you can read off your answer.

Standard suffix arrays give you a lexicographically sorted array of pointers to suffixes of the string together with a "longest common prefix length" array telling you how long the longest common prefix of two lexicographically consecutive suffixes is.

It is fairly straightforward to use the longest common prefix length array to get the information you want. Find all maximal subarrays of the longest common prefix length array in which every entry is at least the query length. Then, for each one that has a match both in the first string and in the second string, report the appropriate prefix and report that it occurs K+1 times in total, where K is the length of the maximal subarray.
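As a rough illustration of the suffix array idea, here is a minimal sketch. The class name `SuffixArrayMatcher`, the method `Find`, and the default separator `'\u0001'` are all assumptions made for this example; the separator is assumed to occur in neither input. The suffix array is built with a plain comparison sort, which is simple but far slower on long inputs than a dedicated construction algorithm, and the longest common prefix array is filled with Kasai's algorithm.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the original answer.
public static class SuffixArrayMatcher
{
    public static List<(string Text, List<int> InA, List<int> InB)> Find(
        string a, string b, int k, char separator = '\u0001')
    {
        var result = new List<(string Text, List<int> InA, List<int> InB)>();
        if (k <= 0) return result;

        // Assumption: separator occurs in neither a nor b.
        string s = a + separator + b;
        int n = s.Length;

        // Naive suffix array: sort suffix start positions by ordinal comparison.
        int[] sa = new int[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        Array.Sort(sa, (x, y) => string.CompareOrdinal(s, x, s, y, n));

        // Kasai's algorithm: lcp[r] is the common prefix length of suffixes sa[r-1] and sa[r].
        int[] rank = new int[n];
        for (int r = 0; r < n; r++) rank[sa[r]] = r;
        int[] lcp = new int[n];
        int h = 0;
        for (int i = 0; i < n; i++)
        {
            if (rank[i] == 0) { h = 0; continue; }
            int j = sa[rank[i] - 1];
            while (i + h < n && j + h < n && s[i + h] == s[j + h]) h++;
            lcp[rank[i]] = h;
            if (h > 0) h--;
        }

        // Each maximal run of lcp entries >= k groups suffixes sharing their first k characters.
        int pos = 1;
        while (pos < n)
        {
            if (lcp[pos] < k) { pos++; continue; }
            int start = pos - 1;
            while (pos < n && lcp[pos] >= k) pos++;

            var inA = new List<int>();
            var inB = new List<int>();
            for (int r = start; r < pos; r++)
            {
                if (sa[r] < a.Length) inA.Add(sa[r]);
                else inB.Add(sa[r] - a.Length - 1); // shift back past a and the separator
            }
            if (inA.Count > 0 && inB.Count > 0)
            {
                inA.Sort();
                inB.Sort();
                result.Add((s.Substring(sa[start], k), inA, inB));
            }
        }
        return result;
    }
}
```

Calling `SuffixArrayMatcher.Find("abcabcabc", "abcasdfklop", 3)` should report the two matches from the question, in ordinal order of the matched substring. A run of K entries covers K+1 suffixes, which is where the occurrence count comes from.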

Another approach that's easier to code is to hash all substrings of the appropriate length. You can do this easily with any rolling hash function. Store a dynamic array of pointers into the strings for each hash; once you've hashed all the strings, iterate over all of the hashes that came up and look for matches. You'll need to deal with the false positives somehow; one (probabilistic) approach is to use several hash functions until the false positive probability is acceptably small. Another approach, which is likely only acceptable in the case where you have few matches, is to compare the strings directly.
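The hashing approach could be sketched roughly as follows. This is a minimal sketch, not a tuned implementation: the class name `RollingHashMatcher`, the constants `Base` and `Mod`, and the result shape are assumptions made for the example. It uses a single Rabin-Karp style rolling hash and handles false positives by comparing the candidate substrings directly, which is the second of the two options described above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; none of these names come from the answer.
public static class RollingHashMatcher
{
    private const ulong Base = 257;            // assumed base
    private const ulong Mod = 1_000_000_007;   // assumed prime modulus

    // Groups the start index of every length-k window of s by its rolling hash.
    private static Dictionary<ulong, List<int>> HashWindows(string s, int k)
    {
        var buckets = new Dictionary<ulong, List<int>>();
        if (k <= 0 || s.Length < k) return buckets;

        ulong high = 1;                        // Base^(k-1) mod Mod
        for (int i = 1; i < k; i++) high = high * Base % Mod;

        ulong h = 0;                           // hash of the first window
        for (int i = 0; i < k; i++) h = (h * Base + s[i]) % Mod;

        for (int start = 0; ; start++)
        {
            if (!buckets.TryGetValue(h, out var list)) buckets[h] = list = new List<int>();
            list.Add(start);
            if (start + k >= s.Length) break;
            h = (h + Mod - s[start] * high % Mod) % Mod;   // drop the leftmost character
            h = (h * Base + s[start + k]) % Mod;           // append the next character
        }
        return buckets;
    }

    public static List<(string Text, List<int> InA, List<int> InB)> Find(string a, string b, int k)
    {
        var result = new List<(string Text, List<int> InA, List<int> InB)>();
        var bucketsA = HashWindows(a, k);
        var bucketsB = HashWindows(b, k);

        foreach (var bucket in bucketsA)
        {
            if (!bucketsB.TryGetValue(bucket.Key, out var candidatesB)) continue;

            // Split the bucket by actual content so that hash collisions are discarded.
            var groups = new List<List<int>>();
            foreach (int i in bucket.Value)
            {
                var g = groups.Find(x => string.CompareOrdinal(a, x[0], a, i, k) == 0);
                if (g == null) groups.Add(g = new List<int>());
                g.Add(i);
            }
            foreach (var g in groups)
            {
                var inB = candidatesB.FindAll(j => string.CompareOrdinal(a, g[0], b, j, k) == 0);
                if (inB.Count > 0) result.Add((a.Substring(g[0], k), g, inB));
            }
        }
        return result;
    }
}
```

The order of the returned matches follows the dictionary's enumeration order, which is not guaranteed, so callers that need a stable order should sort the result. With more than two input strings, the same bucket structure could be kept per string.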

2

If you managed to do this in 4 foreach statements that are not nested then you should be good and you probably don’t need to optimize.

Here is something I’d try. Create a structure that looks something like this

class SubString
{
    public string str;    // the substring text
    public int position;  // start index of the substring in its source string
}

Divide both strings into all possible substrings and store these in one array. This has O(n^2) complexity.

Now sort the array by string length and then by content (O(n*log(n)) comparisons for n entries) and go through it to identify matches; equal substrings end up next to each other.

You’ll need additional structure to hold the results and this probably needs some more tweaking but you see where this is going.
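One way this sort-and-scan idea might look is sketched below. It is an adaptation rather than the exact scheme described: because the question supplies the substring length, only windows of that length `k` are collected, and each entry carries a source tag in addition to the position. The names `SortedWindowMatcher`, `Find`, and the tuple fields are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the original answer.
public static class SortedWindowMatcher
{
    public static List<(string Text, List<int> InA, List<int> InB)> Find(string a, string b, int k)
    {
        var result = new List<(string Text, List<int> InA, List<int> InB)>();
        if (k <= 0) return result;

        // Collect every window of length k; Source is 0 for a and 1 for b.
        var windows = new List<(string Text, int Source, int Position)>();
        for (int i = 0; i + k <= a.Length; i++) windows.Add((a.Substring(i, k), 0, i));
        for (int i = 0; i + k <= b.Length; i++) windows.Add((b.Substring(i, k), 1, i));

        // Sort by content, then source, then position, so equal windows become adjacent.
        windows.Sort((x, y) =>
        {
            int c = string.CompareOrdinal(x.Text, y.Text);
            if (c != 0) return c;
            c = x.Source.CompareTo(y.Source);
            return c != 0 ? c : x.Position.CompareTo(y.Position);
        });

        // Walk the sorted list one group of equal windows at a time.
        int start = 0;
        while (start < windows.Count)
        {
            var inA = new List<int>();
            var inB = new List<int>();
            int end = start;
            while (end < windows.Count && windows[end].Text == windows[start].Text)
            {
                (windows[end].Source == 0 ? inA : inB).Add(windows[end].Position);
                end++;
            }
            if (inA.Count > 0 && inB.Count > 0) result.Add((windows[start].Text, inA, inB));
            start = end;
        }
        return result;
    }
}
```

This version materializes every window as its own string, so memory grows with the input length times `k`; for very long inputs, storing only positions and comparing in place would be lighter.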

1

You could use a variant of suffix tree to solve this problem. http://en.wikipedia.org/wiki/Longest_common_substring_problem Also check this out: Algorithm: Find all common substrings between two strings where order is preserved

0

If using very large strings, memory may become a problem. The code below finds the longest common substring and writes over the variable containing smaller common substrings, but could easily be altered to push the index and length to a list which is then returned as an array of strings.

This is C# code refactored from the C++ code by Ashutosh Singh at https://iq.opengenus.org/longest-common-substring-using-rolling-hash/. It finds the longest common substring in O(N * log(N)^2) time and O(N) space.

using System;
using System.Collections.Generic;
public class RollingHash
{
    private class RollingHashPowers
    {
        // _mod = prime modulus of polynomial hashing
        // any prime number over a billion should suffice
        internal const int _mod = (int)1e9 + 123;
        // _hashBase = base (point of hashing)
        // this should be a prime number larger than the number of characters used
        // in my use case I am only interested in extended ASCII (256 possible character values)
        // for strings in languages using non-latin characters, this should be much larger
        internal const long _hashBase = 257;
        // _pow1 = powers of base modulo mod
        internal readonly List<int> _pow1 = new List<int> { 1 };
        // _pow2 = powers of base modulo 2^64
        internal readonly List<long> _pow2 = new List<long> { 1L };

        internal void EnsureLength(int length)
        {
            // The loop below leaves length + 1 entries, so reserve that many up front.
            if (_pow1.Capacity < length + 1)
            {
                _pow1.Capacity = _pow2.Capacity = length + 1;
            }
            for (int currentIndx = _pow1.Count - 1; currentIndx < length; ++currentIndx)
            {
                _pow1.Add((int)(_pow1[currentIndx] * _hashBase % _mod));
                _pow2.Add(_pow2[currentIndx] * _hashBase);
            }
        }
    }

    private class RollingHashedString
    {
        readonly RollingHashPowers _pows;
        readonly int[] _pref1; // Hash on prefix modulo mod
        readonly long[] _pref2; // Hash on prefix modulo 2^64

        // Constructor from string:
        internal RollingHashedString(RollingHashPowers pows, string s, bool caseInsensitive = false)
        {
            _pows = pows;
            _pref1 = new int[s.Length + 1];
            _pref2 = new long[s.Length + 1];

            const long capAVal = 'A';
            const long capZVal = 'Z';
            const long aADif = 'a' - 'A';

            // Fill arrays with polynomial hashes on prefix.
            // Plain indexing is used so the class compiles without the /unsafe switch.
            for (int i = 0; i < s.Length; ++i)
            {
                long v = s[i];
                if (caseInsensitive && capAVal <= v && v <= capZVal)
                {
                    v += aADif;
                }
                _pref1[i + 1] = (int)((_pref1[i] + v * _pows._pow1[i]) % RollingHashPowers._mod);
                _pref2[i + 1] = _pref2[i] + v * _pows._pow2[i];
            }
        }

        // Polynomial hash of the substring [pos, pos+len).
        // If mxPow != 0, the value is multiplied by the power of the base needed to
        // bring it to degree mxPow, so hashes taken at different positions are comparable.
        internal Tuple<int, long> Apply(int pos, int len, int mxPow = 0)
        {
            int hash1 = _pref1[pos + len] - _pref1[pos];
            long hash2 = _pref2[pos + len] - _pref2[pos];
            if (hash1 < 0)
            {
                hash1 += RollingHashPowers._mod;
            }
            if (mxPow != 0)
            {
                hash1 = (int)((long)hash1 * _pows._pow1[mxPow - (pos + len - 1)] % RollingHashPowers._mod);
                hash2 *= _pows._pow2[mxPow - (pos + len - 1)];
            }
            return Tuple.Create(hash1, hash2);
        }
    }

    private readonly RollingHashPowers _rhp;
    public RollingHash(int longestLength = 0)
    {
        _rhp = new RollingHashPowers();
        if (longestLength > 0)
        {
            _rhp.EnsureLength(longestLength);
        }
    }

    public string FindCommonSubstring(string a, string b, bool caseInsensitive = false)
    {
        // Calculate the maximum needed power of the base:
        int mxPow = Math.Max(a.Length, b.Length);
        _rhp.EnsureLength(mxPow);
        // Create hashing objects from strings:
        RollingHashedString hash_a = new RollingHashedString(_rhp, a, caseInsensitive);
        RollingHashedString hash_b = new RollingHashedString(_rhp, b, caseInsensitive);

        // Binary search by length of same subsequence:
        int pos = -1;
        int low = 0;
        int minLen = Math.Min(a.Length, b.Length);
        int high = minLen + 1;
        var tupleCompare = Comparer<Tuple<int, long>>.Default;
        while (high - low > 1)
        {
            int mid = (low + high) / 2;
            List<Tuple<int, long>> hashes = new List<Tuple<int, long>>(a.Length - mid + 1);
            for (int i = 0; i + mid <= a.Length; ++i)
            {
                hashes.Add(hash_a.Apply(i, mid, mxPow));
            }
            hashes.Sort(tupleCompare);
            int p = -1;
            for (int i = 0; i + mid <= b.Length; ++i)
            {
                if (hashes.BinarySearch(hash_b.Apply(i, mid, mxPow), tupleCompare) >= 0)
                {
                    p = i;
                    break;
                }
            }
            if (p >= 0)
            {
                low = mid;
                pos = p;
            }
            else
            {
                high = mid;
            }
        }
        // Output answer:
        return pos >= 0
            ? b.Substring(pos, low)
            : string.Empty;
    }
}
