Understanding the algorithm for pattern matching using an LCP array

Question

Foreword: My question is mainly an algorithmic question, so even if you are not familiar with suffix and LCP arrays you can probably help me.

In this paper it is described how to efficiently use suffix and LCP arrays for string pattern matching.

I understood SA and LCP work and how the algorithm's runtime can be improved from O(P*log(N)) (where P is the length of the pattern and N is length of the string) to O(P+log(N)) (Thanks to Chris Eelmaa's answer here and jogojapans answer here).

I was trying to go through the algorithm in figure 4 which explains the usage of LLcp and RLcp. But I have problems understanding how it works.

The algorithm (taken from the source):

Pattern matching algorithm

Explanation of the used variable names:

lcp(v,w) : Length of the longest common prefix of v and w
W = w0..wP-1 : pattern of length P
A = a0..aN-1 : the text (length N)
Pos[0..N-1] : suffix array
L_W : index (in A) of first occurrence of the matched pattern
M : middle index of current substring
L : lower bound
R : upper bound
Lcp : array of size N-2 such that Lcp[M] = lcp(A_Pos[L_M], A_pos[M]) where L_M is the lower bound of the unique interval with M in the middle
Rcp : array of size N-2 such that Rcp[M] = lcp(A_Pos[R_M], A_pos[M]) where R_M is the upper bound of the unique interval with M in the middle

Now I want to try the algorithm using the following example (partly taken from here):

 SA | LCP | Suffix entry
 -----------------------
 5  | N/A | a  
 3  | 1   | ana  
 1  | 3   | anana  
 0  | 0   | banana  
 4  | 0   | na  
 2  | 2   | nana 

A = "banana" ; N = 6
W = "ban" ; P = 3

I want to try to match a string, say ban and would expect the algorithm to return 0 as L_W.

Here is how I would step through the algorithm:

l = lcp("a", "ban") = 0
r = lcp("nana", "ban") = 0
if 0 = 3 or 'b' =< 'a' then // which is NOT the case for both conditions
    L_W = 0
else if 0 < 3 or 'b' =< 'n' then // which is the case for both conditions 
    L_W = 6 // which means 'not found'
...
...

I feel like I am missing something but I can't find out what. Also I am wondering how the precomputed LCP array can be used instead of calling lcp(v,w).

You have to calculate lcp(v, w) yourself, you can't deduce this from precomputed arrays - any lcp involving the input string will take time. That is where the P in P + logN comes from. As for the step through algorithm - that's interesting. I'll look it one day. — Erti-Chris Eelmaa
– Erti-Chris Eelmaa, Commented Feb 11, 2015 at 11:17
Your error is in that you assume w1 refers to first character in W. It doesn't. It's actually second character in W, which would make it so: 'a' <= 'a' which evaluates to true. — Erti-Chris Eelmaa
– Erti-Chris Eelmaa, Commented Feb 11, 2015 at 11:26
But it is not w1, it is wl with l=0 and therefore the first character in W (i.e. wl = 'b') — Paddre
– Paddre, Commented Feb 12, 2015 at 15:22
yes, you're right. As of this point, I have no explanation. I did take a look at PlogN implementation, and it made sense - the P+logN not so much. — Erti-Chris Eelmaa
– Erti-Chris Eelmaa, Commented Feb 12, 2015 at 15:43

LeeLee · Accepted Answer · 2018-04-15 13:04:32Z

1

I believe there was an error.

First condition is fairly easy to understand. When LCP length == pattern length, it's done. When your pattern is even smaller than or equal to the smallest one, then only choice is the smallest one.

The second condition is wrong. We can prove it by contradiction. r < P || Wr <= a... means r >= P && Wr > a... If r >= P, then how can we have Lw = N(not found), since we already have r length common prefix?

answered Apr 15, 2018 at 13:04

LeeLee

1781 silver badge6 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Paddre Over a year ago

Thanks for the concise answer.

LeeLee Over a year ago

@Paddre After 3 years, you still remember this question. WOW~ Did you successfully run this algorithm? I am having a serious problem. The LCP-LR algorithm from Erti-Chris Eelmaa is actually quite wrong. Firstly, range min from [a,b] = lcp from [a, b+1]. Chris' algorithm is doing range-min instead of lcp. Secondly, the left and the right sub range shall overlap 1unit on the mid point.

Jeremy · Accepted Answer · 2024-08-20 21:09:46Z

This answer takes reference from the following:

Manber, Myers '93
MIT's Suffix Array notes from 2007
@jogojapan's answer
Luke Videckis' implementation, which most of my code is based on

Recently I had been researching on string pattern matching and came across the described technique from various sources, but I hadn't seen a proper implementation of it.

Initially I coded something but it was wrong and I couldn't wrap my head around it. It wasn't until Luke shared his implementation that I finally understood what went wrong. Given the lack of actual code on this, I thought I would just add it here for future reference.

I shall first redefine some variables from OP's snippet. The original was fairly confusing for me.

l (lowercase) - running counter for matched characters from the left of the suffix array
r (lowercase) - running counter for matched characters from the right of the suffix array
L (uppercase) - binary search left bound
R (uppercase) - binary search right bound

Instead of using the LCP-LR array that is described in @jogojapan's answer, we look at MIT's Suffix Array notes from 2007. There, the lcp function is defined to be

lcp(l, r) = min(LCP[l], LCP[l + 1], ..., LCP[r - 1])

With some precomputation of the LCP array, you can build a Sparse Table in O(n) time to answer range minimum queries in O(1) time. Then, we have the following code:

pair<int, int> search(string &t, SparseTable<int> &lcp_st) {
    int match_l = mismatch(t.begin(), t.end(), s.begin() + SA.front(), s.end()).first - t.begin(),
        match_r = mismatch(t.begin(), t.end(), s.begin() + SA.back(), s.end()).first - t.begin(),
        match_m, l = 0, r = s.size(), m;

    auto update_match = [&]() {
        if (SA[m] + match_m < s.size() && s[SA[m] + match_m] > t[match_m]) {
            r = m;
            match_r = match_m;
        } else {
            l = m;
            match_l = match_m;
        }
    };

    auto match_lcp = [&](int m) -> pair<int, int> {
        auto search_half = [&](bool left) {
            int l = left ? -1 : m, r = left ? m : s.size();
            while (l + 1 < r) {
                int mid = l + (r - l) / 2;

                if (lcp_st.range_query(left ? mid : l, left ? r - 1 : mid - 1) < t.size()) (left ? l : r) = mid;
                else (left ? r : l) = mid;
            }
            return r;
        };

        return {search_half(true), search_half(false)};
    };

    while (l + 1 < r) {
        m = l + (r - l) / 2;

        if (match_l < match_r) {
            int lcp = lcp_st.range_query(m, r - 1);

            if (match_r < lcp) r = m;
            else if (match_r > lcp) l = m;
            else {
                match_m = mismatch(t.begin() + match_r, t.end(), s.begin() + SA[m] + match_r, s.end()).first - t.begin();

                if (match_m != t.size()) update_match();
                else return match_lcp(m);
            }
        } else {
            int lcp = lcp_st.range_query(l, m - 1);

            if (match_l < lcp) l = m;
            else if (match_l > lcp) r = m;
            else {
                match_m = mismatch(t.begin() + match_l, t.end(), s.begin() + SA[m] + match_l, s.end()).first - t.begin();

                if (match_m != t.size()) update_match();
                else return match_lcp(m);
            }
        }
    }

    return {l, r}; //dummy pair if no occurrences
}

where the following variables are defined as

s - text
t - pattern
SA - suffix array
lcp_st - minimum sparse table built from LCP array of the suffix array
lcp_st.range_query(l, r - 1) - returns lcp(l, r)
match_l - l (lowercase) as defined above
match_r - r (lowercase) as defined above
match_m - updated match with SA[m], continuing from the last matched character
l - binary search left bound as defined above
r - binary search right bound as defined above
m - binary search middle
update_match() - if the pattern hasn't been fully found, update search
match_lcp(m) - pattern has been found at SA[m], find SA index range of occurrences [l, r)

respectively in my code snippet.

Collectives™ on Stack Overflow

Understanding the algorithm for pattern matching using an LCP array

2 Answers 2

2 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related