Java implementation of Aho-Corasick string matching algorithm?

Question

I know there have been previous questions about this algorithm, but I honestly haven't come across a simple Java implementation. Many people have copied and pasted the same code into their GitHub profiles, and it's irritating me.

So, as an interview exercise, I set out to implement the algorithm using a different approach.

The algorithm turned out to be very, very challenging. I'm honestly lost on how to go about it. The logic just doesn't make sense. I've spent nearly 4 days straight sketching the approach, but to no avail.

Therefore please enlighten us with your wisdom.

I'm primarily basing my implementation on this post: "Intuition behind the Aho-Corasick string matching algorithm".

It would be a big bonus if someone could post their own implementation here.

Here is my incomplete, non-working solution, which is where I got really stuck:

If you're overwhelmed by the code, the main problem lies in the core Aho-Corasick algorithm. The trie for the dictionary is already built correctly.

The issue is: now that we have the trie structure, how do we actually start implementing the matching?

None of the resources I found were helpful.

import java.util.Arrays;
import java.util.HashMap;

public class DeterminingDNAHealth {
  private Trie tree;
  private String[] dictionary;
  private Node FailedNode;


  private DeterminingDNAHealth() {

  }

  private void buildMatchingMachine(String[] dictionary) {
    this.tree = new Trie();
    this.dictionary = dictionary;

    Arrays.stream(dictionary).forEach(tree::insert);

  }

  private void searchWords(String word, String[] dictionary) {

    buildMatchingMachine(dictionary);

    HashMap < Character, Node > children = tree.parent.getChildren();

    String matchedString = "";

    for (int i = 0; i < 3; i++) {
      char C = word.charAt(i);

      matchedString += C;

      matchedChar(C, matchedString);

    }

  }

  private void matchedChar(char C, String matchedString) {


    if (tree.parent.getChildren().containsKey(C) && dictionaryContains(matchedString)) {

      tree.parent = tree.parent.getChildren().get(C);

    } else {

      char suffix = matchedString.charAt(matchedString.length() - 2);

      if (!tree.parent.getParent().getChildren().containsKey(suffix)) {
        tree.parent = tree.parent.getParent();

      }


    }
  }

  private boolean dictionaryContains(String word) {

    return Arrays.asList(dictionary).contains(word);

  }


  public static void main(String[] args) {

    DeterminingDNAHealth DNA = new DeterminingDNAHealth();

    DNA.searchWords("abccab", new String[] {
      "a",
      "ab",
      "bc",
      "bca",
      "c",
      "caa"
    });


  }
}

I have set up a trie data structure, which works fine, so there's no problem here.

Trie.java

public class Trie {
  public Node parent;
  public Node fall;

  public Trie() {
    parent = new Node('⍜');
    parent.setParent(new Node());
  }

  public void insert(String word) {...}

  private boolean delete(String word) {...}

  public boolean search(String word) {...}

  public Node searchNode(String word) {...}

  public void printLevelOrderDFS(Node root) {...}

  public static void printLevel(Node node, int level) {...}

  public static int maxHeight(Node root) {...}

  public void printTrie() {...}

}

Same thing for Node.

Node.java

import java.util.HashMap;

public class Node {

  private char character;
  private Node parent;
  private HashMap<Character, Node> children = new HashMap<Character, Node>();
  private boolean leaf;

  // default case
  public Node() {}

  // constructor accepting the character
  public Node(char character) {
    this.character = character;
  }

  public void setCharacter(char character) {...}

  public char getCharacter() {...}

  public void setParent(Node parent) {...}

  public Node getParent() {...}

  public HashMap<Character, Node> getChildren() {...}

  public void setChildren(HashMap<Character, Node> children) {...}

  public void resetChildren() {...}

  public boolean isLeaf() {...}

  public void setLeaf(boolean leaf) {...}
}

"I honestly haven't come across a simple java implementation" --- "The algorithm turned out to be very, very challenging." --- Maybe that's why.
– Andreas, Commented Oct 24, 2017 at 23:13
Therefore, I'd like your help implementing an easier approach so that the majority can understand, right? @Andreas
– user7947407, Commented Oct 24, 2017 at 23:14
"The logic just doesn't make sense." --- If you don't understand the algorithm, there's no way for you to code it.
– Andreas, Commented Oct 24, 2017 at 23:14
I do understand the theory behind it, but the implementation is confusing. @Andreas
– user7947407, Commented Oct 24, 2017 at 23:15
Sorry, but you gave yourself a challenge that exceeds your capacity, so now you want us to code it for you? To write a simple implementation of a complex algorithm, something that is probably not even possible, otherwise someone would likely have done it already? This is not a code-writing service. Look for that elsewhere.
– Andreas, Commented Oct 24, 2017 at 23:17

templatetypedef · Accepted Answer · 2017-10-25 17:32:29Z

I usually teach a course on advanced data structures every other year, and we cover Aho-Corasick automata when exploring string data structures. There are slides available here that show how to develop the algorithm by optimizing several slower ones.

Generally speaking, I’d break the implementation down into four steps:

1. Build the trie. At its core, an Aho-Corasick automaton is a trie with some extra arrows tossed in. The first step in the algorithm is to construct this trie, and the good news is that this proceeds just like a regular trie construction. In fact, I’d recommend implementing this step by pretending you’re just making a trie, without doing anything to anticipate the later steps.
2. Add suffix (failure) links. This step adds the important failure links, which the matcher uses whenever it encounters a character that it can’t use to follow a trie edge. The best explanation I have for how these work is in the linked lecture slides. This step is implemented as a breadth-first search over the trie. Before you code this one up, I’d recommend working through a few examples by hand to make sure you get the general pattern. Once you do, this isn’t particularly tricky to code up. However, trying to code this up without fully getting how it works is going to make debugging a nightmare!
3. Add output links. This step is where you add the links that are used to report all the strings that match at a given node in the trie. It’s implemented through a second breadth-first search over the trie, and again, the best explanation I have for how it works is in the slides. The good news is that this step is actually a lot easier to implement than suffix link construction, because by then you’ll be more familiar with both how to do the BFS and how to walk down and up the trie. Again, don’t attempt to code this up unless you can comfortably do this by hand! You don’t need min code, but you don’t want to get stuck debugging code whose high-level behavior you don’t understand.
4. Implement the matcher. This step isn’t too bad! You just walk down the trie reading characters from the input, outputting all matches at each step and using the failure links whenever you get stuck and can’t advance downward.
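
The four steps above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the implementation from the slides and not the asker's `Trie`/`Node` API. The class name `AhoCorasickSketch`, the fields `fail`, `output`, and `word`, and the `word@startIndex` result format are all hypothetical names chosen for this example. It assumes every dictionary word is non-empty, and for brevity it fills in the output links during the same breadth-first pass as the failure links instead of running a second BFS.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch; all names here are illustrative assumptions.
class AhoCorasickSketch {

    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        Node fail;    // suffix (failure) link
        Node output;  // nearest word-ending node reachable via failure links, or null
        String word;  // non-null when a dictionary word ends at this node
    }

    private final Node root = new Node();

    // Assumes every dictionary word is non-empty.
    AhoCorasickSketch(String[] dictionary) {
        for (String w : dictionary) {
            insert(w);      // step 1: build the trie
        }
        buildLinks();       // steps 2 and 3: failure and output links
    }

    // Step 1: ordinary trie insertion, nothing Aho-Corasick specific.
    private void insert(String w) {
        Node cur = root;
        for (char c : w.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.word = w;
    }

    // Steps 2 and 3: BFS so that shallower nodes get their links first.
    private void buildLinks() {
        Queue<Node> queue = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.children.values()) {
            child.fail = root;          // depth-1 nodes always fall back to the root
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                // Follow the parent's failure chain until some node has a c-edge.
                Node f = node.fail;
                while (f != root && !f.children.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.children.get(c);
                child.fail = (target != null) ? target : root;
                // Output link: the failure target if it ends a word, else its output link.
                child.output = (child.fail.word != null) ? child.fail : child.fail.output;
                queue.add(child);
            }
        }
    }

    // Step 4: scan the text once, reporting each match as "word@startIndex".
    List<String> search(String text) {
        List<String> hits = new ArrayList<>();
        Node cur = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (cur != root && !cur.children.containsKey(c)) {
                cur = cur.fail;         // stuck: fall back along failure links
            }
            cur = cur.children.getOrDefault(c, root);
            if (cur.word != null) {
                hits.add(cur.word + "@" + (i - cur.word.length() + 1));
            }
            for (Node o = cur.output; o != null; o = o.output) {
                hits.add(o.word + "@" + (i - o.word.length() + 1));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        AhoCorasickSketch ac = new AhoCorasickSketch(
                new String[] {"a", "ab", "bc", "bca", "c", "caa"});
        System.out.println(ac.search("abccab"));
    }
}
```

With the dictionary and text from the question, `main` prints `[a@0, ab@0, bc@1, c@2, c@3, a@4, ab@4]`. Tracing that output by hand against the trie is a reasonable way to check your understanding of the failure and output links before writing your own version.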

I hope this gives you a more modular task breakdown and a reference about how the whole process is supposed to work. Good luck!

Thank you! I really appreciate you forwarding the slides. They're amazing! I like how you broke the implementation down into a few steps. Building the trie itself is very simple; the tricky part is the failure links and the matcher. I'll be studying your slides. Thanks again :)

Jim Mischel · Accepted Answer · 2017-10-25 16:44:32Z

You're not going to get a good understanding of the Aho-Corasick string matching algorithm by reading some source code. And you won't find a trivial implementation because the algorithm is non-trivial.

The original paper, Efficient String Matching: An Aid to Bibliographic Search, is well written and quite approachable. I suggest you download that PDF, read it carefully, think about it a bit, and read it again. Study the paper.

You might also find it useful to read others' descriptions of the algorithm. There are many, many pages with text descriptions, diagrams, PowerPoint slides, etc.

You probably want to spend at least a day or two studying those resources before you try to implement it. Because if you try to implement it without fully understanding how it works, you're going to be lost, and your implementation will show it. The algorithm isn't exactly simple, but it's quite approachable.

If you just want some code, there's a good implementation here: https://codereview.stackexchange.com/questions/115624/aho-corasick-for-multiple-exact-string-matching-in-java.

edited Oct 25, 2017 at 16:44

answered Oct 25, 2017 at 15:21

3 Comments

erickson Over a year ago

I expected the link, "Efficient String Matching: An Aid to Bibliographic Search," to point directly to the paper. If you have such a link, it would be great to include.

Jim Mischel Over a year ago

@erickson: Fixed the link. Thanks for the note.

user7947407 Over a year ago

Thank you so much for your answer, @JimMischel. I really appreciate you forwarding the link. I've just had a quick skim through it and it's amazing. I'm going to study it and hopefully get the algorithm. :)