3

Okay this is a bit of an involved question, but tl;dr it's basically how do you parse an "actual tree" using a "pattern tree"? How do you check if a particular tree instance is matched by a specific pattern tree?

To start, we have the structure of our pattern tree. The pattern tree can generally contain these types of nodes:

  • sequence node: Matches a sequence of items (zero or more).
  • optional node: Matches one or zero items.
  • class node: Delegates to another pattern tree to match.
  • first node: Matches the first child pattern it finds out of a set.
  • interlace node: Matches any of the child patterns in any order.
  • text node: Matches direct text.

That should be good enough for this question. There are a few more node types, but these are the main ones. Essentially it is like a regular expression or grammar tree.

We can start with a simple pattern tree:

<sequence>
  <text value="foo">
    <text value="bar" />
  </text>
</sequence>

This should match any of the following trees of text.

<foo><bar/></foo><foo><bar/></foo><foo><bar/></foo>
<foo><bar/></foo>
<foo><bar/></foo><foo><bar/></foo>

More specifically, you should imagine that as JSON and the pattern tree as JSON as well.

{
  "tag": "foo",
  "children": [
    { "tag": "bar" }
  ]
}

And the sequence pattern tree is like this:

{
  "type": "sequence",
  "children": [
    {
      "type": "text",
      "value": "foo",
      "children": [
        {
          "type": "text",
          "value": "bar"
        }
      ]
    }
  ]
}

A more complex example would be like this for a pattern tree:

matchThing = <text name="thing">
  <text />
  <sequence>
    <first>
      <class name="value"/>
      <class name="function"/>
    </first>
  </sequence>
</text>

matchFunction = <text name="function">
  <text />
  <sequence>
    <text name="input">
      <text />
    </text>
  </sequence>
  <sequence>
    <text name="call">
      <text />
    </text>
  </sequence>
</text>

matchValue = <text name="value">
  <text />
</text>

It would match text like this:

<thing>
  <call-me-anything />
  <function>
    <input><foo/></input>
    <input><bar/></input>
    <call><foo/></call>
    <call><foo/></call>
    <call><bar/></call>
    <call><bar/></call>
  </function>
  <function>
    <call><baz/></call>
  </function>
  <value>
    <hello/>
  </value>
  <function>
    <input><hello/></input>
    <call><world/></input>
  </function>
</thing>

So imagine that as JSON as well.

Wondering how you go about creating an algorithm for this. I am stuck at the beginning, but it seems to require something like recursive descent of some sort, but on trees rather than on sequences.

So you have a function:

function checkIfMatch(actualTree, patternTree) {
  for (node in actualTree) {
    if (!checkIfMatchesSubtree(node, patternTree)) {
      return false
    }
  }
  return true
}

I don't really know how to begin this one, and am searching Google for "tree pattern matching algorithms" or "arbology". Going to take a lot of time to try and translate those mathematical abstractions into code, if I am even going in the right direction. Wondering if you could help construct a simple algorithm for this case. It's hard to figure out how you should be traversing the actual tree, while also traversing the pattern tree, and keeping track of the position you are in each tree.

Spending quite a bit of time on it doesn't get me very far:

function parse(patternTree, textTree) {
  let state = { ti: 0, pi: 0 }
  while (state.ti < textTree.length) {
    let pattern = patternTree[state.pi]
    let result = parsePattern(pattern, textTree[state.ti])
    if (!result) {
      return
    }
  }
}

function parsePattern(patternNode, textNode, state) {
  if (patternNode.type == 'sequence') {
    return parseSequencePattern(patternNode, textNode, state)
  }
}

function parseSequencePattern(patternNode, textNode, state) {
  while (true) {
    let i = 0
    while (i < patternNode.children.length) {
      let childPattern = patternNode.children[i++]
      let result = parsePattern(childPattern, textNode)
      if (!result) {
        return false
      }

    }
  }
}


while (stack.length) {
  let {
    parents,
    node,
  } = stack.shift()

  stack.push({
    parents: [node, ...parents]
  })
}
18
  • 1
    Been some time since I had to deal with it, but I'm fairly sure that XML-schema supports pretty much everything you described. All you'd need would be a XML-to-JSON transpiler to support JSON and a XSD-validator for checking the pattern tree and you're good to go. Plus of course the code to compile whatever format your pattern-tree is defined in into an XML-schema. No need to reinvent the wheel Commented May 7, 2021 at 21:39
  • 1
    I want to learn how the algorithms are implemented, I used XML here as an example but my real project is using a custom data language, not XML. Commented May 7, 2021 at 22:00
  • 1
    Oh fun. Basically this is how object-scan works internally. Take a look at the code if you want inspiration. In particular the find.js file is of interest github.com/blackflux/object-scan/blob/master/src/core/find.js Commented May 7, 2021 at 23:04
  • 1
    I've spent a good deal of time on this issue, working on pattern matching in parse trees and tree refactoring. I learned that writing patterns using tree expressions such as XML or S-expressions is difficult. Instead of pattern matching with expressions that denote whole tree structures, it's much easier to reason and express paths through trees and the relationships between those sets. You should look at XPath. It is a well-defined language for this purpose. Alternatively, if you are just matching trees, you can just produce a sequence of DFS nodes between two nodes, and then compare that. Commented May 7, 2021 at 23:27
  • 2
    Also, you should read papers on computing minimal edit distance in trees. Zhang Shasha is highly readable and understandable. And it'll give you a new way to look at these problems. Commented May 7, 2021 at 23:42

2 Answers 2

1
+500

When attempting to match your pattern p to a tree t, you have to first match the root of p and its children to the root of t and t's children. If p does not match at the root, then the children of t have to be traversed, and each of those child values matched against p.

Based on your posted input and node types, there are three main input scenarios that a recursive pattern matching function has to handle:

  1. p(object) and t(object). Here, the pattern is an object with type and child keys. The input tree is also an object with at minimum tag and children keys. The pattern matching function delegates to a helper function that carries out the match for p's object.type.
  2. p(object) and t(Array). In this case, the tree is an Array of objects, and based on the pattern's object, the code has to decide how tree's Array should be treated (sequence, first, etc).
  3. p(Array) and t(Array). Here, both p and t are in the form of a sequence, and the match operation is to find, for every a in p's Array, a corresponding a1 in t's Array such that pattern_match(a, a1) produces true.

As a start, the code below implements a find_match function that supports the foundational nodes text and sequence:

//object that stores node match handlers
node_types = {'sequence':sequence_match, 'text':text_match}
//main find_match function
function find_match(pattern, tree_root, tree_children=null){
    var tree = tree_children === null? tree_root.children : tree_children;
    if (!Array.isArray(pattern)){ 
        //pattern is an object - delegate to handler for node type
        if (node_types[pattern.type](pattern, tree_root, tree)){
            return true
        }
        //no match at the root - check the children
        return tree.some(x => find_match(pattern, x, null));
    }
    //pattern is an array - delegate to sequence
    return sequence_match({'children':pattern}, tree_root, tree)
}
//text match handler
function text_match(pattern, tree_root, tree_children){
    if (!('value' in pattern) && !('name' in pattern)){
        //empty text node found
        return true
    }
    if (tree_root.tag === pattern['value' in pattern ? 'value' : 'name']){
        //match found at the root - continue match with pattern and tree children
        return find_match(pattern.children, null, tree_root.children)
    } 
    //no match - check the tree's children
    return (tree_children === null ? tree_root.children : tree_children).some(x => find_match(pattern, x))

}
//sequence match handler
function sequence_match(pattern, tree_root, tree_children){
    var tree = tree_children === null ? tree_root.children : tree_children;
    //copy the pattern and tree objects, as "p" is mutated as part of the match process
    var p = JSON.parse(JSON.stringify(pattern))
    var t = JSON.parse(JSON.stringify(tree))
    while (p.children.length){
        if (!t.length){
            return false
        }
        var f = false;
        var f_seq = p.children.shift();
        for (var _n_t of t){
            //compare sub-pattern and sub-tree
            if (find_match(f_seq, _n_t, t)){
                f = true
                break;
            }
        }
        //no match found for sub-pattern in sequence, return false
        if (!f){
            return false
        }
    }
    //matches found for all sub-patterns in the sequence, return true
    return true
}

tree = [{'tag': 'foo', 'children': [{'tag': 'bar', 'children': []}]}, {'tag': 'foo', 'children': [{'tag': 'bar', 'children': []}]}, {'tag': 'foo', 'children': [{'tag': 'bar', 'children': []}]}]
pattern = {'type': 'text', 'name': 'thing', 'children': [{'type': 'text', 'children': []}, {'type': 'sequence', 'children': [{'type': 'first', 'children': [{'type': 'class', 'name': 'value', 'children': []}, {'type': 'class', 'name': 'function', 'children': []}]}]}]}
console.log(find_match(pattern, {'tag':null, 'children':tree}))
//prints "true"
Sign up to request clarification or add additional context in comments.

Comments

1

The easiest way - just to convert your 'pattern tree' to regexp, and then check text representation of your 'actual tree' with that regexp.

Regarding the Recursive descent. Recursive descent itself is enough to perform the grammar check, but not very effective, because sometimes you need to check pattern from beginning multiple times. To make single-pass grammar checker you need state machine as well, and that is what Regexp has under the hood.

So no need to reinvent the wheel, just specify your 'pattern' as regexp (either convert your representation of the pattern to regexp)

Your

<sequence>
  <text value="foo">
    <text value="bar" />
  </text>
</sequence>

Would be converted to

/<foo><bar\/><\/foo>/gm

And that perfectly matches this

<foo><bar/></foo><foo><bar/></foo><foo><bar/></foo>
<foo><bar/></foo>
<foo><bar/></foo><foo><bar/></foo>

I assume algorithm of the conversion pattern -> regexp is straightforward

4 Comments

How do you convert it to a state machine? Is there any resources on that? As I am using a custom language, we don't yet have regular expressions implemented, do I just lookup how to implement a regexp under the hood?
here is good comparison between gradient descent (aka backtracking) and state machine (finite automat) swtch.com/~rsc/regexp/regexp1.html. and description of idea how regexps based on state machine work
Do regular expressions support nesting of sequences though? I need nested sequences as well.
of course they do. (ab+c)+ will match things like abbbcabbbbbcabc

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.