I am new to regex. My data has the format "(ENTITY A)-[:RELATION {}]->(ENTITY B)", for example, (Canberra)-[:capital_of {}]->(Australia). How can I extract the two entities and the relation?

I have tried the following code:

import re

path = "(Canberra)-[:capital_of {}]->(Australia)"
pattern = r'\(.*\)\-\[\:.*\]\-\>\(.*\)'
re.match(pattern,path).group()

But it returns the whole string instead of the individual parts. Any help would be appreciated.

3 Answers

5

If you don't have to use regex, you could use plain slicing and split():

s="(Canberra)-[:capital_of {}]->(Australia)"
entityA = s[1:].split(')-')[0]
entityB = s.split('->(')[-1][:-1]

For the first entity, s[1:] drops the leading '(', the rest is split on the ')-' sub-string, and the first piece is kept.

For the second entity, the string is split on the '->(' sub-string, the last piece is kept, and [:-1] drops the trailing ')'.

So,

print(f'EntityA: {entityA}')
print(f'EntityB: {entityB}')

would give

EntityA: Canberra
EntityB: Australia
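
The snippet above only pulls out the two entities. The same idea can be stretched to recover the relation as well. This is a minimal sketch, assuming the input always has exactly the shape shown in the question; the helper name split_path is made up for illustration.

```python
def split_path(s):
    """Split '(A)-[:REL {}]->(B)' into (A, ':REL {}', B) using only str methods."""
    # Everything before ')-[' is '(A'; everything after is ':REL {}]->(B)'.
    left, rest = s.split(')-[', 1)
    # Everything before ']->(' is ':REL {}'; everything after is 'B)'.
    relation, right = rest.split(']->(', 1)
    entity_a = left[1:]    # drop the leading '('
    entity_b = right[:-1]  # drop the trailing ')'
    return entity_a, relation, entity_b

print(split_path("(Canberra)-[:capital_of {}]->(Australia)"))
# ('Canberra', ':capital_of {}', 'Australia')
```

If a delimiter is missing, the tuple unpacking raises ValueError, which at least makes malformed input fail loudly.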

For a simple, fixed format like this one, plain string methods are often faster than regex, as the timings below suggest.

Edit: Timings as requested in comments.

import re

s="(Canberra)-[:capital_of {}]->(Australia)"
def regex_soln(s):
    pattern = r'\((.*)\)\-\[(:.*)\]\-\>\((.*)\)'
    rv = re.match(pattern,s).groups()
    return rv[0], rv[-1]

def non_regex_soln(s):
    return s[1:].split(')-')[0], s.split('->(')[-1][:-1]

%timeit regex_soln(s)
1.47 µs ± 60.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


%timeit non_regex_soln(s)
619 ns ± 30.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
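
%timeit only works inside IPython or Jupyter. Outside of them, a comparable measurement can be sketched with the standard-library timeit module. The loop counts below are arbitrary, the pattern is precompiled here (unlike in the functions above), and the absolute numbers will differ from machine to machine.

```python
import re
import timeit

s = "(Canberra)-[:capital_of {}]->(Australia)"
# Assumption: compiled once up front, which differs from the re.match call above.
PATTERN = re.compile(r'\((.*)\)-\[(:.*)\]->\((.*)\)')

def regex_soln(s):
    groups = PATTERN.match(s).groups()
    return groups[0], groups[-1]

def non_regex_soln(s):
    return s[1:].split(')-')[0], s.split('->(')[-1][:-1]

# Both approaches must agree before their speed is compared.
assert regex_soln(s) == non_regex_soln(s) == ("Canberra", "Australia")

N = 20_000  # calls per repeat; arbitrary choice
for fn in (regex_soln, non_regex_soln):
    # Best of 5 repeats, reported per call in nanoseconds.
    best = min(timeit.repeat(lambda: fn(s), number=N, repeat=5))
    print(f"{fn.__name__}: {best / N * 1e9:.0f} ns per call")
```

Taking the minimum of several repeats is the usual way to reduce noise from other processes on the machine.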

7 Comments

Could you explain how you came up with "Non regex solutions are usually faster"? Did you time the regex and non-regex solutions? It would be great to see a comparison :)
@DeveshKumarSingh Yeah, I timed it. But note that I know very little regex. :-)
Could you add those timing numbers if you can?
They are indeed faster, but only about 2.5x as fast. Also, the extra space for the lists that split() generates and the complexity of the split calls might grow as the expression gets more complex. Anyway, +1 for the timing numbers.
As a beginner, I'd prefer an answer which I can still understand in three weeks' time over pure speed...
2

You are almost there. You need to define each group you want to capture by enclosing it within ().

The code will look like

import re
path = "(Canberra)-[:capital_of {}]->(Australia)"
pattern = r'\((.*)\)\-\[(:.*)\]\-\>\((.*)\)'
print(re.match(pattern,path).groups())

And the output will be

('Canberra', ':capital_of {}', 'Australia')
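
A tighter variant is also possible. The greedy .* can run past a closing bracket on longer inputs, so the sketch below uses negated character classes and named groups instead. It assumes the relation name consists only of word characters and that the {} block is always present; the names PATH_RE and parse_path are made up for illustration.

```python
import re

PATH_RE = re.compile(
    r'\((?P<source>[^)]+)\)'      # (ENTITY A): everything up to the first ')'
    r'-\[:(?P<relation>\w+)'      # -[:RELATION, word characters only (assumption)
    r'\s*\{(?P<props>[^}]*)\}\]'  # optional whitespace, then {...} and ']'
    r'->\((?P<target>[^)]+)\)'    # ->(ENTITY B)
)

def parse_path(path):
    """Return (source, relation, target) or raise ValueError on bad input."""
    m = PATH_RE.fullmatch(path)
    if m is None:
        raise ValueError(f"not a valid path: {path!r}")
    return m.group('source'), m.group('relation'), m.group('target')

print(parse_path("(Canberra)-[:capital_of {}]->(Australia)"))
# ('Canberra', 'capital_of', 'Australia')
```

Unlike the pattern above, this returns the bare relation name without the leading colon and the trailing {}.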

4 Comments

Does each bracket present a group?
Yes, and you want to isolate each group e.g. (.*) is one group and (:.*) is another
Cool! Glad to help @SiruiLi :) If the answer helped you, please consider marking it as accepted by clicking the tick next to the answer. I would also suggest looking at "What should I do when someone answers my question?"
Please be more precise: you can very well say "not a )", i.e. ([^)]+), instead of the dot-star soup ;-) What I mean is 113 steps (regex101.com/r/TUQsvy/1) vs 18 steps (regex101.com/r/TUQsvy/2)
1

This looks like a DSL (domain-specific language), so you could also write a small parser for it. Here, we use a PEG parser library called parsimonious.

You'll need a small grammar and a NodeVisitor class for it:

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

path = "(Canberra)-[:capital_of {}]->(Australia)"

class PathVisitor(NodeVisitor):
    grammar = Grammar(
        r"""
        path    = (pair junk?)+
        pair    = lpar notpar rpar

        lpar    = ~"[(\[]+"
        rpar    = ~"[)\]]+"

        notpar  = ~"[^][()]+"
        junk    = ~"[-:>]+"
        """
    )

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_pair(self, node, visited_children):
        _, value, _ = visited_children
        return value.text

    def visit_path(self, node, visited_children):
        return [child[0] for child in visited_children]

pv = PathVisitor()
output = pv.parse(path)
print(output)

Which will yield

['Canberra', ':capital_of {}', 'Australia']
