How to extract multiple strings by using regex

Question

I am new to Regex. There is data in the format of "(ENTITY A)-[:RELATION {}]->(ENTITY B)", for example, (Canberra)-[:capital_of {}]->(Australia). How can I extract two entities and the relation?

I have tried the following code:

import re

path = "(Canberra)-[:capital_of {}]->(Australia)"
pattern = r'\(.*\)\-\[\:.*\]\-\>\(.*\)'
re.match(pattern,path).group()

But it matches the whole sentence. Any help would be appreciated.

J...S · Accepted Answer · 2019-06-05 06:22:02Z

5

If you don't need to use regex, you could use plain string methods:

s="(Canberra)-[:capital_of {}]->(Australia)"
entityA = s[1:].split(')-')[0]
entityB = s.split('->(')[-1][:-1]

The first line drops the leading '(' with s[1:], splits the string on the ')-' substring, and takes the first piece to obtain the first entity.

The second line splits on the '->(' substring, takes the last piece, and drops the trailing ')' with [:-1] to obtain the second entity.

So,

print(f'EntityA: {entityA}')
print(f'EntityB: {entityB}')

would give

EntityA: Canberra
EntityB: Australia
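
The same splitting idea can be extended to pull out the relation as well. The following is only a sketch: it assumes the input always has exactly the "(A)-[REL]->(B)" shape, and split_path is a hypothetical helper name, not part of the original answer.

```python
def split_path(s):
    """Split '(A)-[REL]->(B)' into (A, REL, B) without regex.

    Assumes the separators ')-[' and ']->(' each occur exactly once.
    """
    left, _, rest = s.partition(")-[")         # '(Canberra' | ':capital_of {}]->(Australia)'
    relation, _, right = rest.partition("]->(")  # ':capital_of {}' | 'Australia)'
    return left[1:], relation, right[:-1]      # strip the outer '(' and ')'


print(split_path("(Canberra)-[:capital_of {}]->(Australia)"))
# ('Canberra', ':capital_of {}', 'Australia')
```

str.partition is used instead of split here because it always returns three parts, which keeps the unpacking simple.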

For simple, fixed formats like this one, non-regex solutions are usually faster.

Edit: Timings as requested in comments.

import re

s = "(Canberra)-[:capital_of {}]->(Australia)"
def regex_soln(s):
    pattern = r'\((.*)\)\-\[(:.*)\]\-\>\((.*)\)'
    rv = re.match(pattern,s).groups()
    return rv[0], rv[-1]

def non_regex_soln(s):
    return s[1:].split(')-')[0], s.split('->(')[-1][:-1]

%timeit regex_soln(s)
1.47 µs ± 60.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


%timeit non_regex_soln(s)
619 ns ± 30.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
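
%timeit is an IPython magic, so the lines above only work in IPython or Jupyter. A rough way to repeat the comparison in a plain Python script is the standard timeit module. This is a sketch under a few assumptions: the pattern is precompiled here (the original calls re.match with a string each time), best_time_ns is a hypothetical helper, and the absolute numbers will differ from machine to machine.

```python
import re
import timeit

s = "(Canberra)-[:capital_of {}]->(Australia)"

# Precompiled once; the original answer passes the pattern string on every call.
PATTERN = re.compile(r'\((.*)\)-\[(:.*)\]->\((.*)\)')


def regex_soln(s):
    groups = PATTERN.match(s).groups()
    return groups[0], groups[-1]


def non_regex_soln(s):
    return s[1:].split(')-')[0], s.split('->(')[-1][:-1]


def best_time_ns(func, number=20_000, repeat=5):
    """Best per-call time in nanoseconds over `repeat` runs of `number` calls."""
    runs = timeit.repeat(lambda: func(s), number=number, repeat=repeat)
    return min(runs) / number * 1e9


for func in (regex_soln, non_regex_soln):
    print(f"{func.__name__}: {best_time_ns(func):.0f} ns per call")
```

The lambda adds a small, equal overhead to both measurements, so the ratio is more meaningful than the absolute values.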

edited Jun 5, 2019 at 6:22

answered Jun 5, 2019 at 5:43

J...S

7 Comments

Devesh Kumar Singh Over a year ago

Could you explain how you came to the conclusion that "Non regex solutions are usually faster"? Did you time the regex and non-regex solutions? It would be great to see a comparison :)

J...S Over a year ago

@DeveshKumarSingh Yeah I timed it. But note that I know very little regex. :-)

Devesh Kumar Singh Over a year ago

Could you add those timing numbers if you can?

Devesh Kumar Singh Over a year ago

They are indeed faster, but only about 2.5x as fast. Also, the extra space for the lists generated and the complexity of the split calls might grow as the expression grows more complex. Anyway, +1 for the timing numbers.

Jan Over a year ago

As a beginner, I'd prefer an answer that I can still understand in three weeks' time over pure speed...

Devesh Kumar Singh · Accepted Answer · 2019-06-05 05:19:51Z

2

You are almost there. You need to define each group you want to capture by enclosing it within ().

The code will look like

import re
path = "(Canberra)-[:capital_of {}]->(Australia)"
pattern = r'\((.*)\)\-\[(:.*)\]\-\>\((.*)\)'
print(re.match(pattern,path).groups())

And the output will be

('Canberra', ':capital_of {}', 'Australia')
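
To see why the original attempt returned the whole string, it may help to compare .group() and .groups() side by side. This is a small illustrative sketch; the variable names no_groups and with_groups are made up for the example.

```python
import re

path = "(Canberra)-[:capital_of {}]->(Australia)"

# Same structure as the question's pattern: escaped literal parentheses only,
# so nothing is captured.
no_groups = re.match(r'\(.*\)-\[:.*\]->\(.*\)', path)

# Unescaped parentheses added around the parts to capture.
with_groups = re.match(r'\((.*)\)-\[(:.*)\]->\((.*)\)', path)

print(no_groups.group())     # group 0 is always the whole matched text
print(no_groups.groups())    # () because the pattern has no capturing groups
print(with_groups.group(1))  # Canberra
print(with_groups.groups())  # ('Canberra', ':capital_of {}', 'Australia')
```

Escaped parentheses such as \( match a literal character, while unescaped ones create a capturing group; .group() with no argument returns group 0, the entire match.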

answered Jun 5, 2019 at 5:19

Devesh Kumar Singh

4 Comments

Sirui Li Over a year ago

Does each pair of brackets represent a group?

Devesh Kumar Singh Over a year ago

Yes, and you want to isolate each group e.g. (.*) is one group and (:.*) is another

Devesh Kumar Singh Over a year ago

Cool! Glad to help @SiruiLi :) If the answer helped you, please consider marking it as accepted by clicking the tick next to the answer. I would also suggest looking at "What should I do when someone answers my question?"

Jan Over a year ago

Please be more precise: you can very well say "not a )" with ([^)]+) instead of the dot-star soup ;-) What I mean is 113 steps (regex101.com/r/TUQsvy/1) vs 18 steps (regex101.com/r/TUQsvy/2).
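
The negated character class suggested in this comment could look roughly like the following. This is a sketch, not the commenter's exact pattern: the group names source, relation and target and the helper parse_path are assumptions made for illustration.

```python
import re

PATH_RE = re.compile(
    r'\((?P<source>[^)]+)\)'      # (ENTITY A): everything up to the first ')'
    r'-\[(?P<relation>[^\]]+)\]'  # -[RELATION]: everything up to the first ']'
    r'->\((?P<target>[^)]+)\)'    # ->(ENTITY B)
)


def parse_path(path):
    """Return (source, relation, target) or raise ValueError on bad input."""
    m = PATH_RE.fullmatch(path)
    if m is None:
        raise ValueError(f"not a valid path: {path!r}")
    return m.group('source'), m.group('relation'), m.group('target')


print(parse_path("(Canberra)-[:capital_of {}]->(Australia)"))
# ('Canberra', ':capital_of {}', 'Australia')
```

Because [^)]+ cannot run past a closing parenthesis, the engine does not have to scan to the end of the string and backtrack the way .* does.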

Jan · Accepted Answer · 2019-06-05 06:53:10Z

This looks like some DSL (a domain-specific language), so you might very well write a small parser for it. Here, we use a PEG parser library called parsimonious.

You'll need a small grammar and a NodeVisitor class for it:

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

path = "(Canberra)-[:capital_of {}]->(Australia)"

class PathVisitor(NodeVisitor):
    grammar = Grammar(
        r"""
        path    = (pair junk?)+
        pair    = lpar notpar rpar

        lpar    = ~"[(\[]+"
        rpar    = ~"[)\]]+"

        notpar  = ~"[^][()]+"
        junk    = ~"[-:>]+"
        """
    )

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_pair(self, node, visited_children):
        _, value, _ = visited_children
        return value.text

    def visit_path(self, node, visited_children):
        return [child[0] for child in visited_children]

pv = PathVisitor()
output = pv.parse(path)
print(output)

Which will yield

['Canberra', ':capital_of {}', 'Australia']
