
I am trying to find multiple word matches in a given text. For example:

text = "oracle sql"
regex = "(oracle\\ sql|sql)"
re.findall(regex,text,re.I)

Actual output

oracle sql

Expected output

oracle sql,sql

Can anyone tell me what the problem with the regex is?

Updated:

@jim It won't work if there are multiple overlapping matches, for example:

re.findall("(?=(spark|spark sql|sql))","spark sql",re.I)

Actual Output

['spark','sql']

Expected Output :

['spark','sql','spark sql']

Note: In the above case, when both individual words match, the combination of the words is not matched.

Updated :

Check link : repl.it/repls/NewFaithfulMath

  • Which version of python are you using? I'm getting findall() got an unexpected keyword argument 'flag' Commented Aug 15, 2018 at 16:43
  • @jim I have removed that flag, check now Commented Aug 15, 2018 at 16:44
  • @jim python-2.7 Commented Aug 15, 2018 at 16:44
  • Possible duplicate of How to find overlapping matches with a regexp? Commented Aug 16, 2018 at 8:08
  • @UnbearableLightness My main question is how to get the overlapping matched words as well, so how can it be a duplicate? Can you give this a try: re.findall("(?=(spark|spark sql|sql))","spark sql",re.I) Commented Aug 16, 2018 at 9:05

1 Answer


You don't need to escape whitespace.

import re
text = "oracle sql"
regex = "(oracle sql|sql)"
print(re.findall(regex, text, re.I))

From the documentation:

Return all non-overlapping matches of pattern in string, as a list of strings.

Here "sql" lies inside the text already consumed by the "oracle sql" match, so it counts as an overlapping match and is not returned.
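To make the overlap visible, here is a small sketch using re.finditer, which exposes the span of each match. The variable names (pattern, spans, sql_start) are only for illustration.

```python
import re

text = "oracle sql"
pattern = "(oracle sql|sql)"

# finditer yields match objects, so the consumed span of each match is visible.
spans = [(m.group(1), m.span()) for m in re.finditer(pattern, text, re.I)]
print(spans)  # [('oracle sql', (0, 10))]

# "sql" starts at index 7, which is inside the span (0, 10) already consumed.
sql_start = text.find("sql")
print(sql_start)  # 7
```

The scan resumes at index 10, after the first match, so there is nothing left in which "sql" could be found.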

Returning overlapping matches

You can use a lookahead to capture the string you're looking for. A lookahead is zero-width, so each match consumes no characters and the captured strings are technically not overlapping matches.

import re
text = "oracle sql"
regex = "(?=(oracle sql|sql))"
print(re.findall(regex, text, re.I))

Output:

['oracle sql', 'sql']


The downside of this implementation is that it finds at most one match per starting position in the string: at each position, the alternation stops at the first alternative that matches.

For example, on the text "my test" the pattern (my test|my|test) will only find ['my test', 'test'].
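The following sketch shows this limit and how the order of the alternatives decides which item is reported at a given position. The variable names are only for illustration.

```python
import re

text = "my test"

# Longest alternative first: position 0 yields "my test", position 3 yields "test".
longest_first = re.findall("(?=(my test|my|test))", text, re.I)
print(longest_first)  # ['my test', 'test']

# Shortest alternative first: position 0 now yields "my", so "my test" is never reported.
shortest_first = re.findall("(?=(my|my test|test))", text, re.I)
print(shortest_first)  # ['my', 'test']
```

This is the same effect as in the question's (spark|spark sql|sql) example, where "spark" wins at position 0 and "spark sql" is never returned.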

You could also use a drop-in replacement for the re module that supports overlapping matches, such as the third-party regex module, but this will still only find ['my test', 'test'] with the pattern (my test|my|test):

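# The third-party regex module is imported under the name re, so the calls look the same.
import regex as re
text = "oracle sql"
regex = "(oracle sql|sql)"  # this variable is the pattern string, not the module
# overlapped=True is specific to the regex module; the standard re module does not accept it.
print(re.findall(regex, text, re.I, overlapped=True))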

Recursion

A regex finds at most one match per starting position. At the first character it has already matched "oracle sql", so it cannot also report "oracle" from that same position. A single pattern therefore cannot return every item.

However, you could use a recursive function that searches the same string again with only the items that have not been matched yet.

I am not sure how well this code performs, as it may run many regex searches.

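import re

def find_all_matches(text, items):
  if not items:
    return []
  # Escape each item so regex metacharacters in it are matched literally.
  regex = "(?=({}))".format('|'.join(re.escape(i) for i in items))
  matches = re.findall(regex, text, re.I)
  if not matches:
    # None of the remaining items occur in the text, so stop recursing.
    return []
  # Compare case-insensitively, because re.I returns the text's own casing.
  found = set(m.lower() for m in matches)
  new_items = [i for i in items if i.lower() not in found]
  # Search again with only the items that were not found in this pass.
  return matches + find_all_matches(text, new_items)

print(find_all_matches("oracle sql", ['oracle sql', 'oracle', 'sql']))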

Output:

['oracle sql', 'sql', 'oracle']
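A non-recursive sketch of the same idea is to run one lookahead search per item, so that no item can hide another. The function name find_each_item is hypothetical, and this makes one pass over the text per item, so it may be slow for long item lists.

```python
import re

def find_each_item(text, items):
    """Return every occurrence of every item, searching one item at a time."""
    found = []
    for item in items:
        # re.escape makes the item match literally; the lookahead also reports
        # occurrences of the same item that overlap each other.
        pattern = "(?=({}))".format(re.escape(item))
        found.extend(re.findall(pattern, text, re.I))
    return found

result = find_each_item("spark sql", ["spark", "spark sql", "sql"])
print(result)  # ['spark', 'spark sql', 'sql']
```

Unlike the recursive version, the results come back grouped by item in the order the items were given.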

No regex

Lastly, you could implement this without regex. Again, I haven't looked at the performance of this.

def find_all_matches(text, items):
  return [i for i in items if i in text]

print(find_all_matches("oracle sql", ['oracle sql', 'oracle', 'sql']))

Output:

['oracle sql', 'oracle', 'sql']
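One difference from the regex versions is that the in operator is case-sensitive, while they use re.I. A minimal sketch of a case-insensitive variant, assuming plain lowercasing is good enough for the text involved (the name find_all_matches_ci is hypothetical):

```python
def find_all_matches_ci(text, items):
    """Case-insensitive variant: compare lowercased copies of the text and items."""
    lowered = text.lower()
    return [i for i in items if i.lower() in lowered]

result = find_all_matches_ci("Oracle SQL", ['oracle sql', 'oracle', 'sql', 'spark'])
print(result)  # ['oracle sql', 'oracle', 'sql']
```

The items are returned in their original spelling, not in the casing used in the text.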

5 Comments

You don't need to escape whitespace.
@jim This won't work if we use: re.findall("(?=(spark|spark sql|sql))","spark sql",re.I)
@jim Can you check this: repl.it/repls/NewFaithfulMath, why is it not working?
@Arpit Regex will only find one match per starting position. It has already found the match at the first character based on "oracle sql". You can't find every single one.
@JimWright Right now I did the same, but I didn't find any suitable regex for this.
