I have the following code that I'd like to optimize:

if re.search(str(stringA), line) and re.search(str(stringB), line):
    .....
    .....

I tried:

stringAB = stringA + '.*' + stringB
if re.search(str(stringAB), line):
    .....
    .....

But the results I get are not reliable. I'm using re.search here because it seems to be the only way I can search for the exact regex pattern specified in stringA and stringB.

The logic behind this code is modeled after this egrep command example:

stringA=Success
stringB=mysqlDB01

egrep "${stringA}" /var/app/mydata | egrep "${stringB}"

If there's a better way to do this without re.search, please let me know.

  • What type of object are stringA and stringB? Presumably they aren't actually strings because you're calling str() on them. Commented Jul 15, 2018 at 9:20
  • They are strings. I'm calling str() to ensure Python treats them as strings. By strings, I mean any pattern that a user may want to search for in a file. Commented Jul 15, 2018 at 9:34
  • If s is already a string then Python already knows it's a string object. str(s) simply returns s. Commented Jul 15, 2018 at 9:37
  • Are you missing hits because stringA does not always come before stringB? (Which that attempt suggests.) By the way: if x and y should already be optimized as much as possible, so perhaps you are attempting premature optimization here. Commented Jul 15, 2018 at 9:40
  • It's not possible to make your solution more efficient. It already does the bare minimum amount of work that's required to get the desired result. (Except for needlessly calling str on stringA and stringB.) Commented Jul 15, 2018 at 10:27
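
The order dependence raised in the comments can be illustrated with a small sketch. The two sample log lines are made up for illustration, and the helper names both_present and combined_match are hypothetical; the patterns are the ones from the question.

```python
import re

stringA = "Success"
stringB = "mysqlDB01"

# Single combined pattern: matches only when stringA appears before stringB
combined = re.compile(stringA + ".*" + stringB)

def both_present(line):
    """Two independent searches, like piping one egrep into another."""
    return bool(re.search(stringA, line) and re.search(stringB, line))

def combined_match(line):
    """One search with the concatenated pattern from the question."""
    return bool(combined.search(line))

# Hypothetical sample lines, not taken from a real log
lines = [
    "Success: connected to mysqlDB01",
    "mysqlDB01 backup finished: Success",
]

results = [(both_present(line), combined_match(line)) for line in lines]
for line, result in zip(lines, results):
    print(repr(line), result)
```

Both approaches agree on the first line, but only the two independent searches accept the second one, where stringB comes first.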

1 Answer

One way to do this is to make a pattern that matches either word (using \b so we only match complete words), use re.findall to check the string for all matches, and then use set equality to ensure that both words have been matched.

import re

stringA = "spam"
stringB = "egg"

words = {stringA, stringB}

# Make a pattern that matches either word
pat = re.compile(r"\b{}\b|\b{}\b".format(stringA, stringB))

data = [
    "this string has spam in it",
    "this string has egg in it",
    "this string has egg in it and another egg too",
    "this string has both egg and spam in it",
    "the word spams shouldn't match",
    "and eggs shouldn't match, either",
]

for s in data:
    found = pat.findall(s)
    print(repr(s), found, set(found) == words)   

output

'this string has spam in it' ['spam'] False
'this string has egg in it' ['egg'] False
'this string has egg in it and another egg too' ['egg', 'egg'] False
'this string has both egg and spam in it' ['egg', 'spam'] True
"the word spams shouldn't match" [] False
"and eggs shouldn't match, either" [] False

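A slightly more efficient way to do set(found) == words is to use words.issubset(found), since issubset accepts any iterable and so skips the explicit conversion of found. The two tests agree here because found can only contain words from the pattern.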
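
As a minimal sketch of the issubset variant, the check can be wrapped in a small function. The name has_all_words is hypothetical, and the sample strings are reused from the example above.

```python
import re

words = {"spam", "egg"}

# Pattern that matches either complete word
pat = re.compile(r"\bspam\b|\begg\b")

def has_all_words(s):
    """Return True if every word in `words` occurs in s as a complete word."""
    # issubset accepts any iterable, so the list from findall is passed as-is
    return words.issubset(pat.findall(s))

samples = [
    "this string has both egg and spam in it",
    "this string has egg in it and another egg too",
    "the word spams shouldn't match",
]

for s in samples:
    print(repr(s), has_all_words(s))
```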


As Jon Clements mentions in a comment, we can simplify and generalize the pattern to handle any number of words, and we should use re.escape, just in case any of the words contain regex metacharacters.

pat = re.compile(r"\b({})\b".format("|".join(re.escape(word) for word in words)))

Thanks, Jon!
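
Putting the generalized pattern and the subset test together might look like the sketch below. The factory name make_checker and the sample inputs are hypothetical; it assumes the words are literal text rather than regexes, which is why re.escape is applied.

```python
import re

def make_checker(words):
    """Build a function that tests whether a string contains every word."""
    words = set(words)
    # Escape each word so regex metacharacters are matched literally
    alternation = "|".join(re.escape(word) for word in words)
    pat = re.compile(r"\b({})\b".format(alternation))

    def contains_all(s):
        # findall returns the text captured by the single group
        return words.issubset(pat.findall(s))

    return contains_all

check = make_checker(["spam", "egg", "ham"])
print(check("ham, egg and spam"))   # all three words present
print(check("ham and spams"))       # 'spams' is not the word 'spam'

# A word containing a metacharacter: the dot must match a literal dot
check_dotted = make_checker(["a.b", "c"])
print(check_dotted("a.b c"))
print(check_dotted("axb c"))
```

Compiling the pattern once in the factory keeps the per-string cost down to a single findall call.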


Here's a version that matches the words in the specified order. If it finds a match it prints the matching substring, otherwise it prints None.

import re

stringA = "spam"
stringB = "egg"
words = [stringA, stringB]

# Make a pattern that matches all the words, in order
pat = r"\b.*?\b".join([re.escape(word) for word in words])
pat = re.compile(r"\b" + pat + r"\b")

data = [
    "this string has spam and also egg, in the proper order",
    "this string has spam in it",
    "this string has spamegg in it",
    "this string has egg in it",
    "this string has egg in it and another egg too",
    "this string has both egg and spam in it",
    "the word spams shouldn't match",
    "and eggs shouldn't match, either",
]

for s in data:
    found = pat.search(s)
    if found:
        found = found.group()
    print('{!r}: {!r}'.format(s, found))

output

'this string has spam and also egg, in the proper order': 'spam and also egg'
'this string has spam in it': None
'this string has spamegg in it': None
'this string has egg in it': None
'this string has egg in it and another egg too': None
'this string has both egg and spam in it': None
"the word spams shouldn't match": None
"and eggs shouldn't match, either": None
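
The ordered pattern above can be compiled once and reused across many lines, for instance when iterating over a log file. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the log contents are invented, and io.StringIO stands in for a real file opened with open().

```python
import io
import re

def make_ordered_pattern(words):
    """Compile a pattern matching all the words, as whole words, in order."""
    body = r"\b.*?\b".join(re.escape(word) for word in words)
    return re.compile(r"\b" + body + r"\b")

def matching_lines(lines, words):
    """Yield each line (without its newline) that contains the words in order."""
    pat = make_ordered_pattern(words)  # compiled once, reused for every line
    for line in lines:
        if pat.search(line):
            yield line.rstrip("\n")

# Invented log contents; a real script would iterate over an open file instead
log = io.StringIO(
    "Success: connected to mysqlDB01\n"
    "mysqlDB01 backup finished: Success\n"
    "Failure: connected to mysqlDB01\n"
)

hits = list(matching_lines(log, ["Success", "mysqlDB01"]))
print(hits)

# Reversing the word list reverses the required order
log.seek(0)
hits_reversed = list(matching_lines(log, ["mysqlDB01", "Success"]))
print(hits_reversed)
```

Because the generator reads one line at a time, memory use stays flat regardless of the size of the file.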

12 Comments

Might be worthwhile generalising pat so it's something like: r'\b{}\b'.format('|'.join(re.escape(word) for word in words)) ?
Although it doesn't matter here - you could possibly make use of .finditer to avoid creation of a list... eg: words.issubset(m.group() for m in pat.finditer(s))
@JonClements Good thinking! I didn't use re.escape originally, since I figured that the strings might already be regexes, but I guess it is a Good Idea. But I won't bother with .finditer, since there's probably not much benefit if the OP is searching single lines of text.
I'm actually using with open(logfile) as f to iterate through a huge log file and search for the two patterns on each line that is read. The strings must appear in the order specified: stringA, then stringB. Although I can imagine a scenario where a user would want it reversed. So I wonder if .finditer could help speed up the process of reading a huge log file and checking each line for the two patterns?
@RoyMWell Please see the updated version at the end of my answer. Since you need to search line by line .finditer isn't much benefit here: it's useful when each string to be searched is many kilobytes and contains lots of matches.