python search different strings on same line

Question

I have the following code that I'd like to optimize:

if re.search(str(stringA), line) and re.search(str(stringB), line):
    .....
    .....

I tried:

stringAB = stringA + '.*' + stringB
if re.search(str(stringAB), line):
    .....
    .....

But the results I get are not reliable. I'm using re.search here because it seems to be the only way I can search for the exact regex pattern specified in stringA and stringB.

The logic behind this code is modeled after this egrep command example:

stringA=Success
stringB=mysqlDB01

egrep "${stringA}" /var/app/mydata | egrep "${stringB}"

If there's a better way to do this without re.search, please let me know.

What type of object are stringA and stringB? Presumably they aren't actually strings because you're calling str() on them.
– PM 2Ring, Commented Jul 15, 2018 at 9:20
They are strings. I'm calling str() to ensure Python treats them as strings. And by strings, I mean any pattern that a user may want to search for in a file.
– RoyMWell, Commented Jul 15, 2018 at 9:34
If s is already a string then Python already knows it's a string object. str(s) simply returns s.
– PM 2Ring, Commented Jul 15, 2018 at 9:37
Are you missing hits because stringA does not always come before stringB? (Your attempt suggests so.) By the way, if x and y should already be optimized as much as possible, so perhaps you are attempting premature optimization here.
– Jongware, Commented Jul 15, 2018 at 9:40
It's not possible to make your solution more efficient. It already does the bare minimum amount of work that's required to get the desired result. (Except for needlessly calling str on stringA and stringB.)
– Aran-Fey, Commented Jul 15, 2018 at 10:27
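
To illustrate the ordering point raised in the comments, here is a minimal sketch. The helper names both_present and in_order and the sample lines are made up for illustration; only the two search strategies come from the question. It suggests why the concatenated pattern can miss lines that the two-step egrep pipeline would keep.

```python
import re

def both_present(pattern_a, pattern_b, line):
    """Mimic `egrep A | egrep B`: both patterns match somewhere, in any order."""
    return bool(re.search(pattern_a, line) and re.search(pattern_b, line))

def in_order(pattern_a, pattern_b, line):
    """The concatenated form: matches only when A occurs before B."""
    return bool(re.search(pattern_a + ".*" + pattern_b, line))

# Hypothetical log lines, invented for this example
lines = [
    "Success: backup of mysqlDB01 finished",
    "mysqlDB01 backup: Success",
    "Success: backup of mysqlDB02 finished",
]

for line in lines:
    print(both_present("Success", "mysqlDB01", line),
          in_order("Success", "mysqlDB01", line),
          repr(line))
```

The second line contains both strings but in the opposite order, so both_present reports True while in_order reports False.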

PM 2Ring · Accepted Answer · last edited by Aran-Fey 2018-07-15 10:54:44Z

One way to do this is to make a pattern that matches either word (using \b so we only match complete words), use re.findall to check the string for all matches, and then use set equality to ensure that both words have been matched.

import re

stringA = "spam"
stringB = "egg"

words = {stringA, stringB}

# Make a pattern that matches either word
pat = re.compile(r"\b{}\b|\b{}\b".format(stringA, stringB))

data = [
    "this string has spam in it",
    "this string has egg in it",
    "this string has egg in it and another egg too",
    "this string has both egg and spam in it",
    "the word spams shouldn't match",
    "and eggs shouldn't match, either",
]

for s in data:
    found = pat.findall(s)
    print(repr(s), found, set(found) == words)

output

'this string has spam in it' ['spam'] False
'this string has egg in it' ['egg'] False
'this string has egg in it and another egg too' ['egg', 'egg'] False
'this string has both egg and spam in it' ['egg', 'spam'] True
"the word spams shouldn't match" [] False
"and eggs shouldn't match, either" [] False

A slightly more efficient way to do set(found) == words is to use words.issubset(found), since it skips the explicit conversion of found. The two tests are equivalent here because the pattern can only match the words in the set.
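
As a rough sketch of the issubset variant (the function name has_all_words is hypothetical, and the pattern is hard-coded for the two example words):

```python
import re

words = {"spam", "egg"}
pat = re.compile(r"\b(?:spam|egg)\b")

def has_all_words(s):
    # issubset accepts any iterable, so the list returned by findall
    # can be passed directly without calling set() on it ourselves.
    return words.issubset(pat.findall(s))

print(has_all_words("this string has both egg and spam in it"))
print(has_all_words("this string has egg in it and another egg too"))
```

The first call prints True and the second prints False, matching the set(found) == words results shown above.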

As Jon Clements mentions in a comment, we can simplify and generalize the pattern to handle any number of words, and we should use re.escape, just in case any of the words contain regex metacharacters.

pat = re.compile(r"\b({})\b".format("|".join(re.escape(word) for word in words)))

Thanks, Jon!
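
Here is one way the generalized pattern might be wrapped up for any number of words. The helper names make_any_word_pattern and contains_all and the sample strings are assumptions for illustration. The word "v1.5" shows why re.escape matters: unescaped, the dot would also match "v1x5".

```python
import re

def make_any_word_pattern(words):
    """Build a pattern that matches any one of the words as a whole word."""
    escaped = "|".join(re.escape(word) for word in words)
    return re.compile(r"\b({})\b".format(escaped))

def contains_all(words, s):
    """True if every word occurs somewhere in s, in any order."""
    pat = make_any_word_pattern(words)
    return set(words).issubset(pat.findall(s))

# Hypothetical search terms; "v1.5" contains a regex metacharacter
words = ["Success", "mysqlDB01", "v1.5"]

print(contains_all(words, "Success: mysqlDB01 upgraded to v1.5"))
print(contains_all(words, "Success: mysqlDB01 upgraded to v1x5"))
```

This prints True, then False. One caveat: \b only behaves as expected when each word starts and ends with a word character (letter, digit, or underscore), so a term like "c++" would need different boundary handling.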

Here's a version that matches the words in the specified order. If it finds a match it prints the matching substring, otherwise it prints None.

import re

stringA = "spam"
stringB = "egg"
words = [stringA, stringB]

# Make a pattern that matches all the words, in order
pat = r"\b.*?\b".join([re.escape(word) for word in words])
pat = re.compile(r"\b" + pat + r"\b")

data = [
    "this string has spam and also egg, in the proper order",
    "this string has spam in it",
    "this string has spamegg in it",
    "this string has egg in it",
    "this string has egg in it and another egg too",
    "this string has both egg and spam in it",
    "the word spams shouldn't match",
    "and eggs shouldn't match, either",
]

for s in data:
    found = pat.search(s)
    if found:
        found = found.group()
    print('{!r}: {!r}'.format(s, found))

output

'this string has spam and also egg, in the proper order': 'spam and also egg'
'this string has spam in it': None
'this string has spamegg in it': None
'this string has egg in it': None
'this string has egg in it and another egg too': None
'this string has both egg and spam in it': None
"the word spams shouldn't match": None
"and eggs shouldn't match, either": None

edited Jul 15, 2018 at 10:54

Aran-Fey

answered Jul 15, 2018 at 9:43

PM 2Ring

12 Comments

Jon Clements Over a year ago

Might be worthwhile generalising pat so it's something like: r'\b{}\b'.format('|'.join(re.escape(word) for word in words)) ?

Jon Clements Over a year ago

Although it doesn't matter here - you could possibly make use of .finditer to avoid creation of a list... eg: words.issubset(m.group() for m in pat.finditer(s))

PM 2Ring Over a year ago

@JonClements Good thinking! I didn't use re.escape originally, since I figured that the strings might already be regexes, but I guess it is a Good Idea. But I won't bother with .finditer, since there's probably not much benefit if the OP is searching single lines of text.

RoyMWell Over a year ago

I'm actually using with open(logfile) as f to iterate through a huge log file and search for the two patterns on each line that is read. The strings must appear in the order specified: stringA, then stringB. Although, I can imagine a scenario where a user would want it reversed. So I wonder if .finditer could help speed up reading a huge log file and checking each line for the two patterns?

PM 2Ring Over a year ago

@RoyMWell Please see the updated version at the end of my answer. Since you need to search line by line .finditer isn't much benefit here: it's useful when each string to be searched is many kilobytes and contains lots of matches.
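
For the line-by-line log scan discussed in these comments, a minimal sketch might look like the following. The function names and the sample log lines are hypothetical; the pattern construction follows the ordered version in the answer. The pattern is compiled once and reused for every line, and reversing the word list gives the reversed order.

```python
import re

def make_ordered_pattern(words):
    """Match all the words as whole words, in the given order."""
    body = r"\b.*?\b".join(re.escape(word) for word in words)
    return re.compile(r"\b" + body + r"\b")

def matching_lines(lines, words):
    """Yield each line that contains every word, in order."""
    pat = make_ordered_pattern(words)  # compiled once, reused per line
    for line in lines:
        if pat.search(line):
            yield line

# Stand-in for an open file object; any iterable of lines works
sample_log = [
    "Success: backup of mysqlDB01 finished\n",
    "mysqlDB01 backup: Success\n",
    "Failure: backup of mysqlDB01 aborted\n",
]

hits = list(matching_lines(sample_log, ["Success", "mysqlDB01"]))
reversed_hits = list(matching_lines(sample_log, ["mysqlDB01", "Success"]))
print(hits)
print(reversed_hits)
```

With a real file, the same generator could be fed the file object directly, e.g. inside with open(logfile) as f: iterate over matching_lines(f, [stringA, stringB]), which reads one line at a time rather than loading the whole file.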
