How to match exact word with regex python?

Question

I am trying to match exact words with a regex, but it's not working as I expect. Here's a small code example and the data I'm trying it on. I want to check whether the word c or the word java appears in a string, and return True if it does.

I am using the regex \bc\b|\bjava\b, but it also matches c#, which is not what I'm looking for. It should only match the exact word. How can I achieve this?

import re

def match(x):
    # Raw string, so the backslashes don't need doubling.
    # Note: re.match only tests at the start of the string.
    return re.match(r'\bc\b|\bjava\b', x) is not None

print(df)

0                                  c++ c
1            c# silverlight data-binding
2    c# silverlight data-binding columns
3                               jsp jstl
4                              java jdbc
Name: tags, dtype: object

df.tags.apply(match)

0     True
1     True
2     True
3    False
4     True
Name: tags, dtype: bool

Expected Output:

0     True
1    False
2    False
3    False
4     True
Name: tags, dtype: bool

The question was marked as a duplicate, but the context seems different. @user_12 In case the other question doesn't help: the problem is that \b "matches empty string at word boundary (between \w and \W)", and since # is not \w, \bc\b matches the c in c#.
– kkawabat, Commented Aug 29, 2019 at 0:30
@kkawabat Fair enough, reopened the question. You can post an answer if you like.
– Selcuk, Commented Aug 29, 2019 at 0:31
\b considers alphanumeric characters (and the underscore) to be word characters. Since # is not a word character, it creates a word boundary, which is why c# matches \bc\b.
– Tom Karzes, Commented Aug 29, 2019 at 0:31
@TomKarzes So I should use something like \sc\s|\sjava\s, right? I've tried that but it's returning everything as False. If this is not what you meant, can you post it as an answer below?
– user_12, Commented Aug 29, 2019 at 0:35
Yes, except for one thing: \s requires a whitespace character, so it won't work at the start or the end of the string. You would need to make the whitespace optional there, i.e. also accept the start or end of the string in its place.
– Tom Karzes, Commented Aug 29, 2019 at 1:39
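
The points raised in the comments above can be checked with a minimal sketch. The sample strings are taken from the question; the variable names and the third pattern are only illustrative assumptions, one possible way to "make the whitespace optional at the edges":

```python
import re

# \b matches between a word character (\w) and a non-word character (\W).
# '#' is \W, so there is a boundary right after the 'c' in 'c#'.
word_boundary = re.compile(r'\bc\b|\bjava\b')
print(bool(word_boundary.search('c# silverlight data-binding')))  # True

# Requiring whitespace on both sides fails at the start or end of the string.
strict_space = re.compile(r'\sc\s|\sjava\s')
print(bool(strict_space.search('java jdbc')))  # False

# Accepting start-of-string or end-of-string in place of the whitespace.
space_or_edge = re.compile(r'(?:^|\s)(?:c|java)(?:\s|$)')
print(bool(space_or_edge.search('java jdbc')))                    # True
print(bool(space_or_edge.search('c# silverlight data-binding')))  # False
print(bool(space_or_edge.search('c++ c')))                        # True
```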

blhsing · Accepted Answer · 2019-08-29 00:55:14Z

3

You can use a negative lookbehind and a negative lookahead pattern to ensure that each matching keyword is neither preceded nor followed by a non-space character:

(?<!\S)(?:c|java)(?!\S)

Demo: https://regex101.com/r/GOF8Uo/3
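
As a rough sketch of how the lookaround pattern above could be used from Python, here it is applied to the sample tags from the question. The names PATTERN and tags are assumptions for illustration, and re.search is used rather than re.match so that the whole string is scanned:

```python
import re

# Compile once; (?<!\S) and (?!\S) also succeed at the start and end of the string.
PATTERN = re.compile(r'(?<!\S)(?:c|java)(?!\S)')

def match(x):
    # search() scans the whole string; re.match() would only test the start.
    return PATTERN.search(x) is not None

# Sample rows copied from the question.
tags = [
    'c++ c',
    'c# silverlight data-binding',
    'c# silverlight data-binding columns',
    'jsp jstl',
    'java jdbc',
]
print([match(t) for t in tags])  # [True, False, False, False, True]
```

With the Series from the question, df.tags.apply(match) should then produce the expected output.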

Alternatively, simply split the given string into a list of words and test if any word is in the set of keywords you're looking for:

def match(x):
    return any(w in {'c', 'java'} for w in x.split())

edited Aug 29, 2019 at 0:55

answered Aug 29, 2019 at 0:43

blhsing


4 Comments

user_12 Over a year ago

Thank you. Which method is usually faster, split or regex? I have around a million data points and 40k values in the list to check for.

blhsing Over a year ago

You're welcome. Regex is usually much slower than implementations with proper algorithms. See demo: repl.it/repls/BigPunctualLint

Tom Karzes Over a year ago

If you want to speed it up, compile the regular expression (once), then use the compiled version. It's a good habit to always compile regular expressions, with re.compile. I think Python does some caching, but it's faster and more reliable to make it explicit (plus it makes it easy to reuse them elsewhere).

blhsing Over a year ago

@TomKarzes Good point. I've updated my demo accordingly then.
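
For a large keyword list like the one mentioned in these comments, both approaches might be set up roughly as follows. This is only a sketch under assumptions: the keywords list is a hypothetical stand-in for the asker's 40k values, and the function names are made up for illustration. It builds the set once for the split approach, and escapes and compiles the alternation once for the regex approach:

```python
import re

# Hypothetical keyword list; the asker mentions roughly 40k values.
keywords = ['c', 'java', 'c++']

# Split-based check: build the set once, then test each row for any overlap.
keyword_set = frozenset(keywords)

def match_split(x):
    return not keyword_set.isdisjoint(x.split())

# Regex-based check: escape each keyword (e.g. the '+' in 'c++'),
# compile the pattern once, and reuse the compiled object.
pattern = re.compile(
    r'(?<!\S)(?:' + '|'.join(re.escape(k) for k in keywords) + r')(?!\S)'
)

def match_regex(x):
    return pattern.search(x) is not None

for row in ['c++ c', 'c# silverlight data-binding', 'java jdbc']:
    print(row, match_split(row), match_regex(row))
```

Which variant is faster on real data is best measured, for example with timeit on a representative sample.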

pjaj · Answer · 2019-08-29 01:06:29Z

0

Have you tried using one of the regex test sites? They will analyse your regex patterns and explain exactly what you are actually trying to match. There are many of them.

I am not familiar with the Python match function, but it appears that it parses your input pattern into

\bc\b|\bjava\b

which matches either 'c' or 'java' at a word boundary. Consequently, it will find a 'c' at both ends of "0" and at the beginning of "1" and "2", return "no match" for "3", and match 'java' in "4", which accounts for your results.
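
One detail about the Python function may be worth adding here: re.match only tries the pattern at the beginning of the string, whereas re.search scans the whole string. A small sketch (the test strings are illustrative, not from the question apart from 'c++ c'):

```python
import re

pattern = r'\bc\b|\bjava\b'

# re.match only tries to match at the beginning of the string.
print(re.match(pattern, 'silverlight c') is not None)   # False

# re.search scans the whole string for the first match.
print(re.search(pattern, 'silverlight c') is not None)  # True

# For 'c++ c', re.match succeeds on the leading 'c', because '+' is a
# non-word character and so a word boundary follows the 'c'.
print(re.match(pattern, 'c++ c').group())  # c
```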

answered Aug 29, 2019 at 1:06

pjaj
