
I am trying to match exact words with a regex, but it is not working as I expect. Below is a small example of the code and the data I am running it on. I want to return True if the string contains the word c or the word java.

I am using the regex \\bc\\b|\\bjava\\b, but it also matches c#, which is not what I want. It should match only those exact words. How can I achieve this?

import re

def match(x):
    return re.match('\\bc\\b|\\bjava\\b', x) is not None

print(df)

0                                  c++ c
1            c# silverlight data-binding
2    c# silverlight data-binding columns
3                               jsp jstl
4                              java jdbc
Name: tags, dtype: object

df.tags.apply(match)

0     True
1     True
2     True
3    False
4     True
Name: tags, dtype: bool

Expected Output:

0     True
1    False
2    False
3    False
4     True
Name: tags, dtype: bool
  • The question was marked as a duplicate, but the context seems different. @user_12 In case the other question doesn't help: the problem is that \b "matches empty string at word boundary (between \w and \W)", and since # is not \w, \bc\b matches the c in c#. Commented Aug 29, 2019 at 0:30
  • @kkawabat Fair enough, reopened the question. You can post an answer if you like. Commented Aug 29, 2019 at 0:31
  • \b treats alphanumeric characters (and the underscore) as word characters. Since # is not a word character, it creates a word boundary, which is why c# matches \bc\b. Commented Aug 29, 2019 at 0:31
  • @TomKarzes So I should use something like \sc\s|\sjava\s right? I've tried that but it's returning everything as False. If this is not what you meant can you post it as an answer below? Commented Aug 29, 2019 at 0:35
  • Yes, except for one thing: \s requires a white space character, so it won't work at the start or the end of the string. So you would need to make those matches optional at the start or end of the string. Commented Aug 29, 2019 at 1:39
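The suggestion in the last comment could be sketched as follows. This is only one possible reading of "optional at the start or end": the pattern `(?:^|\s)(?:c|java)(?:\s|$)` and the names `WS_PATTERN` and `match_ws` are assumptions made for illustration, not something posted in the thread. It also uses `re.search`, because `re.match` only tries the pattern at the start of the string.

```python
import re

# Assumed pattern: a keyword must be preceded by the start of the string or
# whitespace, and followed by whitespace or the end of the string.
WS_PATTERN = re.compile(r'(?:^|\s)(?:c|java)(?:\s|$)')

def match_ws(x):
    # search() scans the whole string; match() would only try position 0
    return WS_PATTERN.search(x) is not None

# Sample data copied from the question
tags = [
    'c++ c',
    'c# silverlight data-binding',
    'c# silverlight data-binding columns',
    'jsp jstl',
    'java jdbc',
]
ws_results = [match_ws(t) for t in tags]
print(ws_results)  # [True, False, False, False, True]
```

The plain `\sc\s|\sjava\s` version returns False everywhere on this data because no keyword has whitespace on both sides.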

2 Answers

3

You can use a negative lookbehind and a negative lookahead pattern to ensure that each matching keyword is neither preceded nor followed by a non-space character:

(?<!\S)(?:c|java)(?!\S)

Demo: https://regex101.com/r/GOF8Uo/3
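As a rough sketch of how the lookaround pattern above might be used from Python, here it is applied to the sample tags from the question. The names `KEYWORDS` and `tags` are made up for this illustration, and `re.search` is used instead of `re.match` so that a keyword anywhere in the string is found.

```python
import re

# Lookaround pattern from the answer: the keyword must not touch any
# non-whitespace character on either side.
KEYWORDS = re.compile(r'(?<!\S)(?:c|java)(?!\S)')

def match(x):
    # search() scans the whole string; match() only tries position 0
    return KEYWORDS.search(x) is not None

# Sample data copied from the question
tags = [
    'c++ c',
    'c# silverlight data-binding',
    'c# silverlight data-binding columns',
    'jsp jstl',
    'java jdbc',
]
results = [match(t) for t in tags]
print(results)  # [True, False, False, False, True]
```

With a pandas Series such as the one in the question, the same function can be passed to `df.tags.apply(match)`.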

Alternatively, simply split the given string into a list of words and test if any word is in the set of keywords you're looking for:

def match(x):
    return any(w in {'c', 'java'} for w in x.split())

4 Comments

Thank you. Which method is usually faster, split or regex? I have around a million data points and 40k values in the list to check for.
You're welcome. Regex is usually much slower than implementations with proper algorithms. See demo: repl.it/repls/BigPunctualLint
If you want to speed it up, compile the regular expression (once), then use the compiled version. It's a good habit to always compile regular expressions, with re.compile. I think Python does some caching, but it's faster and more reliable to make it explicit (plus it makes it easy to reuse them elsewhere).
@TomKarzes Good point. I've updated my demo accordingly then.
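For a long keyword list, the two approaches discussed in these comments might look roughly like this. It is a sketch under assumptions: the `keywords` list stands in for the 40k values mentioned above, the helper names are invented here, and no timing is claimed; which variant is faster on real data is best measured, for example with `timeit`.

```python
import re

# Hypothetical keyword list; in the question it would hold about 40k values.
keywords = ['c', 'java']

# Variant 1: set lookup on whitespace-separated words.
keyword_set = frozenset(keywords)

def match_split(x):
    # True if any whitespace-separated word is one of the keywords
    return not keyword_set.isdisjoint(x.split())

# Variant 2: one regex, compiled once and reused for every row.
# re.escape guards against keywords containing regex metacharacters.
alternation = '|'.join(re.escape(k) for k in keywords)
compiled = re.compile(r'(?<!\S)(?:' + alternation + r')(?!\S)')

def match_regex(x):
    return compiled.search(x) is not None

# Sample data copied from the question
tags = [
    'c++ c',
    'c# silverlight data-binding',
    'c# silverlight data-binding columns',
    'jsp jstl',
    'java jdbc',
]
split_results = [match_split(t) for t in tags]
regex_results = [match_regex(t) for t in tags]
print(split_results == regex_results)  # True
```

Both variants treat a keyword as a match only when it is a complete whitespace-delimited word, so they agree on the sample data.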
0

Have you tried using one of the regex test sites, such as this one or this one? They will analyse your regex pattern and explain exactly what it actually matches. There are many others.

I am not familiar with Python's match function, but it appears that your input pattern is interpreted as

\bc\b|\bjava\b

which matches either 'c' or 'java' at a word boundary. Consequently it will find a 'c' at both ends of "0", the beginning of "1" and "2", return "no match" for "3" and match 'java' in "4" which accounts for your results.
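To see where the word-boundary pattern matches, a small sketch like the following can list the match positions for each sample string. The `spans` dictionary and variable names are made up for this illustration; `finditer` is used so that every match is reported, not only one at the start of the string.

```python
import re

# The pattern from the question, written as a raw string
pattern = re.compile(r'\bc\b|\bjava\b')

# Sample data copied from the question
tags = [
    'c++ c',
    'c# silverlight data-binding',
    'c# silverlight data-binding columns',
    'jsp jstl',
    'java jdbc',
]

# Map each string to a list of (start index, matched text) pairs
spans = {t: [(m.start(), m.group()) for m in pattern.finditer(t)] for t in tags}

for text, found in spans.items():
    print(repr(text), found)
# 'c++ c' [(0, 'c'), (4, 'c')]
# 'c# silverlight data-binding' [(0, 'c')]
# 'c# silverlight data-binding columns' [(0, 'c')]
# 'jsp jstl' []
# 'java jdbc' [(0, 'java')]
```

The match at index 0 in the c# rows is the unwanted one: # is not a word character, so there is a word boundary between c and #.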
