Regex matching on full matched substring with constraints in Python

Question

Since this is a regex question, it may well be a duplicate.

Consider the following strings:

test_str = [
    "bla bla google.com bla bla", #0
    "bla bla www.google.com bla bla", #1
    "bla bla api.google.com bla bla", #2
    "google.com", #3
    "www.google.com", #4
    "api.google.com", #5
    "http://google.com", #6
    "http://www.google.com", #7
    "http://api.google.com", #8
    "bla bla http://www.google.com bla bla", #9
    "bla bla https://www.api.google.com bla bla" #10
]

I want to match google.* or www.google.*, but not api.google.*. In the list above, that means strings 2, 5, 8 and 10 should not produce any match.

I have tried several regexes, but I cannot find a single-line pattern that does this. Here is what I tried:

re.compile(r"((http[s]?://)?www\.google[a-z.]*)") # match 1,4,7,9
re.compile(r"((http[s]?://)?google[a-z.]*)") # match all
re.compile(r"((http[s]?://)?.+\.google[a-z.]*)") # match except 0,3,6
re.compile(r"((http[s]?://)?!.+\.google[a-z.]*)") # match nothing

What I am looking for is a way to ignore *.google.* except for www.google.* and google.*, but I got stuck trying to express the general *.google.* case so that I can exclude it.

PS: I have found an O(n**2) way to solve this with split().

# Anchor each whitespace-separated segment at its start
r = re.compile(r"^((http[s]?://)?www\.google[a-z.]*)|^((http[s]?://)?google[a-z.]*)")

for s in test_str:
    for seg in s.split():
        r.findall(seg)

@WiktorStribiżew Thanks. I have a follow-up question about your answer. What if api is not fixed and I want to filter out all such subdomains, for example map.google.* and calendar.google.*? Does this mean I need to add them one by one?
– Kir Chou, Commented Oct 2, 2017 at 6:55
You may either use the lookbehind approach, which means chaining the lookbehinds like (?<!\bapi)(?<!\bmap), or use a lookahead-based approach like r"(?<!\S)(?!\S*\b(?:map|api))\S*\bgoogle\b\S*", where you add the blacklisted terms to the alternation group.
– Wiktor Stribiżew, Commented Oct 2, 2017 at 7:02
Thanks for your explanation. Since that is a blacklist approach, I am still curious about a whitelist approach.
– Kir Chou, Commented Oct 2, 2017 at 7:04
@WiktorStribiżew Thanks! I believe this is what I want. Your assumptions are correct, http(s) and www are optional here. I need to learn more about lookbehinds. Please post your answer below and I will accept it.
– Kir Chou, Commented Oct 2, 2017 at 7:33

Wiktor Stribiżew · Accepted Answer · 2017-10-02 07:38:20Z

You may use

(?<!\S)(?:https?://)?(?:www\.)?google\.\S*

See the regex demo.

Details

(?<!\S) - a position preceded by whitespace or the start of the string (you may also use (?:^|\s) here to be more explicit, but that alternative consumes the whitespace, so it becomes part of the match)
(?:https?://)? - an optional non-capturing group matching https:// or http://
(?:www\.)? - an optional non-capturing group matching www.
google\. - a google. substring
\S* - zero or more non-whitespace chars.

Python demo:

import re
test_str = [
    "bla bla google.com bla bla", #0
    "bla bla www.google.com bla bla", #1
    "bla bla api.google.com bla bla", #2
    "google.com", #3
    "www.google.com", #4
    "api.google.com", #5
    "http://google.com", #6
    "http://www.google.com", #7
    "http://api.google.com", #8
    "bla bla http://www.google.com bla bla", #9
    "bla bla https://www.api.google.com bla bla", #10
    "bla bla https://www.map.google.com bla bla" #11
]
r = re.compile(r"(?<!\S)(?:https?://)?(?:www\.)?google\.\S*")
for i,s in enumerate(test_str):
    m = r.search(s)
    if m:
        print("{}\t#{}".format(m.group(0), i))

Output:

google.com  #0
www.google.com  #1
google.com  #3
www.google.com  #4
http://google.com   #6
http://www.google.com   #7
http://www.google.com   #9
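
For the follow-up in the comments, where more subdomains than api have to be excluded, the lookahead pattern quoted there could be built from a list of terms. This is only a minimal sketch and not part of the original answer: BLACKLIST and find_google are hypothetical names, and the \b check rejects a blacklisted word anywhere in the whitespace-delimited token, including the path.

```python
import re

# Hypothetical blacklist of subdomain labels; extend as needed.
BLACKLIST = ["api", "map", "calendar"]

# Build the alternation from the escaped terms and plug it into the
# negative lookahead, so no term has to be hand-written into the pattern.
blocked = "|".join(re.escape(term) for term in BLACKLIST)
pattern = re.compile(
    r"(?<!\S)(?!\S*\b(?:" + blocked + r"))\S*\bgoogle\b\S*"
)

def find_google(text):
    """Return the first non-blacklisted google token in text, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

if __name__ == "__main__":
    for s in ["bla bla www.google.com bla bla",
              "http://api.google.com",
              "calendar.google.com"]:
        print(repr(s), "->", find_google(s))
```

Unlike the pattern in the answer itself, this variant accepts any subdomain that is not listed, so it is a blacklist in the sense discussed in the comments, whereas the answer's pattern only admits the bare domain and www.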

Acsor · Answer · 2017-10-02 08:54:53Z

Had my keyboard been working properly I would have answered a half hour before.

Anyway, I would recommend not overcomplicating the regex. You can use the host language to manage blacklists (and even whitelists) and use the re module as an auxiliary tool. Below is what I did, packed into a single script. You may need some restructuring if you have to integrate this code into a class or function:

import re

def main():
    input_urls = [ 
        "bla bla google.com bla bla",
        "bla bla www.google.com bla bla",
        # ...
    ]   
    filtered_urls = set()

    # Raw string, with the dot before "com" escaped so it matches literally
    google_re = re.compile(r"(\w+\.)?google\.com")
    blacklist = {"api."}   # I didn't research enough to remove the dot

    for url in input_urls:
        # Beware of the difference between match() and search()
        # See https://docs.python.org/3/library/re.html#search-vs-match
        match = google_re.search(url)

        # The second condition will not be evaluated if the first fails
        if match is not None and match.group(1) not in blacklist:
            filtered_urls.add(url)

    print("Accepted URLs:", *filtered_urls, sep="\n\t", end="\n\n")
    print("Blacklisted URLs:", *(set(input_urls).difference(filtered_urls)), sep="\n\t")


if __name__ == "__main__":
    main()

Unfortunately, with my a and h keys not working, I wasn't able to quickly find a way to drop the dot from the captured subdomain prefix (as in api.google, www.google, calendar.google and so on). I highly recommend doing that.

The output displayed on my console was:

None@vacuum:~$ python3.6 ./filter.py 
Accepted URLs:
    http://google.com
    bla bla google.com bla bla
    bla bla www.google.com bla bla
    http://www.google.com
    google.com
    www.google.com
    bla bla http://www.google.com bla bla

Blacklisted URLs:
    api.google.com
    bla bla api.google.com bla bla
    http://api.google.com
    bla bla https://www.api.google.com bla bla
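
On the dot that this answer leaves in the blacklist entries: one possible way to drop it is to keep the dot inside a non-capturing group and capture only the label. This is a sketch under that assumption, not the author's code; is_accepted is a hypothetical helper name.

```python
import re

# The outer group is non-capturing and optional; only the label itself
# (without the trailing dot) ends up in group 1.
google_re = re.compile(r"(?:(\w+)\.)?google\.com")
blacklist = {"api"}  # plain labels, no trailing dot

def is_accepted(url):
    """True if url mentions google.com without a blacklisted label in front."""
    match = google_re.search(url)
    # group(1) is None when there is no subdomain, which is never blacklisted
    return match is not None and match.group(1) not in blacklist

if __name__ == "__main__":
    for url in ["www.google.com", "http://api.google.com", "google.com"]:
        print(url, "->", is_accepted(url))
```

With the dot out of the capture, further blacklist entries can be written as plain labels such as "map" or "calendar".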
