Since this is a regex question, it may be a duplicate.

Consider the following strings:

test_str = [
    "bla bla google.com bla bla", #0
    "bla bla www.google.com bla bla", #1
    "bla bla api.google.com bla bla", #2
    "google.com", #3
    "www.google.com", #4
    "api.google.com", #5
    "http://google.com", #6
    "http://www.google.com", #7
    "http://api.google.com", #8
    "bla bla http://www.google.com bla bla", #9
    "bla bla https://www.api.google.com bla bla" #10
]

I want to match google.* or www.google.* but not api.google.*. That means that, in the case above, strings 2, 5, 8 and 10 should not return any match.


I have tried several regexes, but I cannot find a one-line regex for this task. Here is what I tried:

re.compile(r"((http[s]?://)?www\.google[a-z.]*)") # match 1,4,7,9
re.compile(r"((http[s]?://)?google[a-z.]*)") # match all
re.compile(r"((http[s]?://)?.+\.google[a-z.]*)") # match except 0,3,6
re.compile(r"((http[s]?://)?!.+\.google[a-z.]*)") # match nothing

Here, I am looking for a way to ignore *.google.* except for www.google.* and google.*. But I got stuck while trying to find a way to match *.google.*.


PS: I have found an O(n**2) way to solve this with split():

r = re.compile(r"^((http[s]?://)?www\.google[a-z.]*)|^((http[s]?://)?google[a-z.]*)")

for s in test_str:
    for seg in s.split():
        r.findall(seg)
  • See ideone.com/3Cwfiu Commented Oct 2, 2017 at 6:46
  • @WiktorStribiżew Thanks. I have an additional question about your answer. Suppose api is not the only subdomain, and I want to filter out all of them, such as map.google.* and calendar.google.*. Does this mean that I need to add them one by one? Commented Oct 2, 2017 at 6:55
  • You may either use the lookbehind approach, and that means you will have to chain the lookbehinds like (?<!\bapi)(?<!\bmap), or you may use a lookahead based approach, like r"(?<!\S)(?!\S*\b(?:map|api))\S*\bgoogle\b\S*" where you may add the blacklisted terms to the alternation group. Commented Oct 2, 2017 at 7:02
  • Thanks for your explanation. Since that is a blacklist approach, I am still curious about a whitelist approach. Commented Oct 2, 2017 at 7:04
  • @WiktorStribiżew Thanks! I believe this is what I want. Your assumptions are correct, http(s) and www are optional here. I need to learn more about lookbehinds. Please post your answer below and I will accept it. Commented Oct 2, 2017 at 7:33
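
The lookahead-based blacklist suggested in the comments above can be sketched roughly as follows. This is only an illustration: the `blacklisted` list and the sample strings are assumptions made here, and the pattern is the one quoted in the comment, with the alternation built from the list.

```python
import re

# Hypothetical blacklist; add more subdomains here to exclude them.
blacklisted = ["api", "map"]

# Pattern from the comment, with the alternation generated from the list.
pattern = re.compile(
    r"(?<!\S)(?!\S*\b(?:{}))\S*\bgoogle\b\S*".format(
        "|".join(map(re.escape, blacklisted))
    )
)

samples = [
    "bla bla www.google.com bla bla",
    "bla bla api.google.com bla bla",
    "http://calendar.google.com",
]

for s in samples:
    m = pattern.search(s)
    print(s, "->", m.group(0) if m else None)
```

Note that the lookahead scans the whole whitespace-delimited token, so a blacklisted term that starts a word anywhere in it (for example in a path such as google.com/api) would also suppress the match. Subdomains that are not listed, like calendar here, still match, which is why each one has to be added to the alternation.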

2 Answers

1

You may use

(?<!\S)(?:https?://)?(?:www\.)?google\.\S*

See the regex demo.

Details

  • (?<!\S) - a position at the start of the string or right after a whitespace character (note that you may also use (?:^|\s) here to be more explicit, although that alternative consumes the whitespace, so it becomes part of the match)
  • (?:https?://)? - an optional non-capturing group matching https:// or http://
  • (?:www\.)? - an optional non-capturing group matching www.
  • google\. - a google. substring
  • \S* - 0+ non-whitespace chars.
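
The list above mentions (?:^|\s) as an alternative to (?<!\S). A small sketch of how the two differ in practice, assuming the same pattern otherwise (the variable names `lookbehind` and `alternation` are chosen here for illustration):

```python
import re

text = "bla bla www.google.com bla bla"

# Zero-width check: the preceding whitespace is not part of the match.
lookbehind = re.compile(r"(?<!\S)(?:https?://)?(?:www\.)?google\.\S*")

# Consuming alternative: a leading whitespace character is included in the match.
alternation = re.compile(r"(?:^|\s)(?:https?://)?(?:www\.)?google\.\S*")

print(repr(lookbehind.search(text).group(0)))   # 'www.google.com'
print(repr(alternation.search(text).group(0)))  # ' www.google.com'

# Both variants reject a token with another subdomain in front.
print(lookbehind.search("api.google.com"))      # None
print(alternation.search("api.google.com"))     # None
```

With the consuming variant, the result would need an extra strip() or a capturing group around the URL part to get the bare URL.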

Python demo:

import re
test_str = [
    "bla bla google.com bla bla", #0
    "bla bla www.google.com bla bla", #1
    "bla bla api.google.com bla bla", #2
    "google.com", #3
    "www.google.com", #4
    "api.google.com", #5
    "http://google.com", #6
    "http://www.google.com", #7
    "http://api.google.com", #8
    "bla bla http://www.google.com bla bla", #9
    "bla bla https://www.api.google.com bla bla", #10
    "bla bla https://www.map.google.com bla bla" #11
]
r = re.compile(r"(?<!\S)(?:https?://)?(?:www\.)?google\.\S*")
for i,s in enumerate(test_str):
    m = r.search(s)
    if m:
        print("{}\t#{}".format(m.group(0), i))

Output:

google.com  #0
www.google.com  #1
google.com  #3
www.google.com  #4
http://google.com   #6
http://www.google.com   #7
http://www.google.com   #9

1

Had my keyboard been working properly I would have answered a half hour before.

Anyway, I would recommend not overcomplicating the regex. You can manage black- (and even white-) lists in the host language and use the re module only as a helper. Below is what I did, packed into a single script. You may need to restructure it if you have to integrate this code into a class or function:

import re

def main():
    input_urls = [ 
        "bla bla google.com bla bla",
        "bla bla www.google.com bla bla",
        # ...
    ]   
    filtered_urls = set()

    google_re = re.compile(r"(\w+\.)?google\.com")
    blacklist = set(["api."])   # I didn't research enough to remove the dot

    for url in input_urls:
        # Beware of the difference between match() and search()
        # See https://docs.python.org/3/library/re.html#search-vs-match
        match = google_re.search(url)

        # The second condition will not be evaluated if the first fails
        if match is not None and match.group(1) not in blacklist:
            filtered_urls.add(url)

    print("Accepted URLs:", *filtered_urls, sep="\n\t", end="\n\n")
    print("Blacklisted URLs:", *(set(input_urls).difference(filtered_urls)), sep="\n\t")


if __name__ == "__main__":
    main()

Unfortunately, with the a and h keys on my keyboard not working, I wasn't able to quickly find a way to remove the dot from the captured subdomain (as in api.google, www.google, calendar.google and so on). I highly recommend doing that.
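
One possible way to drop the dot, sketched under the assumption that the subdomain directly in front of google.com is a single run of word characters: move the dot out of the capturing group and wrap both in an optional non-capturing group. The helper name `is_accepted` is hypothetical and not part of the script above.

```python
import re

# The dot sits outside the capturing group, so group(1) is e.g. "api", not "api."
google_re = re.compile(r"(?:(\w+)\.)?google\.com")
blacklist = {"api"}  # entries no longer need a trailing dot


def is_accepted(url):
    """Return True if url mentions google.com without a blacklisted subdomain."""
    match = google_re.search(url)
    # group(1) is None when there is no subdomain, and None is never blacklisted
    return match is not None and match.group(1) not in blacklist


for url in ["http://google.com", "www.google.com",
            "api.google.com", "bla bla https://www.api.google.com bla bla"]:
    print(url, "->", is_accepted(url))
```

As in the original script, only the label immediately before google.com is checked, so a whitelist could be implemented the same way by testing `match.group(1) in {None, "www"}` instead.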

The output displayed on my console was:

None@vacuum:~$ python3.6 ./filter.py 
Accepted URLs:
    http://google.com
    bla bla google.com bla bla
    bla bla www.google.com bla bla
    http://www.google.com
    google.com
    www.google.com
    bla bla http://www.google.com bla bla

Blacklisted URLs:
    api.google.com
    bla bla api.google.com bla bla
    http://api.google.com
    bla bla https://www.api.google.com bla bla
