Since this is a regex question, it may well be a duplicate.
Consider the following strings:
test_str = [
"bla bla google.com bla bla", #0
"bla bla www.google.com bla bla", #1
"bla bla api.google.com bla bla", #2
"google.com", #3
"www.google.com", #4
"api.google.com", #5
"http://google.com", #6
"http://www.google.com", #7
"http://api.google.com", #8
"bla bla http://www.google.com bla bla", #9
"bla bla https://www.api.google.com bla bla" #10
]
I want to match google.* or www.google.* but not api.google.*. That means that, in the list above, strings 2, 5, 8, and 10 should not return any match.
I have tried several regexes, but I cannot find a one-line regex that does this. Here is what I tried:
re.compile(r"((http[s]?://)?www\.google[a-z.]*)") # matches 1, 4, 7, 9
re.compile(r"((http[s]?://)?google[a-z.]*)") # matches all
re.compile(r"((http[s]?://)?.+\.google[a-z.]*)") # matches all except 0, 3, 6
re.compile(r"((http[s]?://)?!.+\.google[a-z.]*)") # matches nothing
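The match sets noted in the comments above can be double-checked with a small harness. This is only a sketch: `matched_indices` is a hypothetical helper name, the patterns are written here as raw strings, and `re.search` is assumed as the matching call, since the question does not say which one was used.

```python
import re

test_str = [
    "bla bla google.com bla bla",                  # 0
    "bla bla www.google.com bla bla",              # 1
    "bla bla api.google.com bla bla",              # 2
    "google.com",                                  # 3
    "www.google.com",                              # 4
    "api.google.com",                              # 5
    "http://google.com",                           # 6
    "http://www.google.com",                       # 7
    "http://api.google.com",                       # 8
    "bla bla http://www.google.com bla bla",       # 9
    "bla bla https://www.api.google.com bla bla",  # 10
]

# The four attempts from the question.
attempts = [
    r"((http[s]?://)?www\.google[a-z.]*)",
    r"((http[s]?://)?google[a-z.]*)",
    r"((http[s]?://)?.+\.google[a-z.]*)",
    r"((http[s]?://)?!.+\.google[a-z.]*)",
]


def matched_indices(pattern, strings=test_str):
    """Return the indices of the strings in which the pattern is found anywhere."""
    rx = re.compile(pattern)
    return [i for i, s in enumerate(strings) if rx.search(s)]


for p in attempts:
    print(p, "->", matched_indices(p))
```

The last attempt matches nothing because `?!` outside a group is not a lookahead: the `?` makes the preceding group optional and the `!` is then a literal character that none of the strings contain.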
I am looking for a way to ignore *.google.* except for www.google.* and google.*, but I got stuck trying to find a way to pick out *.google.*.
PS: I have found an O(n**2) way to solve this with split().
r = re.compile(r"^((http[s]?://)?www\.google[a-z.]*)|^((http[s]?://)?google[a-z.]*)")
for s in test_str:
    for seg in s.split():  # check each whitespace-separated token on its own
        r.findall(seg)
api is not fixed; I want to filter out all of them, such as map.google.* and calendar.google.*. Does this mean that I need to add them one by one?

With lookbehinds, yes: (?<!\bapi)(?<!\bmap). Or you may use a lookahead-based approach, like r"(?<!\S)(?!\S*\b(?:map|api))\S*\bgoogle\b\S*", where you may add the blacklisted terms to the alternation group.
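As a rough check, the lookahead pattern quoted above can be run against the strings from the question. The sketch below also includes a whitelist variant that is not from the thread: it is an assumption on my part that the only accepted forms are an optional scheme, an optional `www.`, and then `google.` at the start of a whitespace-delimited token. Under that assumption no blacklist is needed. The names `blacklist`, `whitelist`, and `first_matches` are hypothetical.

```python
import re

test_str = [
    "bla bla google.com bla bla",                  # 0
    "bla bla www.google.com bla bla",              # 1
    "bla bla api.google.com bla bla",              # 2
    "google.com",                                  # 3
    "www.google.com",                              # 4
    "api.google.com",                              # 5
    "http://google.com",                           # 6
    "http://www.google.com",                       # 7
    "http://api.google.com",                       # 8
    "bla bla http://www.google.com bla bla",       # 9
    "bla bla https://www.api.google.com bla bla",  # 10
]

# Blacklist approach from the thread: reject any token that contains a word
# starting with one of the listed terms; extend (?:map|api) as needed.
blacklist = re.compile(r"(?<!\S)(?!\S*\b(?:map|api))\S*\bgoogle\b\S*")

# Whitelist approach (an assumption, not from the thread): the token must
# start with an optional scheme, an optional "www.", and then "google.".
whitelist = re.compile(r"(?<!\S)(?:https?://)?(?:www\.)?google\.[a-z.]+")


def first_matches(rx, strings):
    """Return the first match text for each string, or None if there is none."""
    out = []
    for s in strings:
        m = rx.search(s)
        out.append(m.group(0) if m else None)
    return out


for name, rx in (("blacklist", blacklist), ("whitelist", whitelist)):
    print(name, first_matches(rx, test_str))

# An unlisted subdomain shows the difference between the two approaches.
print(bool(blacklist.search("calendar.google.com")))  # still matches
print(bool(whitelist.search("calendar.google.com")))  # rejected
```

Both patterns rely on `(?<!\S)` to anchor at the start of a token, so no `split()` loop is needed. The whitelist variant stops at the first character outside `[a-z.]`, so it would need adjusting if paths or ports should be captured too.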