I am trying to use RegEx to extract a particular part of some URLs that come in different variations. Here is the generic format:

http://www.blackpages.com/cityName-StateName/mip/part-I-want-to-extract/randomCharacters

Sometimes the "mip" part doesn't exist, and the URL looks like this:

http://www.blackpages.com/cityName-StateName/part-I-want-to-extract/randomCharacters

I started writing the following RE:

re.compile(r"blackpages\.com/.*")

The .* matches any sequence of characters. Now, how do I stop when I encounter a "/" and extract everything that follows, up to the next "/"? This would give me the part I want to extract.

  • Rakesh, any more concerns? Please feel free to drop a line below my answer. Commented Apr 25, 2017 at 6:40

1 Answer

You need to use a negated character class:

re.compile(r"blackpages\.com/([^/]*)")
                              ^^^^^

The [^/]* matches zero or more characters other than /, as many as possible (greedily), so the match stops right before the next / or at the end of the string.

If you expect at least one character after the /, use the + quantifier (one or more occurrences) instead of *.
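
To make the difference between the two quantifiers concrete, here is a minimal sketch that runs both patterns against a URL with nothing after the domain's slash. That URL is a made-up edge case, not one from the question, and the names star and plus are only for illustration:

```python
import re

# Hypothetical edge case: nothing follows the slash after the domain.
url = "http://www.blackpages.com/"

star = re.search(r"blackpages\.com/([^/]*)", url)
plus = re.search(r"blackpages\.com/([^/]+)", url)

# * accepts an empty segment, so the match succeeds with an empty group.
print(repr(star.group(1)))  # ''
# + requires at least one character, so there is no match at all.
print(plus)                 # None
```

With *, you get a match whose group is an empty string; with +, re.search returns None, which is usually easier to test for.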

See the regex demo

Python code:

import re
rx = r"blackpages\.com/([^/]*)"
ss = ["http://www.blackpages.com/cityName-StateName/mip/part-I-want-to-extract/randomCharacters",
"http://www.blackpages.com/cityName-StateName/part-I-want-to-extract/randomCharacters"]
for s in ss:
    m = re.search(rx, s)
    if m:
        print(m.group(1))

Output:

cityName-StateName
cityName-StateName
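
Note that this output is the first path segment, not the part-I-want-to-extract segment from the question. One possible way to reach that segment is to skip the first segment and an optional "mip/" before capturing. This is only a sketch: it assumes the city/state segment always comes first and that "mip" is the only optional segment in between, and the names rx_part, urls and parts are made up for illustration:

```python
import re

# Skip the first path segment, then an optional "mip/", then capture the next segment.
# Assumption: "mip" is the only optional segment that can appear here.
rx_part = re.compile(r"blackpages\.com/[^/]+/(?:mip/)?([^/]+)")

urls = [
    "http://www.blackpages.com/cityName-StateName/mip/part-I-want-to-extract/randomCharacters",
    "http://www.blackpages.com/cityName-StateName/part-I-want-to-extract/randomCharacters",
]

# Collect the captured segment for every URL that matches.
parts = [m.group(1) for m in map(rx_part.search, urls) if m]
print(parts)  # ['part-I-want-to-extract', 'part-I-want-to-extract']
```

The (?:mip/)? group is non-capturing, so group(1) is still the segment of interest whether or not "mip" is present.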

1 Comment

Shouldn't you be using capturing groups with that, to extract only that part?
