I am trying to use a regex to find these URLs in a list of HTML pages. Each page has its own unique URL, as shown below:

http://sfbay.craigslist.org/search/sfc/apa?
http://sfbay.craigslist.org/search/sfc/apa?s=100
http://sfbay.craigslist.org/search/sfc/apa?s=200
http://sfbay.craigslist.org/search/sfc/apa?s=300

I tried the following regular expression to match the first URL as well as the later ones, which end with an extra s=<number> part that the first one lacks:

import re

# Raw string so the backslashes reach the regex engine unchanged
re_search = r'(http\:\/\/sfbay\.craigslist\.org\/search\/sfc\/apa\?(s\=\d+)?)'
searched_urls = re.findall(re_search, str(search_page_html))
searched_urls
  • search_page_html is the list of HTML pages

It gives this result, but I only want the first element of each tuple:

('http://sfbay.craigslist.org/search/sfc/apa?', ''),
('http://sfbay.craigslist.org/search/sfc/apa?s=100', 's=100'),
('http://sfbay.craigslist.org/search/sfc/apa?s=200', 's=200'),
('http://sfbay.craigslist.org/search/sfc/apa?s=300', 's=300'),

Thanks in advance!

1 Answer

In a regex, a pair of parentheses creates a capturing group. Your pattern has two pairs of parentheses, so re.findall returns a tuple of two captured strings for each match.

(s\=\d+)

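is capturing the '', 's=100', 's=200', and 's=300' values. You can turn it into a non-capturing group by adding ?: right after the opening parenthesis, keeping the trailing ? from your original pattern so that the group stays optional:

(?:s\=\d+)?

With only one capturing group left, re.findall returns a list of plain strings instead of tuples.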
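
As a rough illustration of the difference, here is a minimal sketch comparing the two patterns. The sample_html string is a made-up stand-in for the scraped pages in the question, and the patterns are written without the optional backslashes before : / and =, which should match the same text.

```python
import re

# Hypothetical sample standing in for the scraped HTML from the question.
sample_html = """
<a href="http://sfbay.craigslist.org/search/sfc/apa?">1</a>
<a href="http://sfbay.craigslist.org/search/sfc/apa?s=100">2</a>
<a href="http://sfbay.craigslist.org/search/sfc/apa?s=200">3</a>
"""

# Two capturing groups: findall returns one tuple per match.
two_groups = r'(http://sfbay\.craigslist\.org/search/sfc/apa\?(s=\d+)?)'
tuples = re.findall(two_groups, sample_html)

# Inner group made non-capturing: findall returns plain strings.
one_group = r'(http://sfbay\.craigslist\.org/search/sfc/apa\?(?:s=\d+)?)'
urls = re.findall(one_group, sample_html)

print(tuples)
print(urls)
```

If changing the pattern is not an option, taking the first element of each tuple, for example [t[0] for t in tuples], should give the same list.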