regex: get string and optional extra string in python

Question

I am using a regex to find the following URLs in a list of HTML pages. Each page contains its own unique URL, as shown below:

http://sfbay.craigslist.org/search/sfc/apa?
http://sfbay.craigslist.org/search/sfc/apa?s=100
http://sfbay.craigslist.org/search/sfc/apa?s=200
http://sfbay.craigslist.org/search/sfc/apa?s=300

I tried the following regular expression to match the first URL as well as the later ones, which carry an extra query string (s=...) that the first one lacks:

import re
# Raw string so that the backslashes reach the regex engine unchanged
re_search = r'(http\:\/\/sfbay\.craigslist\.org\/search\/sfc\/apa\?(s\=\d+)?)'
searched_urls = re.findall(re_search, str(search_page_html))
searched_urls

search_page_html is the list of HTML pages.

This gives the result below, but I only want the first element of each tuple.

('http://sfbay.craigslist.org/search/sfc/apa?', ''),
('http://sfbay.craigslist.org/search/sfc/apa?s=100', 's=100'),
('http://sfbay.craigslist.org/search/sfc/apa?s=200', 's=200'),
('http://sfbay.craigslist.org/search/sfc/apa?s=300', 's=300'),

Thanks in advance!

mareoraft · Accepted Answer · 2015-02-15 00:18:39Z

In a regex, each pair of parentheses forms a capturing group, and re.findall returns a tuple of all the groups when the pattern contains more than one. Your pattern has two pairs of parentheses, so each tuple has two elements.

(s\=\d+)

is the group that captures '', 's=100', 's=200', and 's=300'. You can turn it into a non-capturing group by adding ?: right after the opening parenthesis, like so:

(?:s\=\d+)
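
As a minimal sketch of the effect: once the inner group is non-capturing, only one capturing group remains, so re.findall returns plain strings instead of tuples. The html string here is a made-up stand-in for the scraped pages, and the pattern drops the backslashes that are not strictly needed. Both are assumptions for illustration, not the asker's actual data.

```python
import re

# Hypothetical sample standing in for str(search_page_html)
html = (
    '<a href="http://sfbay.craigslist.org/search/sfc/apa?">first</a>'
    '<a href="http://sfbay.craigslist.org/search/sfc/apa?s=100">next</a>'
    '<a href="http://sfbay.craigslist.org/search/sfc/apa?s=200">next</a>'
)

# Outer group captures the whole URL; the inner (?:...) group only groups,
# so the optional "s=<digits>" part is matched but not returned separately.
one_group = r'(http://sfbay\.craigslist\.org/search/sfc/apa\?(?:s=\d+)?)'
urls = re.findall(one_group, html)
print(urls)

# Alternative: keep both capturing groups and take the first element
# of each tuple that findall returns.
two_groups = r'(http://sfbay\.craigslist\.org/search/sfc/apa\?(s=\d+)?)'
firsts = [full for full, _ in re.findall(two_groups, html)]
print(firsts == urls)
```

Either approach should yield the same list of URLs. The non-capturing version avoids building tuples that are only going to be thrown away.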

