
How do I get base.php?id=5314 from the list?

import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.fansubs.ru/search.php'
# The Content-Type belongs in the request headers, not in the POST body.
values = {'query': 'Boku dake ga Inai Machi'}
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
d = {}

data = urllib.parse.urlencode(values).encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    the_page = response.read()

soup = BeautifulSoup(the_page, 'html.parser')
# Collect every <a> tag's href.
for link in soup.find_all('a'):
    d[link] = link.get('href')
x = list(d.values())
2 Comments

  • What's your problem exactly? Commented Mar 6, 2016 at 13:01
  • It is my understanding that he is looking at all <a> tags on the page and wants to filter for specific href values (stored as a list in x). Commented Mar 6, 2016 at 13:09

2 Answers


You can use the built-in function filter in combination with a regex. For example:

import re

# ... your code here ...

x = list(d.values())
test = re.compile(r"base\.php\?id=", re.IGNORECASE)
results = filter(test.search, x)

Update based on comment: You can convert the filter results into a list:

print(list(results))

Example results with the following hard-coded list:

x = ["asd/asd/asd.py", "asd/asd/base.php?id=5314",
     "something/else/here/base.php?id=666"]

You get:

['asd/asd/base.php?id=5314', 'something/else/here/base.php?id=666']

This answer is based on this page, which talks about filtering lists. It has a few more implementations of the same approach that might suit you better. Hope it helps.


1 Comment

If he is simply looking for an exact match, using a regex is overkill. Just use: filter(lambda y: 'base.php?id=' in y.lower(), x). Moreover, when using regexes to perform exact matches you should use re.escape to escape the contents instead of doing it yourself, so re.compile(re.escape('base.php?id='), re.IGNORECASE) etc. This is much more important with user-provided inputs.
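
For reference, here is a minimal runnable sketch of both variants suggested in this comment, using the hard-coded example list from the answer above (the list values are only illustrative):

import re

x = ["asd/asd/asd.py", "asd/asd/base.php?id=5314",
     "something/else/here/base.php?id=666"]

# Plain substring check -- no regex needed for a literal match.
results = list(filter(lambda y: 'base.php?id=' in y.lower(), x))
print(results)  # ['asd/asd/base.php?id=5314', 'something/else/here/base.php?id=666']

# Equivalent regex version, with the literal escaped via re.escape.
pattern = re.compile(re.escape('base.php?id='), re.IGNORECASE)
print([item for item in x if pattern.search(item)])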

You can pass a regex directly to find_all, which will do the filtering for you based on the href, using href=re.compile(...):

import re

with urllib.request.urlopen(req) as response:
    the_page = response.read()
    soup = BeautifulSoup(the_page, 'html.parser')
    d = {link: link["href"] for link in soup.find_all('a', href=re.compile(re.escape('base.php?id=')))}

find_all will only return the <a> tags whose href attribute matches the regex.

which gives you:

In [21]: d = {link: link["href"] for link in soup.find_all('a', href=re.compile(re.escape('base.php?id=')))}

In [22]: d
Out[22]: {<a href="base.php?id=5314">Boku dake ga Inai Machi <small>(ТВ)</small></a>: 'base.php?id=5314'}

Considering you only seem to be looking for one link, it would make more sense just to use find:

In [36]: link = soup.find('a', href=re.compile(re.escape('base.php?id=')))

In [37]: link
Out[37]: <a href="base.php?id=5314">Boku dake ga Inai Machi <small>(ТВ)</small></a>

In [38]: link["href"]
Out[38]: 'base.php?id=5314'

