
How do I get base.php?id=5314 from the list?

import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.fansubs.ru/search.php'
# The Content-Type belongs in the request headers, not in the POST body.
values = {'query': 'Boku dake ga Inai Machi'}
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
d = {}

data = urllib.parse.urlencode(values).encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    the_page = response.read()

soup = BeautifulSoup(the_page, 'html.parser')
# Collect every <a> tag's href.
for link in soup.find_all('a'):
    d[link] = link.get('href')
x = list(d.values())
2 Comments

  • What's your problem exactly? Commented Mar 6, 2016 at 13:01
  • It is my understanding that he is looking at all <a> tags on the page and wants to filter for specific href values (stored as a list in x). Commented Mar 6, 2016 at 13:09

2 Answers


You can use the built-in function filter in combination with a regex. For example:

import re

# ... your code here ...

x = list(d.values())
test = re.compile(r"base\.php\?id=", re.IGNORECASE)
results = filter(test.search, x)

Update based on comment: You can convert the filter results into a list:

print(list(results))

Example results with the following hard-coded list:

x = ["asd/asd/asd.py", "asd/asd/base.php?id=5314",
     "something/else/here/base.php?id=666"]

You get:

['asd/asd/base.php?id=5314', 'something/else/here/base.php?id=666']

This answer is based on this page, which talks about filtering lists. It has a few more implementations of the same approach that might suit you better. Hope it helps.


1 Comment

If he is simply looking for an exact match, using a regex is overkill. Just use: filter(lambda y: 'base.php?id=' in y.lower(), x). Moreover, when using regexes to perform exact matches you should use re.escape to escape the contents instead of doing it yourself, so re.compile(re.escape('base.php?id='), re.IGNORECASE) etc. This is much more important with user-provided inputs.
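
For reference, here is a minimal runnable sketch of both variants suggested in this comment, using the hard-coded example list from the answer above (the list values are only illustrative):

import re

x = ["asd/asd/asd.py", "asd/asd/base.php?id=5314",
     "something/else/here/base.php?id=666"]

# Plain substring check -- no regex needed for a literal match.
results = list(filter(lambda y: 'base.php?id=' in y.lower(), x))
print(results)  # ['asd/asd/base.php?id=5314', 'something/else/here/base.php?id=666']

# Equivalent regex version, with the literal escaped via re.escape.
pattern = re.compile(re.escape('base.php?id='), re.IGNORECASE)
print([item for item in x if pattern.search(item)])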

You can pass a regex directly to find_all, which will do the filtering for you based on the href, using href=re.compile(...):

import re

with urllib.request.urlopen(req) as response:
    the_page = response.read()
    soup = BeautifulSoup(the_page, 'html.parser')
    d = {link: link["href"] for link in soup.find_all('a', href=re.compile(re.escape('base.php?id=')))}

find_all will only return the <a> tags whose href attribute matches the regex.

which gives you:

In [21]: d = {link: link["href"] for link in soup.find_all('a', href=re.compile(re.escape('base.php?id=')))}

In [22]: d
Out[22]: {<a href="base.php?id=5314">Boku dake ga Inai Machi <small>(ТВ)</small></a>: 'base.php?id=5314'}

Considering you only seem to be looking for one link, it would make more sense just to use find:

In [36]: link = soup.find('a', href=re.compile(re.escape('base.php?id=')))

In [37]: link
Out[37]: <a href="base.php?id=5314">Boku dake ga Inai Machi <small>(ТВ)</small></a>

In [38]: link["href"]
Out[38]: 'base.php?id=5314'

