1

I am trying to ask the user for a search term, search Yahoo for it, and print the link of the first result. Here's the code I have so far.

import re, time
from urllib import urlopen

from BeautifulSoup import BeautifulSoup

print "What Would You Like to Search For?"
user_input = raw_input('')  # Gets search term from user

search = "http://search.yahoo.com/search;_ylt=A2KLtaJX_1BQfT4AwX2bvZx4?p=baker&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t-701"
new_search = search.replace('baker', user_input)  # Swap the placeholder term for the user's term
content = urlopen(new_search).read()
soupcontent = BeautifulSoup(content)

link1 = soupcontent.find(id="link-1")  # First result link on the page
print link1

Everything works fine: it takes the user input and searches Yahoo. The problem is the output. Say I search for 'dog'.

The program then prints something like this:

<a id="link-1" class="yschttl spt" href="http://www.dog.com/" data-bk="5101.1"><b>Dog</b> Supplies | <b>Dog</b> Food, <b>Dog</b> Beds, <b>Dog</b> <wbr></wbr>Flea Control & More ...</a>

That is indeed the first link on the page. However, I would like it to print only "http://www.dog.com/". Can anyone help me with this?

Thanks.

2
  • I tried using that; however, I get this error Commented Sep 13, 2012 at 0:54
  • Did you try regular expressions? Commented Oct 6, 2012 at 13:06

3 Answers

1

BeautifulSoup actually makes this very easy:

>>> from bs4 import BeautifulSoup
>>> from urllib2 import urlopen
>>> 
>>> url = 'http://search.yahoo.com/search?p=dog'
>>> content = urlopen(url).read()
>>> soup = BeautifulSoup(content)
>>> 
>>> soup.find(id="link-1")
<a class="yschttl spt" data-bk="5097.1" href="http://www.dog.com/" id="link-1"><b>Dog</b> Supplies | <b>Dog</b> Food, <b>Dog</b> Beds, <b>Dog</b> <wbr></wbr>Flea Control &amp; More ...</a>
>>> soup.find(id="link-1").get("href")
'http://www.dog.com/'

With your request for UTF-8 you'll probably see

 u'http://www.dog.com/'

instead, the Unicode version, which is fine too.

Standard warning: be sure to check that Yahoo!'s end-user license permits whatever you want to do, because many licenses rule out certain automated uses.
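
One detail the snippets above leave open is what happens when the search term contains spaces or characters such as &. A minimal sketch, assuming the p query parameter from the URL above is the only one needed; build_search_url and BASE_URL are hypothetical names, not part of the original code. It is written for Python 3, with a fallback import for the Python 2 used in this thread:

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2, as used in the question

# Endpoint taken from the URL in the answer above.
BASE_URL = "http://search.yahoo.com/search"


def build_search_url(term):
    """Return the search URL with the term percent-encoded as the p parameter."""
    return BASE_URL + "?" + urlencode({"p": term})


print(build_search_url("dog food"))  # -> http://search.yahoo.com/search?p=dog+food
```

The resulting string can be passed to urlopen in place of the search.replace('baker', user_input) approach, which inserts the raw input unencoded.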


1 Comment

Thank you, DSM. I'd been trying to do this with Soup for hours. I tried many variations and none of them worked; however, .get("href") did. Thank you again.
1

Try using a regular expression. See: http://docs.python.org/library/re.html.

match = re.search(r'href="(http://.*?)"', str(link1))
print match.group(1)
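
Two things can go wrong with the snippet above: re.search needs a string rather than a BeautifulSoup tag (hence the str(link1)), and it returns None when nothing matches, so match.group(1) would raise. A hedged sketch of a guarded version, using sample markup based on the question's output; extract_href and HREF_RE are hypothetical names, and the looser pattern also accepts links that do not start with http://:

```python
import re

# Sample markup based on the question's output. In the real script this
# would be str(link1), because re.search() expects a string.
html = '<a id="link-1" class="yschttl spt" href="http://www.dog.com/">Dog Supplies</a>'

# Non-greedy match of everything between href=" and the next double quote.
HREF_RE = re.compile(r'href="(.*?)"')


def extract_href(markup):
    """Return the first href value found in markup, or None if there is none."""
    match = HREF_RE.search(markup)
    return match.group(1) if match else None


print(extract_href(html))                   # -> http://www.dog.com/
print(extract_href("<a>no link here</a>"))  # -> None
```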

4 Comments

He wants the http to be printed, though, so shouldn't it be r'href="(.*?)"' instead?
No, I don't have much experience with programming at all. I tried using that, but I get this error: Traceback (most recent call last): File "scraper.py", line 25, in <module> match = re.search(r'"http://(.*?)"', link1) File "/usr/lib/python2.6/re.py", line 142, in search return _compile(pattern, flags).search(string) TypeError: expected string or buffer
@moretimetocry Pna's answer will also work and is maybe simpler. Using regular expressions can be somewhat tricky.
I like DSM's solution better than mine, so please follow his suggestion.
0

link = your_full_link_string.split('href="')[1].split('"')[0]
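
The one-liner above raises IndexError when the string contains no href=" at all. A minimal sketch of the same idea with str.partition, which reports a missing separator instead of failing; href_from_string is a hypothetical helper name, and the input is assumed to be the tag already converted to a string:

```python
def href_from_string(markup):
    """Return the text between href=" and the next double quote, or None."""
    # partition() returns (before, separator, after); separator is empty if not found.
    _, marker, rest = markup.partition('href="')
    if not marker:
        return None
    value, closing_quote, _ = rest.partition('"')
    return value if closing_quote else None


sample = '<a id="link-1" href="http://www.dog.com/">Dog Supplies</a>'
print(href_from_string(sample))            # -> http://www.dog.com/
print(href_from_string("<a>no href</a>"))  # -> None
```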
