import re
import urllib
web = "http://pic.haibao.com/piclist/2271"
page = urllib.urlopen(web)
html = page.read()
pic_pat =r'src=\("http:\/\/.*?.jpg)'
impat = re.compile(pic_pat)
keylist = impat.findall(html)

Part of the HTML I get:

 function getList(screen_index) {
        var boxes = [];
        var screen2 = "<li class=\"piclistli\"><div class=\"pic200\"><a href=\"http:\/\/pic.haibao.com\/pic\/12027963.htm\"><img width=\"310\" height=\"465\" src=\"http:\/\/cdn2.hbimg.cn\/store\/tuku\/310_999\/piccommon\/1218\/12188\/D5259EFE8B9999E8FA968CBD38.jpg\" alt=\"\u200b1\u6708\u7684\u7ebd\u7ea6\u4f9d\u7136\u51b7\u51bd\uff0c\u4f46\u578b\u4eba\u4eec\u5e76\u6ca1\u6709\u5929\u6c14\u7684\u6076\u52a3\u800c\u968f\u4fbf\u5957\u4ef6\u8863\u670d\u5c31\u51fa\u95e8\u3002\u5373\u4fbf\u662f\u904d\u5730\u79ef\u96ea\uff0c\u8fd8\u662f\u8981\u7a7f\u4e0a\u6709\u578b\u7684\u5927\u8863\u548c\u9774\u5b50\uff1b\u5929\u6c14\u7070\u6697\u65f6\uff0c\u8fd8\u662f\u8981\u7a7f\u4e0a\u9753\u4e3d\u7684\u8272\u5f69\u6210\u4e3a\u8857\u5934\u660e\u4eae\u7684\u98ce\u666f\u3002\u62a5\u53cb\u4eec\u9a6c\u4e0a\u6765\u7ffb\u7ffb\u770b\u5427\uff01\" \/><\/a><\/div>

I want to extract every string like:

http:\/\/cdn2.hbimg.cn\/store\/tuku\/310_999\/piccommon\/1218\/12188\/D5259EFE8B9999E8FA968CBD38.jpg

So I use pic_pat = r'src=\("http:\/\/.*?.jpg)', but the strings I get look like:

src="http://cdn4.hbimg.cn/store/tuku/310_999/piccommon/1219/12191/D52582CA92C7F0F9E6FF938534.jpg"

How can I get the

src=\"http:\/\/cdn2.hbimg.cn\/store\/tuku\/310_999\/piccommon\/1218\/12188\/D5259EFE8B9999E8FA968CBD38.jpg\"

as a string from the HTML?

2 Answers


Try BeautifulSoup4

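from bs4 import BeautifulSoup as bs

# Name the parser explicitly so the result does not depend on which parsers are installed.
html_doc = bs(html, 'html.parser')
# Collect every <img> tag and print its src attribute.
img_list = html_doc.find_all('img')
for image in img_list:
    print image.get('src')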
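
One possible reason a parser can miss the strings the question asks for: in the excerpt shown, the markup sits inside a JavaScript string literal (var screen2 = "..."), so an HTML parser sees it as script text rather than as img tags. A minimal sketch of matching the escaped form directly with a regular expression is given here. The sample string is abbreviated from the question's excerpt, and the names escaped_src, raw_urls, and clean_urls are illustrative, not from the original post.

```python
import re

# Abbreviated sample from the question: markup embedded in a JavaScript
# string literal, so quotes and slashes are backslash-escaped.
sample = r'var screen2 = "<img width=\"310\" src=\"http:\/\/cdn2.hbimg.cn\/store\/tuku\/310_999\/piccommon\/1218\/12188\/D5259EFE8B9999E8FA968CBD38.jpg\" alt=\"x\" \/>";'

# In the regex, \\ matches one literal backslash, so this pattern matches
# src=\"http:\/\/ ... .jpg\" and captures only the escaped URL.
escaped_src = re.compile(r'src=\\"(http:\\/\\/.*?\.jpg)\\"')

# URLs exactly as they appear in the page source (still escaped).
raw_urls = escaped_src.findall(sample)

# The same URLs with the JavaScript escaping of "/" undone.
clean_urls = [u.replace('\\/', '/') for u in raw_urls]
```

Here raw_urls keeps the backslash-escaped form the question asks for, and clean_urls holds the same URLs with \/ turned back into /. On the real page, sample would be replaced by the html string read from the response.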

After change


4 Comments

Hmm... sorry for doubting. img isn't returned by dir() for some reason, but I see that it does exist.
One interesting thing: if I use html, I still cannot get the string I want: i.imgur.com/W8VjJje.png?1
I have edited my code, is it what you're looking for?
0

Use urllib2 instead to fetch the page, then parse it with lxml and query it with XPath.

import urllib2
from lxml import html

url = "Sample url"

# urlopen() returns a file-like response; read() gives the page source as a string.
html_code = urllib2.urlopen(url).read()
# Parse the source into an element tree that XPath can be applied to.
parsed_source = html.fromstring(html_code)
# Returns a list of href values; adjust the XPath to match the page's actual markup.
link = parsed_source.xpath("//a/@href")

This is sample code showing the general approach; you will have to write your own XPath to get the data you need.
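
For the image URLs in this question, the XPath might look like the sketch below. It assumes the img tags are present as real markup in the parsed document (not inside a script string), and it runs on an inline sample rather than the live page. The sample markup and the names tree and srcs are illustrative.

```python
from lxml import html

# Inline sample modeled on the markup in the question (unescaped form).
sample = (
    '<html><body><ul>'
    '<li class="piclistli"><div class="pic200">'
    '<a href="http://pic.haibao.com/pic/12027963.htm">'
    '<img width="310" height="465" '
    'src="http://cdn2.hbimg.cn/store/tuku/310_999/piccommon/1218/12188/D5259EFE8B9999E8FA968CBD38.jpg" '
    'alt="" />'
    '</a></div></li>'
    '</ul></body></html>'
)

# Parse the string into an element tree.
tree = html.fromstring(sample)

# Select the src attribute of every <img> nested under a piclistli item.
srcs = tree.xpath('//li[@class="piclistli"]//img/@src')
```

The result srcs is a list of attribute values, one per matching img tag; narrowing the path with the li class keeps unrelated images on the page out of the list.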
