How to use Python to get a certain string from HTML

Question

import re
import urllib
web = "http://pic.haibao.com/piclist/2271"
page = urllib.urlopen(web)
html = page.read()
pic_pat = r'src=\("http:\/\/.*?.jpg)'
impat = re.compile(pic_pat)
keylist = impat.findall(html)

Part of the HTML I get:

 function getList(screen_index) {
        var boxes = [];
        var screen2 = "<li class=\"piclistli\"><div class=\"pic200\"><a href=\"http:\/\/pic.haibao.com\/pic\/12027963.htm\"><img width=\"310\" height=\"465\" src=\"http:\/\/cdn2.hbimg.cn\/store\/tuku\/310_999\/piccommon\/1218\/12188\/D5259EFE8B9999E8FA968CBD38.jpg\" alt=\"\u200b1\u6708\u7684\u7ebd\u7ea6\u4f9d\u7136\u51b7\u51bd\uff0c\u4f46\u578b\u4eba\u4eec\u5e76\u6ca1\u6709\u5929\u6c14\u7684\u6076\u52a3\u800c\u968f\u4fbf\u5957\u4ef6\u8863\u670d\u5c31\u51fa\u95e8\u3002\u5373\u4fbf\u662f\u904d\u5730\u79ef\u96ea\uff0c\u8fd8\u662f\u8981\u7a7f\u4e0a\u6709\u578b\u7684\u5927\u8863\u548c\u9774\u5b50\uff1b\u5929\u6c14\u7070\u6697\u65f6\uff0c\u8fd8\u662f\u8981\u7a7f\u4e0a\u9753\u4e3d\u7684\u8272\u5f69\u6210\u4e3a\u8857\u5934\u660e\u4eae\u7684\u98ce\u666f\u3002\u62a5\u53cb\u4eec\u9a6c\u4e0a\u6765\u7ffb\u7ffb\u770b\u5427\uff01\" \/><\/a><\/div>

I want to get all strings like:

http:\/\/cdn2.hbimg.cn\/store\/tuku\/310_999\/piccommon\/1218\/12188\/D5259EFE8B9999E8FA968CBD38.jpg

So I use pic_pat = r'src=\("http:\/\/.*?.jpg)', but the strings I get look like:

src="http://cdn4.hbimg.cn/store/tuku/310_999/piccommon/1219/12191/D52582CA92C7F0F9E6FF938534.jpg"

How can I get

src=\"http:\/\/cdn2.hbimg.cn\/store\/tuku\/310_999\/piccommon\/1218\/12188\/D5259EFE8B9999E8FA968CBD38.jpg\"

as a string from the HTML?

frozenfrog · Accepted Answer · 2015-03-05 09:19:30Z

1

Try BeautifulSoup4

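from bs4 import BeautifulSoup as bs
html_doc = bs(html, 'html.parser')   # html is the page source read in the question
img_list = html_doc.find_all('img')  # every <img> element in the parsed document
for image in img_list:
    print(image.get('src'))          # None if the tag has no src attribute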
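If the strings you need are the escaped ones inside the page's JavaScript (the `var screen2 = "..."` line in the question), an HTML parser will usually treat them as plain script text rather than as `img` tags. Here is a minimal sketch that uses `re` on the raw page text instead of BeautifulSoup. The `sample` string is cut down from the question, and the names `ESCAPED_SRC`, `find_escaped_srcs` and `unescape_slashes` are made up for illustration:

```python
import re

# Sample text, shortened from the snippet in the question: inside the
# JavaScript string, quotes and slashes are backslash-escaped.
sample = (r'var screen2 = "<img width=\"310\" height=\"465\" '
          r'src=\"http:\/\/cdn2.hbimg.cn\/store\/tuku\/310_999\/piccommon'
          r'\/1218\/12188\/D5259EFE8B9999E8FA968CBD38.jpg\" \/>";')

# Match a literal backslash-quote after src=, then capture the escaped URL
# up to the first .jpg that is followed by another backslash-quote.
ESCAPED_SRC = re.compile(r'src=\\"(http:\\/\\/.*?\.jpg)\\"')


def find_escaped_srcs(text):
    """Return the escaped URLs exactly as they appear in the source."""
    return ESCAPED_SRC.findall(text)


def unescape_slashes(url):
    """Replace each backslash-slash pair with a plain slash."""
    return url.replace('\\/', '/')


escaped = find_escaped_srcs(sample)
print(escaped)                                  # escaped form, as in the page
print([unescape_slashes(u) for u in escaped])   # usable URLs
```

To try it on the real page, pass the `html` string from the question to `find_escaped_srcs`. Whether the live page still contains these escaped strings is an assumption.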


edited Mar 5, 2015 at 9:19

answered Mar 5, 2015 at 8:20

frozenfrog


4 Comments

x Lu Over a year ago

It returns cdn4.hbimg.cn/store/tuku/310_999/piccommon/1220/12204/…

1.618 Over a year ago

Hmm... sorry for doubting. img isn't returned by dir() for some reason, but I see that it does exist.

x Lu Over a year ago

One interesting thing: if I use html, I still cannot get the string I want: i.imgur.com/W8VjJje.png?1

frozenfrog Over a year ago

I have edited my code, is it what you're looking for?

Prateek · Accepted Answer · 2015-03-05 09:37:42Z

0

Use urllib2 to fetch the page and lxml to parse it; you can then query the parsed document with XPath.

import urllib2
from lxml import html
url = "Sample url"

html_code = urllib2.urlopen(url).read()    # Read the response body; fromstring() needs the markup itself, not the response object.
parsed_source = html.fromstring(html_code) # This parses the markup into an element tree, on which XPath can be applied.
link = parsed_source.xpath("//a/@href")    # This returns a list of href values; modify the XPath to match the HTML of the page you are scraping.

This is only a sample of how to approach the problem; you will have to write your own XPath to get the data you need.
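If lxml is not installed, the same idea (walk the parsed tags and collect one attribute) can be sketched with the standard library's `HTMLParser`. This swaps XPath for a small handler class and is roughly what `//img/@src` would select. The `ImgSrcCollector` class and the `example.com` URLs are made up for illustration:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.srcs.append(value)


# Hypothetical markup standing in for a downloaded page.
sample = ('<ul><li><a href="/pic/1.htm">'
          '<img src="http://example.com/a.jpg" alt="a"></a></li>'
          '<li><img src="http://example.com/b.jpg"/></li></ul>')

collector = ImgSrcCollector()
collector.feed(sample)
print(collector.srcs)
```

Like the XPath version, this only sees real tags. Markup stored inside a JavaScript string is not reported as `img` elements.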

answered Mar 5, 2015 at 9:37

Prateek
