
I'm a little new to web parsing in Python. I am using Beautiful Soup. I would like to create a list by parsing strings from a webpage. I've looked around and can't seem to find the right answer. Does anyone know how to create a list of strings from a web page? Any help is appreciated.

My code is something like this:

from BeautifulSoup import BeautifulSoup
import urllib2

url="http://www.any_url.com"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

#The data I need is coming from HTML tag of td
page_find=soup.findAll('td')

for page_data in page_find:
    print page_data.string

#I tried to create my list here
page_List = [page_data.string]
print page_List
  • Your indents didn't come through properly. can you edit your question to fix them? Commented Feb 18, 2014 at 23:20
  • @mhlester Were you talking about my for loop? I just edited. Commented Feb 18, 2014 at 23:24
  • I was, thank you. But the for loop itself is indented further than it should be? Commented Feb 18, 2014 at 23:25
  • @mhlester Ok, fixed it. Sorry about the indents. Commented Feb 18, 2014 at 23:30
  • what are you trying to do? get all the page_data.string values into the page_List? Commented Feb 18, 2014 at 23:33

3 Answers


I'm having difficulty understanding what you are trying to achieve. If you want all the page_data.string values collected in page_List, then your code should look like this:

page_List = []
for page_data in page_find:
    page_List.append(page_data.string)

Or using a list comprehension:

page_List = [page_data.string for page_data in page_find]

The problem with your original code is that page_List is created outside the loop, so it holds only the text from the last td element the loop processed.
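To see the append-in-a-loop pattern in action without needing BeautifulSoup installed or a live URL, here is a minimal stdlib-only sketch that collects td text the same way (Python 3's html.parser standing in for BeautifulSoup; the inline HTML string is made up for illustration):

```python
from html.parser import HTMLParser  # stdlib stand-in for BeautifulSoup


class TdCollector(HTMLParser):
    """Collect the text inside each <td> into a list."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.page_List = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        # same idea as page_List.append(page_data.string) in the loop above
        if self.in_td:
            self.page_List.append(data)


parser = TdCollector()
parser.feed("<table><tr><td>one</td><td>two</td></tr></table>")
print(parser.page_List)  # ['one', 'two']
```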


1 Comment

Sorry about not being too clear. This appears to work. I'll work with this for a while. Thanks.

Here it is, modified to fetch the web page as a string:

import requests
from lxml import html

# fetch the page body as a string
the_web_page_as_a_string = requests.get(some_path).content

# parse it and collect every <td> element
myTree = html.fromstring(the_web_page_as_a_string)
td_list = [e for e in myTree.iter() if e.tag == 'td']

# pull the text out of each <td>
text_list = []
for td_e in td_list:
    text = td_e.text_content()
    text_list.append(text)

3 Comments

tdtext_list = [ e.text_content() for e in myTree.iter() if e.tag == 'td'] ?
Yes, that is faster and more efficient, but I always prefer to spell it out so a novice can play with each of the parts and understand it. I was also trying to decide if there needed to be some error handling in case one or more elements has no text content.
@PyNEwbi That's right I'm a little bit of a novice. Can you show where a url gets called in that statement? I haven't used lxml before.

I'd recommend lxml over BeautifulSoup; once you start scraping a lot of pages the speed advantage of lxml is hard to ignore.

import requests
import lxml.html

dom = lxml.html.fromstring(requests.get('http://www.any_url.com').content)
page_list = dom.xpath('//td/text()')
print page_list
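If lxml isn't installed, the same //td lookup can be approximated with the stdlib's xml.etree.ElementTree on well-formed markup (a rough sketch; ElementTree's limited XPath has no text() step, so we read .text instead, and the HTML string here is invented for illustration):

```python
import xml.etree.ElementTree as ET  # stdlib fallback if lxml is unavailable

html_snippet = "<table><tr><td>one</td><td>two</td></tr></table>"
root = ET.fromstring(html_snippet)

# './/td' finds every <td> anywhere under the root, like //td in lxml
page_list = [td.text for td in root.findall('.//td')]
print(page_list)  # ['one', 'two']
```

Note this only works on markup that parses as XML; real-world HTML usually needs the more forgiving lxml.html or BeautifulSoup parsers.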

1 Comment

Thanks for the code. I'll make a note even though I've already accepted the solution above.
