
I'm a little new to web parsing in Python. I am using Beautiful Soup. I would like to create a list by parsing strings from a webpage. I've looked around and can't seem to find the right answer. Does anyone know how to create a list of strings from a web page? Any help is appreciated.

My code is something like this:

from BeautifulSoup import BeautifulSoup
import urllib2

url="http://www.any_url.com"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

#The data I need is coming from HTML tag of td
page_find=soup.findAll('td')

for page_data in page_find:
    print page_data.string

#I tried to create my list here
page_List = [page_data.string]
print page_List
  • Your indents didn't come through properly. can you edit your question to fix them? Commented Feb 18, 2014 at 23:20
  • @mhlester Were you talking about my for loop? I just edited. Commented Feb 18, 2014 at 23:24
  • I was, thank you. But the for loop itself is indented further than it should be? Commented Feb 18, 2014 at 23:25
  • @mhlester Ok, fixed it. Sorry about the indents. Commented Feb 18, 2014 at 23:30
  • what are you trying to do? get all the page_data.string values into the page_List? Commented Feb 18, 2014 at 23:33

3 Answers


I'm having difficulty understanding what you are trying to achieve. If you want all the page_data.string values collected in page_List, then your code should look like this:

page_List = []
for page_data in page_find:
    page_List.append(page_data.string)

Or using a list comprehension:

page_List = [page_data.string for page_data in page_find]

The problem with your original code is that page_List is created outside the loop, so it holds only the text from the last td element the loop processed.
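To see the append-in-a-loop pattern in action without needing BeautifulSoup installed or a live URL, here is a minimal stdlib-only sketch that collects td text the same way (Python 3's html.parser standing in for BeautifulSoup; the inline HTML string is made up for illustration):

```python
from html.parser import HTMLParser  # stdlib stand-in for BeautifulSoup


class TdCollector(HTMLParser):
    """Collect the text inside each <td> into a list."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.page_List = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        # same idea as page_List.append(page_data.string) in the loop above
        if self.in_td:
            self.page_List.append(data)


parser = TdCollector()
parser.feed("<table><tr><td>one</td><td>two</td></tr></table>")
print(parser.page_List)  # ['one', 'two']
```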


1 Comment

Sorry about not being too clear. This appears to work. I'll work with this for a while. Thanks.

Here it is, modified to fetch the web page as a string:

import requests
from lxml import html

# fetch the page body as a string
the_web_page_as_a_string = requests.get(some_path).content

# parse it and collect every <td> element
myTree = html.fromstring(the_web_page_as_a_string)
td_list = [e for e in myTree.iter() if e.tag == 'td']

# pull the text out of each <td>
text_list = []
for td_e in td_list:
    text = td_e.text_content()
    text_list.append(text)

3 Comments

tdtext_list = [ e.text_content() for e in myTree.iter() if e.tag == 'td'] ?
Yes, that is faster and more efficient, but I always prefer to spell it out so a novice can play with each of the parts and understand it. I was also trying to decide if there needed to be some error handling in case one or more elements has no text content.
@PyNEwbi That's right I'm a little bit of a novice. Can you show where a url gets called in that statement? I haven't used lxml before.

I'd recommend lxml over BeautifulSoup; once you start scraping a lot of pages the speed advantage of lxml is hard to ignore.

import requests
import lxml.html

dom = lxml.html.fromstring(requests.get('http://www.any_url.com').content)
page_list = dom.xpath('//td/text()')
print page_list
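If lxml isn't installed, the same //td lookup can be approximated with the stdlib's xml.etree.ElementTree on well-formed markup (a rough sketch; ElementTree's limited XPath has no text() step, so we read .text instead, and the HTML string here is invented for illustration):

```python
import xml.etree.ElementTree as ET  # stdlib fallback if lxml is unavailable

html_snippet = "<table><tr><td>one</td><td>two</td></tr></table>"
root = ET.fromstring(html_snippet)

# './/td' finds every <td> anywhere under the root, like //td in lxml
page_list = [td.text for td in root.findall('.//td')]
print(page_list)  # ['one', 'two']
```

Note this only works on markup that parses as XML; real-world HTML usually needs the more forgiving lxml.html or BeautifulSoup parsers.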

1 Comment

Thanks for the code. I'll make a note even though I've already accepted the solution above.
