Nested tags web scraping python

Question

I am scraping fixed content from a particular website. The content lies inside nested divs, as shown below:

<div class="table-info">
  <div>
    <span>Time</span>
    <div class="overflow-hidden">
      <strong>Full</strong>
    </div>
  </div>
  <div>
    <span>Branch</span>
    <div class="overflow-hidden">
      <strong>IT</strong>
    </div>
  </div>
  <div>
    <span>Type</span>
    <div class="overflow-hidden">
      <strong>Standard</strong>
    </div>
  </div>
  <div>
    <span>contact</span>
    <div class="overflow-hidden">
      <strong>my location</strong>
    </div>
  </div>
</div>

I want to retrieve only the content of strong inside the div 'overflow-hidden' inside the span with the string value Branch. The code I've used is:

from bs4 import BeautifulSoup
import urllib2 
url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
soup = BeautifulSoup(content)
type = soup.find('div',attrs={"class":"table-info"}).findAll('span')
print type

I've scraped all the span content inside the main div 'table-info', so that I can use a conditional statement to retrieve the required content. But if I try to scrape the div content inside the span like this:

type = soup.find('div',attrs={"class":"table-info"}).findAll('span').find('div')
print type

I get this error:

AttributeError: 'list' object has no attribute 'find'

Can anyone please give me some idea of how to retrieve the content of the div in the span? Thank you. I'm using Python 2.7.

WGS · Accepted Answer · 2014-04-01 12:56:21Z

1

It seems you want the content of the second div inside the "table-info" div. However, you are trying to get it through a tag that has no relation to what you are trying to access.

 type = soup.find('div',attrs={"class":"table-info"}).findAll('span').find('div')

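raises an error because findAll returns a list, and a list has no find method. The spans also contain no div: each 'overflow-hidden' div sits next to its span, not inside it.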

Try this instead:

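from bs4 import BeautifulSoup
import urllib2

url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
soup = BeautifulSoup(content)
# findAll('div') searches recursively, so the list alternates between each
# outer div and its nested 'overflow-hidden' div; index 2 is the Branch row.
divs = soup.find('div', attrs={"class": "table-info"}).findAll('div')
print divs[2].find('strong').string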
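Why index 2 lands on the Branch row is easier to see on a small example. Here is a rough sketch, assuming Python 3 and a trimmed inline copy of the question's markup instead of the live page (the html string and the variable names are only for illustration). It compares the default recursive search with a search limited to direct children:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question; stands in for the real page.
html = """
<div class="table-info">
  <div><span>Time</span><div class="overflow-hidden"><strong>Full</strong></div></div>
  <div><span>Branch</span><div class="overflow-hidden"><strong>IT</strong></div></div>
  <div><span>Type</span><div class="overflow-hidden"><strong>Standard</strong></div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("div", attrs={"class": "table-info"})

# Default (recursive) search: outer div, inner div, outer div, inner div, ...
all_divs = table.find_all("div")
recursive_value = all_divs[2].find("strong").string

# Direct children only: one entry per row, so Branch is simply the second row.
rows = table.find_all("div", recursive=False)
row_value = rows[1].find("strong").string

print(len(all_divs), len(rows))
print(recursive_value, row_value)
```

With recursive=False the index matches the row number, which may be easier to follow if the page later gains more nested divs. Either way, positional indexing breaks if the rows are reordered.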

edited Apr 1, 2014 at 12:56

WGS

answered Apr 1, 2014 at 5:25

Anish

1 Comment

sulav_lfc Over a year ago

Thanks, the code worked. I guess I was following a totally wrong approach to solving the problem.

shaktimaan · Accepted Answer · 2014-04-01 05:04:39Z

findAll returns a list of BeautifulSoup elements, and find is defined on a single element, not on a list of them, hence the error. The first part of your code is fine. Do this instead:

from bs4 import BeautifulSoup
import urllib2

url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
soup = BeautifulSoup(content)

table = soup.find('div', attrs={"class": "table-info"})
spans = table.findAll('span')
branch_span = spans[1]  # the second span is the one labelled Branch
# Do your manipulation with branch_span
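As one possible way to do that manipulation, here is a small sketch, assuming Python 3 and an inline copy of the question's markup rather than the real page (the html string and the names branch_div and branch_value are made up for the example). It takes the second span and then moves sideways to the div next to it, because the div is not inside the span:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question; stands in for the real page.
html = """
<div class="table-info">
  <div><span>Time</span><div class="overflow-hidden"><strong>Full</strong></div></div>
  <div><span>Branch</span><div class="overflow-hidden"><strong>IT</strong></div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
spans = soup.find("div", attrs={"class": "table-info"}).find_all("span")

# find_all gives back a list-like ResultSet, so index into it first.
is_list = isinstance(spans, list)
branch_span = spans[1]

# Searching downwards from the span finds nothing: it only holds text.
div_inside_span = branch_span.find("div")

# The 'overflow-hidden' div is the span's sibling, so look sideways instead.
branch_div = branch_span.find_next_sibling("div")
branch_value = branch_div.find("strong").string

print(is_list, div_inside_span, branch_value)
```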

OR

from bs4 import BeautifulSoup
import urllib2

url = urllib2.urlopen("https://www.xyz.com")
content = url.read()
soup = BeautifulSoup(content)

table = soup.find('div', attrs={"class": "table-info"})
spans = table.findAll('span')

for span in spans:
    if span.text.lower() == 'branch':
        # Do your manipulation, for example print the value held in the
        # sibling 'overflow-hidden' div
        print span.find_next_sibling('div').find('strong').string
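If more than one field is needed, the same loop can collect every label and value in one pass. This is only a sketch, assuming Python 3 and an inline copy of the question's markup. The helper name read_fields is hypothetical, and it assumes every row follows the span-then-div layout shown in the question:

```python
from bs4 import BeautifulSoup

# Copy of the markup from the question; stands in for the downloaded page.
html = """
<div class="table-info">
  <div><span>Time</span><div class="overflow-hidden"><strong>Full</strong></div></div>
  <div><span>Branch</span><div class="overflow-hidden"><strong>IT</strong></div></div>
  <div><span>Type</span><div class="overflow-hidden"><strong>Standard</strong></div></div>
  <div><span>contact</span><div class="overflow-hidden"><strong>my location</strong></div></div>
</div>
"""


def read_fields(table):
    """Map each lower-cased span label to the text of its sibling's <strong>."""
    fields = {}
    for span in table.find_all("span"):
        value_div = span.find_next_sibling("div", class_="overflow-hidden")
        if value_div is None or value_div.strong is None:
            continue  # skip rows that do not follow the expected layout
        label = span.get_text(strip=True).lower()
        fields[label] = value_div.strong.get_text(strip=True)
    return fields


soup = BeautifulSoup(html, "html.parser")
table = soup.find("div", attrs={"class": "table-info"})
fields = read_fields(table)
print(fields["branch"])
```

Looking values up by label instead of by position keeps working if the site reorders the rows, at the cost of depending on the label text staying the same.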
