Extract data from HTML in sequence

Question

How do I extract every table, together with the data that precedes it, from a Wikipedia page such as https://en.wikipedia.org/wiki/List_of_birds_of_Trinidad_and_Tobago, where the data follows this repeated format:

 <p>
  <b>
   Order
  </b>
  :
  <a class="mw-redirect" href="/wiki/Passeriformes" title="Passeriformes">
   Passeriformes
  </a>
  <span class="nowrap">
  </span>
  <b>
   Family
  </b>
  :
  <a class="mw-redirect" href="/wiki/Passeridae" title="Passeridae">
   Passeridae
  </a>
 </p>
 <p>
  <a href="/wiki/Sparrow" title="Sparrow">
   Sparrows
  </a>
  are small passerine birds ...
 </p>
 <table class="wikitable" width="72%">
  <tr>
   <th width="24%">
    Common name
   </th>
   <th width="24%">
    Binomial
   </th>
   <th width="24%">
    Status
   </th>
  </tr>
  <tr>
   <td>
    <a href="/wiki/House_sparrow" title="House sparrow">
     House sparrow
    </a>
   </td>
   <td>
    <i>
     Passer domesticus
    </i>
   </td>
   <td>
    Trinidad only - Introduced species
   </td>
  </tr>
 </table>

The desired output format is:

Order, Family, Description, Name, Binomial, Status.

Keyur Potdar · Accepted Answer · 2018-04-06 10:39:00Z

Approach:

All the tags you want are siblings of each other, so you can walk from one to the next with the find_next_sibling() method.

Explanation:

The name (title) of each bird group is inside an <h2> tag. The first <h2> tag is the Contents heading, so skip it. Order and Family are inside the first <p> tag that follows the <h2> tag; you can find it with h2.find_next_sibling('p'). The table with Name, Binomial, and Status can be found with h2.find_next_sibling('table').

With these pieces you can print all the details you want. You do have to break out of the loop when you reach the <h2> tag that contains References, because no bird sections follow it. This can be done using

if h2.find('span', class_='mw-headline').text == 'References':
    break
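To try the sibling lookups without hitting the network, here is a minimal sketch that runs them on a hand-written fragment. The markup below is an assumption: a trimmed imitation of one section of the page, not a copy of it, and it uses the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up, trimmed markup that imitates one section of the page.
html = """
<div>
<h2><span class="mw-headline">Old World sparrows</span></h2>
<p><b>Order</b>: <a href="/wiki/Passeriformes">Passeriformes</a>
   <b>Family</b>: <a href="/wiki/Passeridae">Passeridae</a></p>
<p><a href="/wiki/Sparrow">Sparrows</a> are small passerine birds.</p>
<table class="wikitable">
<tr><th>Common name</th><th>Binomial</th><th>Status</th></tr>
<tr><td>House sparrow</td><td><i>Passer domesticus</i></td><td>Trinidad only</td></tr>
</table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# find_next_sibling() skips the whitespace between tags and returns
# the first following sibling tag with the requested name.
p_tag = h2.find_next_sibling("p")
table = h2.find_next_sibling("table")

# The Order/Family paragraph contains exactly two links.
links = [a.get_text(strip=True) for a in p_tag.find_all("a")]
# Row 0 is the header row, so row 1 is the first data row.
first_row = [td.get_text(strip=True) for td in table.find_all("tr")[1].find_all("td")]

print(links)
print(first_row)
```

Note that this only works because the <h2>, <p>, and <table> tags share the same parent. If a heading is wrapped in its own container, find_next_sibling() called on the <h2> will not reach the paragraph.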

Code:

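import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/List_of_birds_of_Trinidad_and_Tobago')
soup = BeautifulSoup(r.text, 'lxml')

# Skip the first <h2>, which is the "Contents" heading.
for bird in soup.find_all('h2')[1:]:
    title = bird.find('span', class_='mw-headline').text
    # Stop at "References"; no bird sections follow it.
    if title == 'References':
        break
    print(title)
    # The first <p> after the heading holds the Order and Family links.
    p_tag = bird.find_next_sibling('p')
    order, family = [x.text for x in p_tag.find_all('a')]
    table = p_tag.find_next_sibling('table')
    # Skip the header row, then unpack the three cells of each data row.
    for row in table.find_all('tr')[1:]:
        name, binomial, status = [x.text for x in row.find_all('td')]
        print(order, family, name, binomial, status, sep=' | ')
    print()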

Partial output:

Tinamous
Tinamiformes | Tinamidae | Little tinamou | Crypturellus soui | Trinidad only

Screamers
Anseriformes | Anhimidae | Horned screamer | Anhima cornuta | Trinidad only - rare/accidental

Ducks, geese, and waterfowl
Anseriformes | Anatidae | Fulvous whistling-duck | Dendrocygna bicolor | Trinidad only
Anseriformes | Anatidae | White-faced whistling-duck | Dendrocygna viduata | Trinidad only - rare/accidental
...

...
Waxbills and allies
Passeriformes | Estrildidae | Common waxbill | Estrilda astrild | Trinidad, accidental Tobago - introduced species
Passeriformes | Estrildidae | Tricolored munia | Lonchura malacca | Trinidad only - introduced species

Old World sparrows
Passeriformes | Passeridae | House sparrow | Passer domesticus | Trinidad only - Introduced species
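
The code above does not print the Description column that the question asks for. One possible way to add it, sketched here under the assumption that the description is always the <p> tag immediately after the Order/Family paragraph, is to take one more sibling step. The extract_rows function and the SAMPLE markup are hypothetical names and made-up data used for illustration; the real page may be structured differently, and its markup may have changed since this answer was written.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the page layout: Contents, one section, References.
SAMPLE = """
<div>
<h2><span class="mw-headline">Contents</span></h2>
<h2><span class="mw-headline">Tinamous</span></h2>
<p><b>Order</b>: <a>Tinamiformes</a> <b>Family</b>: <a>Tinamidae</a></p>
<p>Placeholder description of tinamous.</p>
<table>
<tr><th>Common name</th><th>Binomial</th><th>Status</th></tr>
<tr><td><a>Little tinamou</a></td><td><i>Crypturellus soui</i></td><td>Trinidad only</td></tr>
</table>
<h2><span class="mw-headline">References</span></h2>
</div>
"""

def extract_rows(soup):
    """Return (order, family, description, name, binomial, status) tuples."""
    rows = []
    for h2 in soup.find_all("h2")[1:]:  # skip the Contents heading
        title = h2.find("span", class_="mw-headline").get_text(strip=True)
        if title == "References":
            break
        taxon_p = h2.find_next_sibling("p")
        order, family = [a.get_text(strip=True) for a in taxon_p.find_all("a")]
        # Assumption: the description is the next <p> after Order/Family.
        description = taxon_p.find_next_sibling("p").get_text(" ", strip=True)
        table = taxon_p.find_next_sibling("table")
        for tr in table.find_all("tr")[1:]:  # skip the header row
            name, binomial, status = [
                td.get_text(strip=True) for td in tr.find_all("td")
            ]
            rows.append((order, family, description, name, binomial, status))
    return rows

rows = extract_rows(BeautifulSoup(SAMPLE, "html.parser"))
for row in rows:
    print(" | ".join(row))
```

Real descriptions contain commas, so if you want comma-separated output, writing the tuples with csv.writer from the standard library is safer than joining them by hand.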
