How do I extract all table contents and the data preceding each table from a Wikipedia page, e.g. https://en.wikipedia.org/wiki/List_of_birds_of_Trinidad_and_Tobago, where the data is in this repeated format?

 <p>
  <b>
   Order
  </b>
  :
  <a class="mw-redirect" href="/wiki/Passeriformes" title="Passeriformes">
   Passeriformes
  </a>
  <span class="nowrap">
  </span>
  <b>
   Family
  </b>
  :
  <a class="mw-redirect" href="/wiki/Passeridae" title="Passeridae">
   Passeridae
  </a>
 </p>
 <p>
  <a href="/wiki/Sparrow" title="Sparrow">
   Sparrows
  </a>
  are small passerine birds ...
 </p>
 <table class="wikitable" width="72%">
  <tr>
   <th width="24%">
    Common name
   </th>
   <th width="24%">
    Binomial
   </th>
   <th width="24%">
    Status
   </th>
  </tr>
  <tr>
   <td>
    <a href="/wiki/House_sparrow" title="House sparrow">
     House sparrow
    </a>
   </td>
   <td>
    <i>
     Passer domesticus
    </i>
   </td>
   <td>
    Trinidad only - Introduced species
   </td>
  </tr>
 </table>

The desired output format is:

Order, Family, Description, Name, Binomial, Status.

1 Answer

Approach:

All the wanted tags are siblings of each other, so you can use BeautifulSoup's find_next_sibling() method to move from one to the next.

Explanation:

The name (title) of each bird group is inside an <h2> tag. The first <h2> tag is the Contents heading, so skip it. Order and Family are inside the <p> tag that comes after the <h2> tag; you can find it with h2.find_next_sibling('p'). The table with Name, Binomial, and Status can be found with h2.find_next_sibling('table').
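
To try the sibling navigation described above without hitting the network, here is a minimal sketch. SAMPLE_HTML is a hypothetical, trimmed-down stand-in for one section of the page (built from the fragment in the question), and 'html.parser' is used instead of 'lxml' only so that no extra parser is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one section of the page, based on the
# fragment shown in the question (not fetched from Wikipedia).
SAMPLE_HTML = """
<h2><span class="mw-headline">Old World sparrows</span></h2>
<p><b>Order</b>: <a href="/wiki/Passeriformes">Passeriformes</a>
<b>Family</b>: <a href="/wiki/Passeridae">Passeridae</a></p>
<p><a href="/wiki/Sparrow">Sparrows</a> are small passerine birds.</p>
<table class="wikitable">
<tr><th>Common name</th><th>Binomial</th><th>Status</th></tr>
<tr><td><a href="/wiki/House_sparrow">House sparrow</a></td>
<td><i>Passer domesticus</i></td>
<td>Trinidad only - Introduced species</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
h2 = soup.find("h2")

# First <p> after the heading: holds the Order and Family links.
p_tag = h2.find_next_sibling("p")
# First <table> after the heading: intermediate <p> siblings are skipped.
table = h2.find_next_sibling("table")

links = [a.get_text(strip=True) for a in p_tag.find_all("a")]
# Row 0 is the header row, so row 1 is the first data row.
first_row = [td.get_text(strip=True) for td in table.find_all("tr")[1].find_all("td")]

print(links)
print(first_row)
```

Note that find_next_sibling('table') only filters by tag name, so it steps over the description paragraph that sits between the Order/Family paragraph and the table.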

Using all this, you can print all the details you want. You'll have to break out of the loop, though, when you reach the <h2> tag that contains References. This can be done using:

if h2.find('span', class_='mw-headline').text == 'References':
    break

Code:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/List_of_birds_of_Trinidad_and_Tobago')
soup = BeautifulSoup(r.text, 'lxml')

# Skip the first <h2>, which is the Contents heading
for bird in soup.find_all('h2')[1:]:
    title = bird.find('span', class_='mw-headline').text
    # Stop once the bird sections are over
    if title == 'References':
        break
    print(title)
    # The first <p> after the heading holds the Order and Family links
    p_tag = bird.find_next_sibling('p')
    order, family = [x.text for x in p_tag.find_all('a')]
    table = p_tag.find_next_sibling('table')
    # Skip the header row of the table
    for row in table.find_all('tr')[1:]:
        name, binomial, status = [x.text for x in row.find_all('td')]
        print(order, family, name, binomial, status, sep=' | ')
    print()

Partial output:

Tinamous
Tinamiformes | Tinamidae | Little tinamou | Crypturellus soui | Trinidad only

Screamers
Anseriformes | Anhimidae | Horned screamer | Anhima cornuta | Trinidad only - rare/accidental

Ducks, geese, and waterfowl
Anseriformes | Anatidae | Fulvous whistling-duck | Dendrocygna bicolor | Trinidad only
Anseriformes | Anatidae | White-faced whistling-duck | Dendrocygna viduata | Trinidad only - rare/accidental
...

...
Waxbills and allies
Passeriformes | Estrildidae | Common waxbill | Estrilda astrild | Trinidad, accidental Tobago - introduced species
Passeriformes | Estrildidae | Tricolored munia | Lonchura malacca | Trinidad only - introduced species

Old World sparrows
Passeriformes | Passeridae | House sparrow | Passer domesticus | Trinidad only - Introduced species
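
The question also asks for a Description column, which the code above does not print. One possible extension, as a sketch only: collect any <p> siblings that sit between the Order/Family paragraph and the table, and write the six fields as CSV. The helper names extract_rows and to_csv and the SAMPLE_HTML string are hypothetical; the sample is built from the fragment in the question and the partial output above, with the description deliberately left out of the first section to show the guard. On the live page you would still skip the Contents heading and stop at References as described earlier, and the page's markup may have changed since this answer was written.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical two-section sample; the first section has no description.
SAMPLE_HTML = """
<h2><span class="mw-headline">Waxbills and allies</span></h2>
<p><b>Order</b>: <a href="/wiki/Passeriformes">Passeriformes</a>
<b>Family</b>: <a href="/wiki/Estrildidae">Estrildidae</a></p>
<table class="wikitable">
<tr><th>Common name</th><th>Binomial</th><th>Status</th></tr>
<tr><td>Common waxbill</td><td><i>Estrilda astrild</i></td>
<td>Trinidad, accidental Tobago - introduced species</td></tr>
</table>
<h2><span class="mw-headline">Old World sparrows</span></h2>
<p><b>Order</b>: <a href="/wiki/Passeriformes">Passeriformes</a>
<b>Family</b>: <a href="/wiki/Passeridae">Passeridae</a></p>
<p><a href="/wiki/Sparrow">Sparrows</a> are small passerine birds.</p>
<table class="wikitable">
<tr><th>Common name</th><th>Binomial</th><th>Status</th></tr>
<tr><td><a href="/wiki/House_sparrow">House sparrow</a></td>
<td><i>Passer domesticus</i></td>
<td>Trinidad only - Introduced species</td></tr>
</table>
"""


def extract_rows(html):
    """Return [order, family, description, name, binomial, status] lists."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for h2 in soup.find_all("h2"):
        p_tag = h2.find_next_sibling("p")
        table = h2.find_next_sibling("table")
        if p_tag is None or table is None:
            continue
        names = [a.get_text(strip=True) for a in p_tag.find_all("a")]
        if len(names) != 2:
            continue  # not an Order/Family paragraph
        order, family = names
        # Gather <p> tags between the Order/Family paragraph and the table.
        parts = []
        for sib in p_tag.find_next_siblings(True):  # True matches any tag
            if sib is table:
                break
            if sib.name == "p":
                parts.append(" ".join(sib.get_text().split()))
        description = " ".join(parts)
        for tr in table.find_all("tr")[1:]:  # skip the header row
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if len(cells) == 3:
                rows.append([order, family, description, *cells])
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text with the header from the question."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Order", "Family", "Description", "Name", "Binomial", "Status"])
    writer.writerows(rows)
    return buf.getvalue()


print(to_csv(extract_rows(SAMPLE_HTML)))
```

Stopping at the section's own table keeps a section without a description from picking up the next section's paragraph, and csv.writer takes care of quoting status values that contain commas.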

1 Comment

Beautiful. Well written explanation as well. Thanks.
