Scraping data with Python and Pandas

Question

I'm trying to obtain a table of results with this code:

import pandas as pd
url = 'https://www.betfair.co.uk/sport/football'
df = pd.read_html(url, header = None)
df[0]

The url may vary if you are not in the UK.

I thought it would be like this bit of code, which works perfectly (I get the table) for me.

import pandas as pd
url = 'https://en.wikipedia.org/wiki/Opinion_polling_for_the_French_presidential_election,_2017'
df = pd.read_html(url, skiprows=3) 
df[0]

In the first example, the HTML is organized around <ul> and <li>.

In the second, it's a proper table built with <table> tags.

How can I tweak pandas to obtain the data in the first case?

Chris Thompson · Accepted Answer · 2017-04-11 18:26:20Z

Unfortunately, pandas.read_html (docs) only extracts data from HTML tables:

import pandas as pd
html = '''<html>
            <body>
              <table>
                <tr>
                  <th>Col1</th>
                  <th>Col2</th>
                </tr>
                <tr>
                  <td>Val1</td>
                  <td>Val2</td>
                </tr>
              </table>
            </body>
          </html>'''
# read_html returns a list of DataFrames, one per <table> it finds
dfs = pd.read_html(html)
dfs[0]

Output:

   0     1
0  Col1  Col2
1  Val1  Val2
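
A quick way to tell which case a page falls into is to count its <table> tags before calling read_html. The following is a minimal sketch using only the standard library; the TagCounter class, the count_tags helper, and the two sample strings are illustrative assumptions, not part of pandas.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each opening tag appears in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; tag names arrive lowercased
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name to number of opening tags."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


# Illustrative inputs: one real table, one list-based layout
table_page = "<table><tr><td>Val1</td></tr></table>"
list_page = "<ul><li>Val1</li><li>Val2</li></ul>"

print(count_tags(table_page)["table"])  # 1
print(count_tags(list_page)["table"])   # 0
print(count_tags(list_page)["li"])      # 2
```

If the count for "table" is zero, read_html has nothing to extract and the page needs a different parsing approach.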

When the HTML contains an unordered list instead, as on the page in your first example, pandas.read_html won't work. You can instead parse the list (and all of its children) using an HTML parsing library like BeautifulSoup4 and build up the dataframe row by row. Here's a simple example:

import pandas as pd
from bs4 import BeautifulSoup

html = '''<html>
            <body>
              <ul id="target">
                <li class="row">
                  Name
                  <ul class="details">
                    <li class="Col1">Val1</li>
                    <li class="Col2">Val2</li>
                  </ul>
                </li>
              </ul>
            </body>
          </html>'''

# Parse the HTML string
soup = BeautifulSoup(html, 'lxml')

# Select the target <ul> and build dicts for each row
data_dicts = []
target = soup.select('#target')[0]
for row in target.select('.row'):
    row_dict = {}
    row_dict['name'] = row.contents[0].strip() # Remove excess whitespace
    details = row.select('.details')
    for col in details[0].findChildren('li'):
        col_name = col.attrs['class'][0]
        col_value = col.text.strip()
        row_dict[col_name] = col_value
    data_dicts.append(row_dict)

# Convert list of dicts to dataframe
df = pd.DataFrame(data_dicts)

Output:

   Col1  Col2  name
0  Val1  Val2  Name

Some combination of findChildren and select should let you extract each sub-component of the list-based table on the site you linked. BeautifulSoup has a lot of ways of digging through HTML, so I strongly recommend working through some examples and looking through the documentation if you get stuck trying to parse out a specific set of elements.
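
As a rough sketch of how the same approach extends to several rows, the example below wraps the loop in a helper and feeds it a list where one row lacks a column. The markup, the list_to_records name, and the selectors are hypothetical and would need adapting to the real page; it also uses Python's built-in 'html.parser' rather than 'lxml' so that it runs without an extra dependency.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: two rows, the second one has no "Col2" entry
html = '''
<ul id="target">
  <li class="row">First
    <ul class="details">
      <li class="Col1">A1</li>
      <li class="Col2">A2</li>
    </ul>
  </li>
  <li class="row">Second
    <ul class="details">
      <li class="Col1">B1</li>
    </ul>
  </li>
</ul>
'''


def list_to_records(html, list_selector="#target", row_selector=".row"):
    """Turn a nested <ul> into a list of dicts, one dict per row."""
    soup = BeautifulSoup(html, "html.parser")
    target = soup.select_one(list_selector)
    if target is None:
        raise ValueError(f"No element matches {list_selector!r}")

    records = []
    for row in target.select(row_selector):
        # The row label is the first text node inside the <li>
        record = {"name": row.contents[0].strip()}
        details = row.select_one(".details")
        # recursive=False keeps only the direct <li> children
        for col in details.find_all("li", recursive=False):
            record[col["class"][0]] = col.get_text(strip=True)
        records.append(record)
    return records


records = list_to_records(html)
df = pd.DataFrame(records)
```

Because each row is a dict, pd.DataFrame lines the columns up by key and fills the missing "Col2" value in the second row with NaN.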
