I'm trying to obtain a table of results with this code:

import pandas as pd
url = 'https://www.betfair.co.uk/sport/football'
df = pd.read_html(url, header = None)
df[0]

The url may vary if you are not in the UK.

I thought it would be like this bit of code, which works perfectly (I get the table) for me.

import pandas as pd
url = 'https://en.wikipedia.org/wiki/Opinion_polling_for_the_French_presidential_election,_2017'
df = pd.read_html(url, skiprows=3) 
df[0]

In the first example, the HTML is organized around <ul> and <li> elements.

In the second, it's a proper table built with <table> elements.

How can I tweak pandas to obtain the data in the first case?

1 Answer

Unfortunately, pandas.read_html only extracts data from HTML <table> elements:

import pandas as pd
html = '''<html>
            <body>
              <table>
                <tr>
                  <th>Col1</th>
                  <th>Col2</th>
                </tr>
                <tr>
                  <td>Val1</td>
                  <td>Val2</td>
                </tr>
              </table>
            </body>
          </html>'''
dfs = pd.read_html(html)  # returns a list of DataFrames, one per <table>
dfs[0]

Output:

   0     1
0  Col1  Col2
1  Val1  Val2

When the HTML contains an unordered list instead, as on the Betfair page, the existing pandas function won't work. You can instead parse the list (and all of its children) using an HTML parsing library like BeautifulSoup4 and build up the dataframe row by row. Here's a simple example:

import pandas as pd
from bs4 import BeautifulSoup

html = '''<html>
            <body>
              <ul id="target">
                <li class="row">
                  Name
                  <ul class="details">
                    <li class="Col1">Val1</li>
                    <li class="Col2">Val2</li>
                  </ul>
                </li>
              </ul>
            </body>
          </html>'''

# Parse the HTML string
soup = BeautifulSoup(html, 'lxml')

# Select the target <ul> and build dicts for each row
data_dicts = []
target = soup.select('#target')[0]
for row in target.select('.row'):
    row_dict = {}
    row_dict['name'] = row.contents[0].strip() # Remove excess whitespace
    details = row.select('.details')
    for col in details[0].findChildren('li'):
        col_name = col.attrs['class'][0]
        col_value = col.text.strip()
        row_dict[col_name] = col_value
    data_dicts.append(row_dict)

# Convert list of dicts to dataframe
df = pd.DataFrame(data_dicts)

Output:

   Col1  Col2  name
0  Val1  Val2  Name

Some combination of findChildren and select should let you extract each sub-component of the list-based table on the site you linked. BeautifulSoup has a lot of ways of digging through HTML, so I strongly recommend working through some examples and reading the documentation if you get stuck trying to parse out a specific set of elements.
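
If installing BeautifulSoup is not an option, the same row-by-row idea can be sketched with html.parser from the standard library. This is only a rough illustration, not a drop-in replacement: the ListRowParser class is a hypothetical helper, and it assumes the same toy markup as above (each row is a top-level <li class="row"> whose nested <li> elements carry the column name in their class attribute). The real Betfair markup will almost certainly need different rules.

```python
from html.parser import HTMLParser


class ListRowParser(HTMLParser):
    """Hypothetical helper: one dict per <li class="row">, keyed by nested <li> classes."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished row dicts
        self._row = None     # row currently being built
        self._col = None     # class name of the nested <li> being read
        self._li_depth = 0   # number of <li> elements currently open

    def handle_starttag(self, tag, attrs):
        if tag != 'li':
            return
        self._li_depth += 1
        classes = (dict(attrs).get('class') or '').split()
        if self._li_depth == 1 and 'row' in classes:
            # Assumption: rows are the outermost <li> elements
            self._row = {'name': ''}
        elif self._row is not None and classes:
            # Nested <li>: its first class becomes the column name
            self._col = classes[0]
            self._row[self._col] = ''

    def handle_data(self, data):
        text = data.strip()  # Remove excess whitespace
        if not text or self._row is None:
            return
        key = self._col if self._col is not None else 'name'
        self._row[key] += text

    def handle_endtag(self, tag):
        if tag != 'li':
            return
        self._li_depth -= 1
        if self._col is not None:
            self._col = None                # closed a nested column <li>
        elif self._row is not None and self._li_depth == 0:
            self.rows.append(self._row)     # closed the row <li> itself
            self._row = None


html = '''<ul id="target">
            <li class="row">
              Name
              <ul class="details">
                <li class="Col1">Val1</li>
                <li class="Col2">Val2</li>
              </ul>
            </li>
          </ul>'''

parser = ListRowParser()
parser.feed(html)
rows = parser.rows
print(rows)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) exactly like data_dicts in the BeautifulSoup version. For anything beyond simple, regular markup, BeautifulSoup's selectors are likely to be much less work than hand-tracking parser state like this.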
