python pandas read HTML table

Question

pd.read_html is reading only the first 5 rows of the first (zeroth) table. How can I read the whole table using pd.read_html?

I have tried the code below:

import pandas as pd
import requests
from urllib.error import HTTPError

try:
    url = "https://clinicaltrials.gov/ct2/history/NCT02954874"
    html_data2 = requests.get(url)
    df = pd.read_html(html_data2.text)[0]
    data = df.head()
    print(data)
except HTTPError as http_error:
    print("HTTP error: ", http_error)

Change data = df.head() to data = df, or just use data = pd.read_html(html_data2.text)[0] and get rid of the extra line.
– anky, Commented Jan 2, 2020 at 15:04
@anky_91: Thanks, this worked. Please post an answer and I will accept it.
– Harsha Biyani, Commented Jan 2, 2020 at 15:06
@anky_91: Please post that as an answer. It would also be useful to show a snippet of the two outputs and how they differ...
– smci, Commented Jan 2, 2020 at 15:11

anky · Accepted Answer · 2021-05-20 09:34:39Z

You are assigning data as df.head(), which returns only the first 5 rows of a DataFrame. Instead you can do:

url = "https://clinicaltrials.gov/ct2/history/NCT02954874"
html_data2 = requests.get(url)
df = pd.read_html(html_data2.text)[0]
data = df  # not df.head()
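
To see the difference without hitting the website, here is a small sketch using a made-up 20-row frame as a stand-in for the scraped table (the frame and its contents are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical stand-in for the scraped table: 20 rows instead of 563.
df = pd.DataFrame({"Version": range(1, 21)})

first_five = df.head()    # head() defaults to n=5
first_ten = df.head(10)   # pass n to get a different number of rows

print(len(first_five))  # rows kept by head()
print(len(first_ten))   # rows kept by head(10)
print(len(df))          # rows in the whole frame
```

So head() is only a preview helper; the full table is already in df.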

Also, pandas can read HTML directly from a URL, so you can just do:

data = pd.read_html(r"https://clinicaltrials.gov/ct2/history/NCT02954874")[0]

and put that inside your try/except block.
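
A minimal sketch of that combination, assuming pandas downloads URLs through urllib (so failures would surface as urllib.error exceptions); read_first_table is a hypothetical helper name, not part of pandas:

```python
from urllib.error import HTTPError, URLError

import pandas as pd


def read_first_table(url):
    """Return the first table found at url, or None if the download fails.

    Hypothetical helper for illustration only.
    """
    try:
        # read_html returns a list of DataFrames, one per <table> on the page.
        return pd.read_html(url)[0]
    except HTTPError as http_error:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print("HTTP error:", http_error)
    except URLError as url_error:
        print("Could not reach the server:", url_error)
    return None


if __name__ == "__main__":
    data = read_first_table("https://clinicaltrials.gov/ct2/history/NCT02954874")
    print(data)
```

Note that in the requests-based version from the question, catching urllib's HTTPError is unlikely to do anything: requests.get does not raise on HTTP error status codes unless you call raise_for_status(), and then it raises its own requests.exceptions.HTTPError.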

Outputs:

url = "https://clinicaltrials.gov/ct2/history/NCT02954874"
html_data2 = requests.get(url)
df = pd.read_html(html_data2.text)[0]
data = df.head()
print(data)

   Version   A   B     Submitted Date                               Changes
0        1 NaN NaN   November 3, 2016  Nothing (earliest Version on record)
1        2 NaN NaN  November 24, 2016   Contacts/Locations and Study Status
2        3 NaN NaN  November 28, 2016   Recruitment Status and Study Status
3        4 NaN NaN  December 15, 2016   Contacts/Locations and Study Status
4        5 NaN NaN  December 19, 2016   Contacts/Locations and Study Status

versus:

url = "https://clinicaltrials.gov/ct2/history/NCT02954874"
html_data2 = requests.get(url)
df = pd.read_html(html_data2.text)[0]
data = df
print(data)

     Version   A   B     Submitted Date                               Changes
0          1 NaN NaN   November 3, 2016  Nothing (earliest Version on record)
1          2 NaN NaN  November 24, 2016   Contacts/Locations and Study Status
2          3 NaN NaN  November 28, 2016   Recruitment Status and Study Status
3          4 NaN NaN  December 15, 2016   Contacts/Locations and Study Status
4          5 NaN NaN  December 19, 2016   Contacts/Locations and Study Status
..       ...  ..  ..                ...                                   ...
558      559 NaN NaN  December 19, 2019   Contacts/Locations and Study Status
559      560 NaN NaN  December 20, 2019   Contacts/Locations and Study Status
560      561 NaN NaN  December 23, 2019   Contacts/Locations and Study Status
561      562 NaN NaN  December 25, 2019   Contacts/Locations and Study Status
562      563 NaN NaN  December 27, 2019   Contacts/Locations and Study Status

[563 rows x 5 columns]
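
The ".." row in the second output is only pandas abbreviating the printed display; all 563 rows are in the DataFrame. If you want every row printed, one option is sketched below, using a made-up 100-row frame as a stand-in for the scraped table (the data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the scraped table, long enough to be abbreviated.
df = pd.DataFrame({
    "Version": range(1, 101),
    "Changes": ["Study Status"] * 100,
})

# Default printing abbreviates frames longer than display.max_rows (60 by default).
print(df)

# Temporarily lift the row limit so every row is printed.
with pd.option_context("display.max_rows", None):
    print(df)

# Alternatively, render the whole frame to a string regardless of display options.
full_text = df.to_string()
print(full_text)
```

option_context restores the previous setting when the with block exits, so the rest of the session keeps the abbreviated display.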
