pd.read_html is reading only the first 5 rows from the first (zeroth) table. How do I read the whole table using pd.read_html?

I have tried the code below:

import pandas as pd
import requests
from urllib.error import HTTPError

try:
    url = "https://clinicaltrials.gov/ct2/history/NCT02954874"
    html_data2 = requests.get(url)
    df = pd.read_html(html_data2.text)[0]
    data = df.head()
    print(data)
except HTTPError as http_error:
    print("HTTP error: ", http_error)
  • Change data = df.head() to data = df, or just use data = pd.read_html(html_data2.text)[0] and get rid of the extra line. Commented Jan 2, 2020 at 15:04
  • @anky_91: Thanks, this worked. Please post an answer. I will accept the same. Commented Jan 2, 2020 at 15:06
  • @anky_91: please post that as an answer. Also useful to show a snippet of the two outputs and how they differ... Commented Jan 2, 2020 at 15:11

1 Answer

You are assigning df.head() to data, and head() returns only the first 5 rows of a DataFrame by default. read_html itself has already parsed the whole table. Instead you can do:

url = "https://clinicaltrials.gov/ct2/history/NCT02954874"
html_data2 = requests.get(url)
df = pd.read_html(html_data2.text)[0]
data = df  # not df.head()
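
To see the difference without hitting the network, here is a minimal sketch. It assumes a made-up one-column DataFrame with 563 rows as a stand-in for the scraped table; only head() and len() are used.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table (563 rows, like the page above).
df = pd.DataFrame({"Version": range(1, 564)})

preview = df.head()  # head() defaults to n=5, so this keeps only 5 rows
data = df            # the full table, nothing dropped

print(len(preview))  # 5
print(len(data))     # 563
```

So head() is only a preview helper; pass a number, e.g. df.head(10), if you want a longer preview.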

Also, pandas can read HTML directly from a URL, so you can just do:

data = pd.read_html(r"https://clinicaltrials.gov/ct2/history/NCT02954874")[0]

and put that inside your try/except block.
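
A rough sketch of what that could look like. The helper name read_first_table is hypothetical, and the inline HTML is made-up sample data so the snippet runs offline. As far as I know, read_html fetches URLs through urllib, so urllib.error.HTTPError is the relevant exception when you pass a URL (requests.get on its own does not raise it). Newer pandas versions also deprecate passing a literal HTML string, hence the StringIO wrapper.

```python
from io import StringIO
from urllib.error import HTTPError, URLError

import pandas as pd


def read_first_table(source):
    """Return the first table in `source` (URL or file-like) as a full DataFrame.

    Hypothetical helper, not part of pandas.
    """
    try:
        return pd.read_html(source)[0]
    except (HTTPError, URLError) as err:
        # Raised when `source` is a URL and the request fails.
        print("Request failed:", err)
        return None


# Made-up sample page with 7 data rows, so no network access is needed.
rows = "".join(f"<tr><td>{i}</td></tr>" for i in range(1, 8))
html = f"<table><tr><th>Version</th></tr>{rows}</table>"

data = read_first_table(StringIO(html))
print(data.shape)
```

For the real page you would call read_first_table("https://clinicaltrials.gov/ct2/history/NCT02954874") instead. Note that read_html raises ValueError when no table is found, which this sketch does not catch.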

Outputs:

url = "https://clinicaltrials.gov/ct2/history/NCT02954874"
html_data2 = requests.get(url)
df = pd.read_html(html_data2.text)[0]
data = df.head()
print(data)

   Version   A   B     Submitted Date                               Changes
0        1 NaN NaN   November 3, 2016  Nothing (earliest Version on record)
1        2 NaN NaN  November 24, 2016   Contacts/Locations and Study Status
2        3 NaN NaN  November 28, 2016   Recruitment Status and Study Status
3        4 NaN NaN  December 15, 2016   Contacts/Locations and Study Status
4        5 NaN NaN  December 19, 2016   Contacts/Locations and Study Status

Vs

url = "https://clinicaltrials.gov/ct2/history/NCT02954874"
html_data2 = requests.get(url)
df = pd.read_html(html_data2.text)[0]
data = df
print(data)

     Version   A   B     Submitted Date                               Changes
0          1 NaN NaN   November 3, 2016  Nothing (earliest Version on record)
1          2 NaN NaN  November 24, 2016   Contacts/Locations and Study Status
2          3 NaN NaN  November 28, 2016   Recruitment Status and Study Status
3          4 NaN NaN  December 15, 2016   Contacts/Locations and Study Status
4          5 NaN NaN  December 19, 2016   Contacts/Locations and Study Status
..       ...  ..  ..                ...                                   ...
558      559 NaN NaN  December 19, 2019   Contacts/Locations and Study Status
559      560 NaN NaN  December 20, 2019   Contacts/Locations and Study Status
560      561 NaN NaN  December 23, 2019   Contacts/Locations and Study Status
561      562 NaN NaN  December 25, 2019   Contacts/Locations and Study Status
562      563 NaN NaN  December 27, 2019   Contacts/Locations and Study Status

[563 rows x 5 columns]
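
Note that the ".." row in the second output is only pandas shortening the printed display; data still holds all 563 rows. If you want every row printed, here is a small sketch, assuming a made-up 100-row DataFrame in place of the real table:

```python
import pandas as pd

# Hypothetical 100-row table standing in for the scraped one.
df = pd.DataFrame({"Version": range(1, 101)})

# Option 1: render every row to one string.
full_text = df.to_string()
print(len(full_text.splitlines()))  # 101: one header line plus 100 rows

# Option 2: lift the row limit only inside this block.
with pd.option_context("display.max_rows", None):
    print(df)
```

option_context restores the previous display setting when the block exits, so other prints stay compact.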