I am trying to scrape this website, but I keep getting an error when I try to print just the contents of the table.

import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://clinicaltrials.gov/show/NCT01718158').read())

print soup('table')[6].prettify()


for row in soup('table')[6].findAll('tr'):
    tds = row('td')
    print tds[0].string,tds[1].string

IndexError                                Traceback (most recent call last)
<ipython-input-70-da84e74ab3b1> in <module>()
  1 for row in soup('table')[6].findAll('tr'):
  2     tds = row('td')
  3     print tds[0].string,tds[1].string
  4 

IndexError: list index out of range

1 Answer

The table has a header row, with <th> header elements rather than <td> cells. Your code assumes every row contains at least two <td> elements, and that assumption fails for the first row: row('td') returns an empty list there, so tds[0] raises the IndexError.
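
The failure can be reproduced without fetching the page. The following is a minimal sketch, assuming Python 3 and bs4 with the built-in html.parser backend (the question itself uses Python 2). The HTML snippet is made up for illustration and is not the real page's markup:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table: a <th> header row followed by one <td> data row.
html = """
<table>
  <tr><th>Field</th><th>Value</th></tr>
  <tr><td>Responsible Party:</td><td>Bristol-Myers Squibb</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

# Count the <td> cells in each row. The header row has none,
# so indexing tds[0] on it would raise IndexError.
td_counts = [len(row.find_all("td")) for row in rows]
print(td_counts)
```

Any row whose count is below 2 is one that the original loop cannot index safely.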

You could skip rows that don't have enough <td> elements:

for row in soup('table')[6].findAll('tr'):
    tds = row('td')
    if len(tds) < 2:
        continue
    print tds[0].string, tds[1].string

With that check in place, you get this output:

>>> for row in soup('table')[6].findAll('tr'):
...     tds = row('td')
...     if len(tds) < 2:
...         continue
...     print tds[0].string, tds[1].string
... 
Responsible Party: Bristol-Myers Squibb
ClinicalTrials.gov Identifier: None
Other Study ID Numbers: AI452-021, 2011‐005409‐65
Study First Received: October 29, 2012
Last Updated: November 7, 2014
Health Authority: None

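Two rows print None because .string returns None when a tag has more than one child. The last row, for example, contains text interspersed with <br/> elements. You could use the element.strings generator to extract all the strings and join them with newlines; I'd drop the whitespace-only strings first, though (filter(unicode.strip, ...) keeps only the strings that are non-empty after stripping):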

>>> for row in soup('table')[6].findAll('tr'):
...     tds = row('td')
...     if len(tds) < 2:
...         continue
...     print tds[0].string, '\n'.join(filter(unicode.strip, tds[1].strings))
... 
Responsible Party: Bristol-Myers Squibb
ClinicalTrials.gov Identifier: NCT01718158
History of Changes
Other Study ID Numbers: AI452-021, 2011‐005409‐65
Study First Received: October 29, 2012
Last Updated: November 7, 2014
Health Authority: United States: Institutional Review Board
United States: Food and Drug Administration
Argentina: Administracion Nacional de Medicamentos, Alimentos y Tecnologia Medica
France: Afssaps - Agence française de sécurité sanitaire des produits de santé (Saint-Denis)
Germany: Federal Institute for Drugs and Medical Devices
Germany: Ministry of Health
Israel: Israeli Health Ministry Pharmaceutical Administration
Israel: Ministry of Health
Italy: Ministry of Health
Italy: National Bioethics Committee
Italy: National Institute of Health
Italy: National Monitoring Centre for Clinical Trials - Ministry of Health
Italy: The Italian Medicines Agency
Japan: Pharmaceuticals and Medical Devices Agency
Japan: Ministry of Health, Labor and Welfare
Korea: Food and Drug Administration
Poland: National Institute of Medicines
Poland: Ministry of Health
Poland: Ministry of Science and Higher Education
Poland: Office for Registration of Medicinal Products, Medical Devices and Biocidal Products
Russia: FSI Scientific Center of Expertise of Medical Application
Russia: Ethics Committee
Russia: Ministry of Health of the Russian Federation
Spain: Spanish Agency of Medicines
Taiwan: Department of Health
Taiwan: National Bureau of Controlled Drugs
United Kingdom: Medicines and Healthcare Products Regulatory Agency
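
For readers on Python 3, where unicode.strip is not available, bs4 offers the stripped_strings generator and get_text() with a separator, which cover the same ground. This is a sketch under those assumptions, using a made-up cell that only mimics the shape of the last row:

```python
from bs4 import BeautifulSoup

# Hypothetical cell: several text nodes separated by <br/> tags.
html = (
    "<table><tr><td> United States: Institutional Review Board <br/>"
    " Italy: Ministry of Health <br/></td></tr></table>"
)
cell = BeautifulSoup(html, "html.parser").td

# .string is None because the cell has more than one child node.
print(cell.string)

# stripped_strings yields each text node with surrounding whitespace
# removed and skips nodes that are empty after stripping.
lines = list(cell.stripped_strings)
print("\n".join(lines))

# get_text() with a separator and strip=True builds the same text in one call.
joined = cell.get_text("\n", strip=True)
```

Unlike filtering with a strip predicate, stripped_strings also trims the strings it keeps, so no stray whitespace ends up in the joined text.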

2 Comments

MARTIJN! YOU ARE A "gOD" WITH A SMALL "g". Thanks a Million.
Martijn, how do I get the content of the table just before 'Show 77 Study Locations' on the page clinicaltrials.gov/show/NCT01718158? It is a table, but I don't know why I can't locate it with table[variable]. Thanks.