
I have an assignment to extract some items from each row of a table in HTML. I have figured out how to grab the whole table from the web using Selenium with Python. Following is the code for that:

from selenium import webdriver
import time 
import pandas as pd

mydriver = webdriver.Chrome('C:/Program Files/chromedriver.exe')
mydriver.get("https://www.bseindia.com/corporates/ann.aspx?expandable=0")

time.sleep(5) # wait 5 seconds for the DOM to load completely
table = mydriver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table/tbody')

for row in table.find_elements_by_xpath('./tr'):
    print(row.text)

I am unable to understand the way I can grab specific items from the table itself. Following are the items that I require:

  1. Company Name

  2. PDF Link(if it does not exist, write "No PDF Link")

  3. Received Time

  4. Disseminated Time

  5. Time Taken

  6. Description

Any help in logic would be helpful. Thanks in Advance.

  • Unless it is a requirement to use Selenium, you can parse all the data using only BeautifulSoup, as the data is not dynamically loaded with Javascript. Commented Jun 19, 2018 at 6:30
  • For example, table = soup.find('table', attrs={'cellpadding':"4", 'cellspacing':"1", 'width':"100%", 'border':"0"}) will get the entire table; then get each row in the table with table.find_all('tr'). Within a row, row.find('td', attrs={'class': "TTHeadergrey"}) will get items 1 and 3–6, and row.find('a', attrs={'class':"tablebluelink"})['href'] will get the PDF link. Commented Jun 19, 2018 at 6:31
  • How do I handle the case where a company announcement has no PDF link? I need to write "No PDF Link" there. I am building a table of all the items I gather, so the PDF links must stay matched with the correct company. Since some companies in the middle will have no PDF link, I don't see how to handle that error and make sure "No PDF Link" ends up beside the right company. Commented Jun 19, 2018 at 6:44
  • I had also tried it the way you suggested, but that gives me extra "No PDF Link" entries in the list. Here is the logic I used:

        s = soup.find_all('td', {'class': 'TTHeadergrey'})
        for item in s:
            if not item.has_attr('style') and not item.has_attr('valign'):
                pdf.append("No PDF Link")
            else:
                for i in item.select('a'):
                    if i.has_attr('href') and '.pdf' in i['href']:
                        pdf.append(i['href'])

    Commented Jun 19, 2018 at 6:57
  • If I want to do this row wise, how would I do that? Commented Jun 19, 2018 at 9:16
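A minimal sketch of the row-wise approach the comments describe, run on a static snippet rather than the live page. The HTML below and the cell layout are illustrative assumptions; only the class names TTHeadergrey and tablebluelink come from the comments above:

```python
from bs4 import BeautifulSoup

# Illustrative HTML mimicking two announcement rows: one with a PDF link, one without.
html = """
<table>
  <tr>
    <td class="TTHeadergrey">ACME LTD - 500001 - Board Meeting</td>
    <td class="TTHeadergrey"><a class="tablebluelink" href="/xml-data/acme.pdf">PDF</a></td>
  </tr>
  <tr>
    <td class="TTHeadergrey">BETA CORP - 500002 - AGM Notice</td>
    <td class="TTHeadergrey"></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for row in soup.find_all("tr"):
    # Working row by row keeps each PDF link (or its absence) paired with the
    # right company, which a flat find_all over all cells does not guarantee.
    name_cell = row.find("td", attrs={"class": "TTHeadergrey"})
    link = row.find("a", attrs={"class": "tablebluelink"})
    records.append({
        "name": name_cell.get_text(strip=True),
        "pdf": link["href"] if link is not None else "No PDF Link",
    })

print(records)
```

Because the missing-link check happens inside each row's iteration, "No PDF Link" is appended exactly once per linkless row, avoiding the extra entries described in the comment above.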

2 Answers

for tr in mydriver.find_elements_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table//tr'):
    tds = tr.find_elements_by_tag_name('td')
    print([td.text for td in tds])

3 Comments

Hello, thank you for trying to answer this question. Can you briefly explain how your solution addresses the OP's problem?
Thank you, sunitha, for your answer. Your suggestion gives me a 2D array of all the text. I have two questions in that regard: 1. How do I format it according to the different items I need? 2. How do I get the PDF links for each company announcement? Also, I need "No PDF Link" if there is no PDF for the announcement.
What if I want to do this row-wise? How would I do that?
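One way to turn the 2D list this answer prints into labeled fields, sketched in plain Python. The four-rows-per-announcement grouping is the same stride the second answer below relies on; the sample strings here are made up for illustration:

```python
# Sample of what the row loop might print: on this page each announcement
# spans several <tr> rows (name row, time row, description row, spacer).
rows = [
    ["ACME LTD - 500001 - Board Meeting"],
    ["Exchange Received Time 19-06-2018 13:05:00"],
    ["Notice of board meeting."],
    [""],
]

announcements = []
for i in range(0, len(rows), 4):  # step over each 4-row group
    announcements.append({
        "name": rows[i][0],
        "times": rows[i + 1][0],
        "description": rows[i + 2][0],
    })

print(announcements)
```

Grouping with a fixed stride only works while the page keeps that row layout; if the layout changes, the offsets need adjusting.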

I went through a rough time getting this to work. I think it works fine now, though it's pretty inefficient. Following is the code:

from selenium import webdriver
import time 
import pandas as pd
from selenium.common.exceptions import NoSuchElementException

mydriver = webdriver.Chrome('C:/Program Files/chromedriver.exe')
mydriver.get("https://www.bseindia.com/corporates/ann.aspx?expandable=0")
time.sleep(5) # wait 5 seconds for the DOM to load completely

trs = mydriver.find_elements_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table/tbody/tr')
del trs[0] # drop the header row

names = []
r_time = []
d_time = []
t_taken = []
desc = []
pdfs = []
codes = []

i = 0
while i < len(trs):
    names.append(trs[i].text)

    l = trs[i].text.split()
    for item in l:
        try:
            code = int(item)
            if code > 100000:
                codes.append(code)
        except ValueError:  # token is not a numeric company code
            pass

    link = trs[i].find_elements_by_tag_name('td')
    pdf_count = 2  # the PDF cell recurs every 4 <td>s in this row
    while pdf_count < len(link):
        try:
            pdf = link[pdf_count].find_element_by_tag_name('a')
            pdfs.append(pdf.get_attribute('href'))
        except NoSuchElementException:
            pdfs.append("No PDF")
        pdf_count = pdf_count + 4

    times = trs[i + 1].text.split()  # renamed from `time` to avoid shadowing the imported time module
    if len(times) == 5:
        r_time.append("No Time Given")
        d_time.append(times[3] + " " + times[4])
        t_taken.append("No Time Given")
    else:
        r_time.append(times[3] + " " + times[4])
        d_time.append(times[8] + " " + times[9])
        t_taken.append(times[12])

    desc.append(trs[i+2].text)

    i = i + 4

df = pd.DataFrame.from_dict({'Name':names,'Description':desc, 'PDF Link' : pdfs,'Company Code' : codes, 'Received Time' : r_time, 'Disseminated Time' : d_time, 'Time Taken' : t_taken})
df.to_excel('corporate.xlsx', header=True, index=False) # write the data to an Excel sheet

Also, I have added another aspect that was asked: I extracted the company code into a separate column as well. That's the result I get.
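The token-index logic above (positions 3/4, 8/9 and 12 of the split time row) can be isolated in a small helper, which makes it easier to test. The sample string here is constructed to match that layout, not copied from the site:

```python
def parse_times(text):
    """Return (received, disseminated, taken) from the combined time row."""
    tokens = text.split()
    if len(tokens) == 5:  # only a dissemination time is present
        return "No Time Given", tokens[3] + " " + tokens[4], "No Time Given"
    return (tokens[3] + " " + tokens[4],
            tokens[8] + " " + tokens[9],
            tokens[12])

# Illustrative input shaped like the row the indices above assume.
sample = ("Exchange Received Time 19-06-2018 13:05:00 "
          "Exchange Disseminated Time 19-06-2018 13:05:05 "
          "Time Taken 00:00:05")
print(parse_times(sample))  # → ('19-06-2018 13:05:00', '19-06-2018 13:05:05', '00:00:05')
```

Splitting on fixed token positions is brittle; if the site ever reorders or rewords the time row, the indices will silently pick up the wrong tokens, so a helper like this is also a good place to add validation.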

