
I have an assignment to extract some items from each row of a table in HTML. I have figured out how to grab the whole table from the web using Selenium with Python. Following is the code for that:

from selenium import webdriver
import time 
import pandas as pd

mydriver = webdriver.Chrome('C:/Program Files/chromedriver.exe')
mydriver.get("https://www.bseindia.com/corporates/ann.aspx?expandable=0")

time.sleep(5) # wait 5 seconds for the DOM to load completely
table = mydriver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table/tbody')

for row in table.find_elements_by_xpath('./tr'):
    print(row.text)

I am unable to understand the way I can grab specific items from the table itself. Following are the items that I require:

  1. Company Name

  2. PDF Link(if it does not exist, write "No PDF Link")

  3. Received Time

  4. Disseminated Time

  5. Time Taken

  6. Description

Any help in logic would be helpful. Thanks in Advance.

  • Unless it is a requirement to use Selenium, you can parse all the data using only BeautifulSoup, as the data is not dynamically loaded with Javascript. Commented Jun 19, 2018 at 6:30
  • For example, table = soup.find('table', attrs={'cellpadding':"4", 'cellspacing':"1", 'width':"100%", 'border':"0"}) will get the entire table; then get each row in the table with table.find_all('tr'). Within a row, row.find('td', attrs={'class': "TTHeadergrey"}) will get items 1 and 3–6, and row.find('a', attrs={'class':"tablebluelink"})['href'] will get the PDF link. Commented Jun 19, 2018 at 6:31
  • How do I handle the case where a company announcement has no PDF link? I need to write "No PDF Link" there. I am building a table of all the items I gather, so the PDF links must stay matched with the correct company. Since some companies in the middle will have no PDF link, I don't see how to handle that error and make sure "No PDF Link" ends up beside the right company. Commented Jun 19, 2018 at 6:44
  • I had also tried it the way you suggested, but that gives me extra "No PDF Link" entries in the list. Here is the logic I used:

        s = soup.find_all('td', {'class': 'TTHeadergrey'})
        for item in s:
            if not item.has_attr('style') and not item.has_attr('valign'):
                pdf.append("No PDF Link")
            else:
                for i in item.select('a'):
                    if i.has_attr('href') and '.pdf' in i['href']:
                        pdf.append(i['href'])

    Commented Jun 19, 2018 at 6:57
  • If I want to do this row wise, how would I do that? Commented Jun 19, 2018 at 9:16
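A minimal sketch of the row-wise approach the comments describe, run on a static snippet rather than the live page. The HTML below and the cell layout are illustrative assumptions; only the class names TTHeadergrey and tablebluelink come from the comments above:

```python
from bs4 import BeautifulSoup

# Illustrative HTML mimicking two announcement rows: one with a PDF link, one without.
html = """
<table>
  <tr>
    <td class="TTHeadergrey">ACME LTD - 500001 - Board Meeting</td>
    <td class="TTHeadergrey"><a class="tablebluelink" href="/xml-data/acme.pdf">PDF</a></td>
  </tr>
  <tr>
    <td class="TTHeadergrey">BETA CORP - 500002 - AGM Notice</td>
    <td class="TTHeadergrey"></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for row in soup.find_all("tr"):
    # Working row by row keeps each PDF link (or its absence) paired with the
    # right company, which a flat find_all over all cells does not guarantee.
    name_cell = row.find("td", attrs={"class": "TTHeadergrey"})
    link = row.find("a", attrs={"class": "tablebluelink"})
    records.append({
        "name": name_cell.get_text(strip=True),
        "pdf": link["href"] if link is not None else "No PDF Link",
    })

print(records)
```

Because the missing-link check happens inside each row's iteration, "No PDF Link" is appended exactly once per linkless row, avoiding the extra entries described in the comment above.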

2 Answers

for tr in mydriver.find_elements_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table//tr'):
    tds = tr.find_elements_by_tag_name('td')
    print([td.text for td in tds])

3 Comments

Hello, thank you for trying to answer this question. Can you briefly explain how your solution addresses the OP's problem?
Thank you, sunitha, for your answer. Your suggestion gives me a 2D array of all the text. I have two questions in that regard: 1. How do I format it according to the different items I need? 2. How do I get the PDF links for each company announcement? Also, I need "No PDF Link" if there is no PDF for the announcement.
What if I want to do this row-wise? How would I do that?
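One way to turn the 2D list this answer prints into labeled fields, sketched in plain Python. The four-rows-per-announcement grouping is the same stride the second answer below relies on; the sample strings here are made up for illustration:

```python
# Sample of what the row loop might print: on this page each announcement
# spans several <tr> rows (name row, time row, description row, spacer).
rows = [
    ["ACME LTD - 500001 - Board Meeting"],
    ["Exchange Received Time 19-06-2018 13:05:00"],
    ["Notice of board meeting."],
    [""],
]

announcements = []
for i in range(0, len(rows), 4):  # step over each 4-row group
    announcements.append({
        "name": rows[i][0],
        "times": rows[i + 1][0],
        "description": rows[i + 2][0],
    })

print(announcements)
```

Grouping with a fixed stride only works while the page keeps that row layout; if the layout changes, the offsets need adjusting.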

I went through a rough time getting this to work. I think it works fine now, though it's pretty inefficient. Following is the code:

from selenium import webdriver
import time 
import pandas as pd
from selenium.common.exceptions import NoSuchElementException

mydriver = webdriver.Chrome('C:/Program Files/chromedriver.exe')
mydriver.get("https://www.bseindia.com/corporates/ann.aspx?expandable=0")
time.sleep(5) # wait 5 seconds for the DOM to load completely

trs = mydriver.find_elements_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table/tbody/tr')
del trs[0] # drop the header row

names = []
r_time = []
d_time = []
t_taken = []
desc = []
pdfs = []
codes = []

i = 0
while i < len(trs):
    names.append(trs[i].text)

    l = trs[i].text.split()
    for item in l:
        try:
            code = int(item)
            if code > 100000:
                codes.append(code)
        except ValueError:  # token is not a numeric company code
            pass

    link = trs[i].find_elements_by_tag_name('td')
    pdf_count = 2  # the PDF cell recurs every 4 <td>s in this row
    while pdf_count < len(link):
        try:
            pdf = link[pdf_count].find_element_by_tag_name('a')
            pdfs.append(pdf.get_attribute('href'))
        except NoSuchElementException:
            pdfs.append("No PDF")
        pdf_count = pdf_count + 4

    times = trs[i + 1].text.split()  # renamed from `time` to avoid shadowing the imported time module
    if len(times) == 5:
        r_time.append("No Time Given")
        d_time.append(times[3] + " " + times[4])
        t_taken.append("No Time Given")
    else:
        r_time.append(times[3] + " " + times[4])
        d_time.append(times[8] + " " + times[9])
        t_taken.append(times[12])

    desc.append(trs[i+2].text)

    i = i + 4

df = pd.DataFrame.from_dict({'Name':names,'Description':desc, 'PDF Link' : pdfs,'Company Code' : codes, 'Received Time' : r_time, 'Disseminated Time' : d_time, 'Time Taken' : t_taken})
df.to_excel('corporate.xlsx', header=True, index=False) # write the data to an Excel sheet

Also, I have added another aspect that was asked: I extracted the company code into a separate column as well. That's the result I get.
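The token-index logic above (positions 3/4, 8/9 and 12 of the split time row) can be isolated in a small helper, which makes it easier to test. The sample string here is constructed to match that layout, not copied from the site:

```python
def parse_times(text):
    """Return (received, disseminated, taken) from the combined time row."""
    tokens = text.split()
    if len(tokens) == 5:  # only a dissemination time is present
        return "No Time Given", tokens[3] + " " + tokens[4], "No Time Given"
    return (tokens[3] + " " + tokens[4],
            tokens[8] + " " + tokens[9],
            tokens[12])

# Illustrative input shaped like the row the indices above assume.
sample = ("Exchange Received Time 19-06-2018 13:05:00 "
          "Exchange Disseminated Time 19-06-2018 13:05:05 "
          "Time Taken 00:00:05")
print(parse_times(sample))  # → ('19-06-2018 13:05:00', '19-06-2018 13:05:05', '00:00:05')
```

Splitting on fixed token positions is brittle; if the site ever reorders or rewords the time row, the indices will silently pick up the wrong tokens, so a helper like this is also a good place to add validation.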

