1

I am trying to extract the most recent headlines from the following news site: http://news.sina.com.cn/hotnews/

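import time

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

#save ids of relevant buttons that need to be hovered over on the site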
buttons_ids = ['Tab21' , 'Tab22', 'Tab32']

#save ids of relevant subsections
con_ids = ['Con11']

#start webdriver, go to site, hover over buttons
driver = webdriver.Chrome()
driver.get("http://news.sina.com.cn/hotnews/")
time.sleep(3)
for button_id in buttons_ids:
    button = driver.find_element_by_id(button_id)
    ActionChains(driver).move_to_element(button).perform()

Then I iterate over each section I am interested in and, within each section, over all the headlines, which are rows in an HTML table. However, every iteration returns the first row.

for con_id in con_ids:
    for news_id in range(2,10):
        print(news_id)
        headline = driver.find_element_by_xpath("//div[@id='"+con_id+"']/table/tbody/tr["+str(news_id)+"]")
        text = headline.find_element_by_xpath("//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath("//td[3]/a")
        print(com_no.get_attribute("innerText"))

I also tried the following approach by essentially saving the table as a list and then iterating through the rows:

for con_id in con_ids:
    table = driver.find_elements_by_xpath("//div[@id='"+con_id+"']/table/tbody/tr")
    for headline in table:
        text = headline.find_element_by_xpath("//td[2]/a")
        print(text.get_attribute("innerText"))
        print(text.get_attribute("href"))
        com_no = headline.find_element_by_xpath("//td[3]/a")
        print(com_no.get_attribute("innerText"))

In the second case the loop runs exactly as many times as there are headlines in the section, so it apparently picks up the rows correctly. However, it still returns only the first row on every iteration. Where am I going wrong? I know a similar question has been asked here ("Selenium Python iterate over a table of rows it is stopping at the first row"), but I am still unable to figure out what the problem is.

3 Answers

3

In XPath, a query that begins with // searches from the document root, no matter which element you call it on. So even though you're calling find_element_by_xpath() on the correct container element, the query breaks out of that scope, performs the same global search every time, and yields the same result every time.

To constrain your query to descendants of the current element, begin it with .//, for example:

text = headline.find_element_by_xpath(".//td[2]/a")
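
Applied to the second loop from the question, the fix might look like the sketch below. It is only an illustration under a few assumptions: collect_headlines is a hypothetical helper name, the table layout (link in the second cell, comment count in the third) is taken from the question, and it uses the same find_elements_by_xpath helpers as the question. Newer Selenium releases expose the same lookup as find_elements(By.XPATH, ...).

```python
def collect_headlines(driver, con_id):
    """Return (title, href, comment_count) tuples for the rows of one section.

    `driver` is assumed to be a Selenium WebDriver that has already loaded
    the page; `con_id` is the id of the section's <div>, as in the question.
    """
    # The row lookup is deliberately global: it starts at the document root.
    rows = driver.find_elements_by_xpath(
        "//div[@id='" + con_id + "']/table/tbody/tr"
    )
    results = []
    for row in rows:
        # The leading "." keeps each search inside the current row.
        links = row.find_elements_by_xpath(".//td[2]/a")
        if not links:
            # Skip rows without a headline link (for example a header row).
            continue
        counts = row.find_elements_by_xpath(".//td[3]/a")
        comment_count = counts[0].get_attribute("innerText") if counts else None
        results.append(
            (
                links[0].get_attribute("innerText"),
                links[0].get_attribute("href"),
                comment_count,
            )
        )
    return results
```

Using find_elements (plural) instead of find_element means a row without a matching link yields an empty list rather than raising an exception, so such rows can simply be skipped.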

2 Comments

Thanks, Ian, indeed it works if I begin the query like this. I am making this the accepted answer due to the explanation. But Pradeep's updated code works as well.
That's because he updated it to include a . at the beginning of the query. 😉
1

Try this:

for con_id in con_ids:
    for news_id in range(2,10):
        print(news_id)
        print("(//div[@id='"+con_id+"']/table/tbody/tr)["+str(news_id)+"]")
        headline = driver.find_element_by_xpath("(//div[@id='"+con_id+"']/table/tbody/tr)["+str(news_id)+"]")
        value = headline.find_element_by_xpath(".//td[2]/a")
        print(value.get_attribute("innerText").encode('utf-8'))

I am able to get the headlines with the above code.

3 Comments

Thanks for the suggestion. You said it worked for you? Did you get 10 different headlines? Unfortunately, when I run it, your code produces exactly the same output as mine: it prints the first headline 10 times. Somehow it always selects the first row, even when I explicitly pass it the index of another one.
@Sebastian I have edited my answer. Can you try it now?
@Sebastian I am able to get all 10 headlines with the updated code above. Please have a look at it.
0

I was able to solve it by specifying the entire XPath in one go like this:

headline = driver.find_element_by_xpath("(//*[@id='"+con_id+"']/table/tbody/tr["+str(news_id)+"]/td[2]/a)")
print(headline.get_attribute("innerText"))
print(headline.get_attribute("href"))

rather than splitting it into two parts. My only explanation for why the split version prints the first row repeatedly is that some odd JavaScript on the page prevents proper iteration when the lookup is split, or that my first version had a syntax error I am not aware of. If anyone has a better explanation, I'd be glad to hear it!
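
For reference, the single-query approach can be wrapped in a small helper that builds the full path for a given row. This is just a sketch: headline_link_xpath is a hypothetical name, and the path layout is copied from the expression above.

```python
def headline_link_xpath(con_id, row_index):
    """Build one absolute XPath for the headline link in a given table row.

    XPath positions are 1-based, so row_index=2 selects the second <tr>.
    """
    return "//*[@id='{}']/table/tbody/tr[{}]/td[2]/a".format(con_id, row_index)


# Same row range as in the question: rows 2 through 9.
xpaths = [headline_link_xpath("Con11", news_id) for news_id in range(2, 10)]
```

Each string can then be passed to driver.find_element_by_xpath() as in the snippet above. Because every query is absolute and carries its own row index, no lookup depends on the scope of a previously found element.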
