0

This is the HTML div whose text I want to print:

<div class="gs_a">LR Binford&nbsp;- American antiquity, 1980 - cambridge.org </div>

This is my code so far :

from selenium import webdriver

def Author (SearchVar):
    driver = webdriver.Chrome("/Users/tutau/Downloads/chromedriver")
    driver.get ("https://scholar.google.com/")
    SearchBox = driver.find_element_by_id ("gs_hdr_tsi")
    SearchBox.send_keys(SearchVar)
    SearchBox.submit()
    At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')
    print (At)

Author("dog")

All that comes out when I print is

selenium.webdriver.remote.webelement.WebElement (session="9aa956e2bd51f510dd626f6937b01c0e", element="0.6506218589189958-1")

not the text. I am new to Selenium; any help is appreciated.

4 Comments
  • Possible duplicate of "How to get text with selenium web driver in python". Commented Jun 7, 2018 at 5:07
  • Can you please paste the HTML? The screenshot is not very helpful. Commented Jun 7, 2018 at 5:07
  • You should use driver.find_element_by_css_selector rather than driver.find_elements_by_css_selector, and then print(At.text). Commented Jun 7, 2018 at 5:15
  • You are printing the element with print(At); use print(At.text) instead. Unrelated, but I suggest using requests with BeautifulSoup instead of Selenium. Commented Jun 7, 2018 at 5:43

3 Answers

1

You were almost there. Given the HTML and the code you have shared, you are already locating the desired element; you are just printing the element itself rather than its text.

Explanation

Once the following line of code gets executed:

At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')

At holds a list containing the single WebElement that matches the selector. In your next step, print (At) prints the WebElement object itself rather than its text, which looks as follows:

selenium.webdriver.remote.webelement.WebElement (session="9aa956e2bd51f510dd626f6937b01c0e", element="0.6506218589189958-1")
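To see why the object is printed rather than its text, here is a minimal sketch that uses a stand-in class instead of a real browser session. `FakeElement` is hypothetical and only mimics the `.text` attribute of a WebElement; it is not part of Selenium.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, text):
        self.text = text


# find_elements_* returns a list, even when only one element matches.
At = [FakeElement("LR Binford - American antiquity, 1980 - cambridge.org")]

# Printing the list shows each object's repr, not its text.
print(At)

# Reading .text from each element gives the visible string.
texts = [element.text for element in At]
print(texts[0])
```

The same pattern applies to the real list returned by the driver: iterate over it (or index it) and read `.text` from each element.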

Solution

Now, as per your question, if you want to extract the text LR Binford - American antiquity, 1980 - cambridge.org, you have to read it from the element itself, using either its text property or get_attribute().

So you need to change the line of code from:

print (At)

To either of the following:

  • Using text (At is a list, so take the first match):

    print(At[0].text)
    
  • Using get_attribute(attributeName):

    print(At[0].get_attribute("innerHTML"))
    
  • Your own code with minor adjustments:

    # -*- coding: UTF-8 -*-
    from selenium import webdriver
    
    def Author (SearchVar):
    
        options = webdriver.ChromeOptions() 
        options.add_argument("start-maximized")
        options.add_argument('disable-infobars')
        driver=webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
        driver.get ("https://scholar.google.com/")
        SearchBox = driver.find_element_by_name("q")
        SearchBox.send_keys(SearchVar)
        SearchBox.submit()
        At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')
        for item in At:
            print(item.text)
    
    Author("dog")
    
  • Console Output:

    …, RJ Marles, LS Pellicore, GI Giancaspro, TL Dog - Drug Safety, 2008 - Springer
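The code in this answer targets the Selenium 3 API that was current in 2018. Selenium 4 removed the `find_element_by_*` helpers in favour of `find_element(By..., ...)`, so a rough equivalent might look like the sketch below. The names `element_texts` and `author` are chosen here for illustration, and the bare `webdriver.Chrome()` call assumes a Selenium version (4.6 or later) that locates the driver binary on its own.

```python
CSS_SELECTOR = "#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a"


def element_texts(elements):
    """Return the visible text of every element in the list."""
    return [element.text for element in elements]


def author(search_var):
    """Search Google Scholar and return the text of the matching elements."""
    # Imported lazily so element_texts() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Assumption: Selenium 4.6+ finds a matching chromedriver automatically.
    driver = webdriver.Chrome()
    try:
        driver.get("https://scholar.google.com/")
        search_box = driver.find_element(By.NAME, "q")
        search_box.send_keys(search_var)
        search_box.submit()
        return element_texts(driver.find_elements(By.CSS_SELECTOR, CSS_SELECTOR))
    finally:
        driver.quit()  # always close the browser, even on errors


# Usage (opens a real browser, so it is left commented out):
# for line in author("dog"):
#     print(line)
```

Returning the strings instead of printing inside the function keeps the text extraction separate from the browser handling, which makes the helper easy to check without a live session.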
    
5 Comments

Well, there's an indentation error in the for loop (at the print), and you don't need the 'div' in the CSS selector. Again: this will throw an error if there is Unicode in the element of interest.
I only see this because YOU (among others) helped me with your contributions on SO in the past.
Thanks @sudonym. Keep an eye on my answers from time to time. Your feedback and support always bring out the best in me. Not sure why the indentation doesn't get pasted as it should. However, I corrected it and added a cushion for Unicode as well. But I am not in favour of any change to the OP's approach unless it is absolutely necessary. Essentially, that kills the OP's innovation. Hence I left the css_selector untouched.
Got it. I'll keep monitoring your contributions, trust me on that.
Cheers bro, much appreciated. This code will not work in Atom, however; I had to switch to Visual Studio. It throws a Unicode error.
1

Intro

First, I recommend running the CSS selection against Selenium's page_source with a faster parser.

import lxml
import lxml.html

# put this below SearchBox.submit()

CSS_SELECTOR = '#gs_res_ccl_mid > :nth-child(1) > .gs_ri > .gs_a' # Define css
source = driver.page_source                                       # Get all html
At_raw = lxml.html.document_fromstring(source)                    # Convert
At = At_raw.cssselect(CSS_SELECTOR)                               # Select by CSS

Solution 1

Then, you need to extract the text_content() from your web element and encode it properly.

At = At[0].text_content().encode('utf-8') # cssselect returns a list; take the first match, get its text and encode
print(At)

Solution 2

In case At contains more than one line and non-ASCII characters, you can also strip those characters and drop the empty lines:

import re

At = [re.sub(r'[^\x00-\x7F]+', '', l)                  # strip non-ASCII characters
      for line in At
      for l in line.text_content().strip().splitlines()  # get text, line by line
      if l.strip()]                                    # only keep lines that contain characters
print(At)
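The clean-up step described in Solution 2 can be sketched on its own with only the standard library. This is an illustration under assumptions rather than the answerer's exact code: the helper name `clean_lines` is made up here, and the sample string is the text from the question with its `&nbsp;` written as U+00A0.

```python
import re

NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def clean_lines(text):
    """Split text into lines, drop blank lines, and strip non-ASCII characters."""
    return [NON_ASCII.sub("", line).strip()
            for line in text.splitlines()
            if line.strip()]


# The &nbsp; in the question's HTML becomes U+00A0, which is non-ASCII.
sample = "LR Binford\u00a0- American antiquity, 1980 - cambridge.org\n   \n"
print(clean_lines(sample))
```

Note that stripping non-ASCII characters also removes the non-breaking space, so "Binford" and the following hyphen end up adjacent. If that matters, replacing U+00A0 with a regular space before stripping is one option.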

4 Comments

OP explicitly said that they want to get the output using Selenium in Python, while you suggest using lxml, which looks much more complicated than simply adding the text property...
My proposed solution requires Python and Selenium (driver.page_source). In fact, that is the first sentence of my answer. I suggest using a different PARSER for performance reasons, and I also suggest a way of extracting text that works in all scenarios, not just in some.
If text doesn't work, OP might use get_attribute("textContent"). Also, using a third-party library to extract one text value doesn't seem to bring much efficiency or improvement.
I agree with you. As soon as OP decides to scrape more than one value in the future, my code might help more. I benchmarked this and in essence doubled my throughput per second using Selenium's page_source + lxml compared to vanilla Selenium. In the meantime, let's hope his value does not contain any currency symbols.
0

You are printing the element. Print (At.text) instead of At.

1 Comment

AFAIK this won't work if you are dealing with unicode (currency symbols etc.). Also, this won't remove whitespace-only lines and similar artefacts
