Trying to get text out of a div class using Selenium in Python

Question

The HTML div that contains the data I wish to print:

<div class="gs_a">LR Binford&nbsp;- American antiquity, 1980 - cambridge.org </div>

This is my code so far:

from selenium import webdriver

def Author (SearchVar):
    driver = webdriver.Chrome("/Users/tutau/Downloads/chromedriver")
    driver.get ("https://scholar.google.com/")
    SearchBox = driver.find_element_by_id ("gs_hdr_tsi")
    SearchBox.send_keys(SearchVar)
    SearchBox.submit()
    At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')
    print (At)

Author("dog")

All that comes out when I print is

selenium.webdriver.remote.webelement.WebElement (session="9aa956e2bd51f510dd626f6937b01c0e", element="0.6506218589189958-1")

not the text. I am new to Selenium. Help is appreciated.

Possible duplicate of How to get text with selenium web driver in python
– Andersson, Commented Jun 7, 2018 at 5:07
Can you please paste the HTML? The screenshot is not so helpful.
– Monika, Commented Jun 7, 2018 at 5:07
You should use driver.find_element_by_css_selector rather than driver.find_elements_by_css_selector, and it should be print (At.text).
– yong, Commented Jun 7, 2018 at 5:15
You are printing the element with print(At); use print(At.text) instead. Not related, but I suggest using requests with BeautifulSoup instead of selenium.
– raviraja, Commented Jun 7, 2018 at 5:43

undetected Selenium · Accepted Answer · 2018-06-07 12:46:35Z

1

You were almost there. Given the HTML and the code you have shared, the output you are seeing is what print produces for a WebElement object, not an error.

Explanation

Once the following line of code gets executed:

At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')

At refers to a list of the matching WebElements (a single element in your case). In your next step, print (At) prints the WebElement object itself rather than its text, which looks as follows:

selenium.webdriver.remote.webelement.WebElement (session="9aa956e2bd51f510dd626f6937b01c0e", element="0.6506218589189958-1")

Solution

Now, as per your question, if you want to extract the text LR Binford - American antiquity, 1980 - cambridge.org, you have to use either of the following on the element:

text: Property that returns the visible text of the element.
get_attribute(attributeName): Method that returns the given attribute or property of the element.

So you need to change the line of code from:

print (At)

To either of the following (find_elements_by_css_selector returns a list, so take the first item or loop over the list):

Using text:
```
print(At[0].text)
```
Using get_attribute(attributeName):
```
print(At[0].get_attribute("innerHTML"))
```
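
As a rough sketch of how the two options can be combined: text only returns what the browser actually renders, so it can come back empty for a hidden element, while get_attribute("textContent") (mentioned in the comments on another answer) reads the raw DOM text. The helper name element_text and the FakeElement stand-in below are made up for illustration, so that the snippet runs without a browser; with Selenium you would pass a real WebElement instead.

```python
def element_text(element):
    """Return an element's text, preferring what the browser renders."""
    visible = element.text  # rendered text; empty if the element is hidden
    if visible:
        return visible
    # Fall back to the raw DOM text, which is available even when hidden.
    raw = element.get_attribute("textContent")
    return (raw or "").strip()


class FakeElement:
    """Hypothetical stand-in for a WebElement, used only for this demo."""

    def __init__(self, text, text_content):
        self.text = text
        self._text_content = text_content

    def get_attribute(self, name):
        return self._text_content if name == "textContent" else None


# A hidden element: .text is empty, so the textContent fallback is used.
print(element_text(FakeElement("", "  LR Binford - American antiquity, 1980  ")))
```

With a list from find_elements you would call it per item, for example element_text(At[0]).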

Your own code with minor adjustments:

# -*- coding: UTF-8 -*-
from selenium import webdriver

def Author (SearchVar):

    options = webdriver.ChromeOptions() 
    options.add_argument("start-maximized")
    options.add_argument('disable-infobars')
    driver=webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get ("https://scholar.google.com/")
    SearchBox = driver.find_element_by_name("q")
    SearchBox.send_keys(SearchVar)
    SearchBox.submit()
    At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')
    for item in At:
        print(item.text)

Author("dog")

Console Output:

…, RJ Marles, LS Pellicore, GI Giancaspro, TL Dog - Drug Safety, 2008 - Springer
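
For readers on Selenium 4, where the find_element_by_* helpers used above are no longer available in later releases, the same idea might look roughly like the sketch below. The names RESULT_BYLINE, bylines and author are made up for this example, the selector is copied from the question and may no longer match Google Scholar's markup, and starting Chrome without an explicit driver path assumes a recent Selenium 4 release that can locate a driver on its own.

```python
# Selector copied from the question; it may not match the current page markup.
RESULT_BYLINE = "#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a"


def bylines(driver, selector=RESULT_BYLINE):
    """Return the visible text of every element matching the CSS selector."""
    # "css selector" is the string value of Selenium's By.CSS_SELECTOR constant.
    elements = driver.find_elements("css selector", selector)
    return [element.text for element in elements]


def author(search_term):
    """Search Google Scholar and print the byline of the first result."""
    # Imported here so that bylines() can be tried without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Assumption: a recent Selenium 4 release that finds a Chrome driver itself.
    driver = webdriver.Chrome()
    try:
        driver.get("https://scholar.google.com/")
        search_box = driver.find_element(By.NAME, "q")
        search_box.send_keys(search_term)
        search_box.submit()
        for line in bylines(driver):
            print(line)
    finally:
        driver.quit()  # always close the browser, even if a lookup fails
```

Calling author("dog") would then open the browser, run the search and print each matching byline.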

edited Jun 7, 2018 at 12:46

answered Jun 7, 2018 at 7:51


5 Comments

sudonym Over a year ago

Well there's an indentation error in the for loop (at the print) and you don't need the 'div' in the CSS selector. Again: this will throw an error in case there is Unicode in the element of interest

sudonym Over a year ago

I only see this because YOU (among others) helped me with your contributions on SO in the past

undetected Selenium Over a year ago

Thanks @sudonym. Keep an eye on my answers from time to time. Your feedback and support always bring out the best in me. Not sure why the indentation doesn't get pasted as it should. However, I corrected it and added a cushion for Unicode as well. But I am not in favour of any change to OP's approach unless it is absolutely necessary. Essentially that kills OP's innovation. Hence I left the css_selector untouched.

sudonym Over a year ago

Got it - I'll keep monitoring your contributions, trust me on that

Te Uruti Tau Over a year ago

cheers bro much appreciated, this code will not work in atom however, had to switch to visual studios. throws a unicode error.

sudonym · Accepted Answer · 2018-06-07 06:06:23Z

1

Intro

First, I recommend running the CSS selection against Selenium's page_source with a faster parser such as lxml.

import lxml.html  # cssselect() below also needs the separate cssselect package

# put this below SearchBox.submit()

CSS_SELECTOR = '#gs_res_ccl_mid > :nth-child(1) > .gs_ri > .gs_a' # Define css
source = driver.page_source                                       # Get all html
At_raw = lxml.html.document_fromstring(source)                    # Convert
At = At_raw.cssselect(CSS_SELECTOR)                               # Select by CSS (returns a list)

Solution 1

Then, you need to extract the text_content() from the first matched element. cssselect() returns a list, so index into it first.

At = At[0].text_content() # Get text of the first match
print(At)                 # On Python 2, use: print At.encode('utf-8')

Solution 2

In case At contains more than one line and non-ASCII characters, you can also remove those:

import re

At = [re.sub(r'[^\x00-\x7F]+', '', l).strip()      # remove non-ASCII characters
      for element in At                             # every matched element
      for l in element.text_content().splitlines()  # Get text, line by line
      if l.strip()]                                 # only consider if line contains characters
print(At)
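
The clean-up step in Solution 2 can also be tried on its own, without a browser or lxml. The following is a minimal sketch using only the standard library; the function name clean_lines and the sample string are made up for illustration. Note that on Python 3 strings are already Unicode, so stripping non-ASCII characters is only worth doing if you really need ASCII-only output.

```python
import re

# Matches one or more characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def clean_lines(text):
    """Split text into lines, drop non-ASCII characters and empty lines."""
    stripped = (NON_ASCII.sub("", line).strip() for line in text.splitlines())
    return [line for line in stripped if line]


# Sample input with a non-breaking space, a whitespace-only line and a
# currency symbol, similar to the cases discussed in the comments.
sample = "LR Binford\u00a0- American antiquity, 1980\n   \n\u20ac 5\n"
print(clean_lines(sample))
```

Because the non-breaking space is removed rather than replaced, the words on either side of it end up joined; substitute a plain space instead if you want to keep the gap.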

edited Jun 7, 2018 at 6:06

answered Jun 7, 2018 at 5:11


4 Comments

Andersson Over a year ago

OP explicitly said that he wants to get the output using selenium in python, while you suggest using lxml, which looks much more complicated than simply adding the text property...

sudonym Over a year ago

my proposed solution requires python and selenium (driver.page_source). In fact, that is the first sentence of my answer. I suggest using a different PARSER for performance reasons, and I also suggest a way of text extraction that works in all scenarios, not just in some.

Andersson Over a year ago

If text doesn't work, OP might use get_attribute("textContent"). Also, using a third-party library to extract one text value doesn't seem to bring much efficiency or improvement

sudonym Over a year ago

I agree with you. As soon as OP decides to scrape more than one value in the future, my code might help more. I benchmarked this and in essence doubled my throughput/s using sel's page_source + lxml compared to vanilla selenium. In the meantime, let's hope his value does not contain any currency symbols.

Monika · Accepted Answer · 2018-06-07 05:10:56Z

0

You are printing the element. Print (At.text) instead of At.

answered Jun 7, 2018 at 5:10


1 Comment

sudonym Over a year ago

AFAIK this won't work if you are dealing with unicode (currency symbols etc.). Also, this won't remove whitespace-only lines and similar artefacts
