
for link in data_links:
    driver.get(link)

    review_dict = {}
    # get the size of the company
    size = driver.find_element_by_xpath('//*[@id="EmpBasicInfo"]//span')

    # location = ??? need to get this part as well.

My concern:

I am trying to scrape a website with Selenium and Python. I want to extract "501 to 1000 employees" and "Biotech & Pharmaceuticals" from the span elements, but I am not able to get the text out of the element located by the XPath. I have tried getText, get attribute, everything. Please help!

This is the output for each iteration: I am not getting the text value.

Thank you in advance!

  • 1. What text are you expecting to get? 2. Please post the code as text and not an image; it helps everyone who is trying to help. Commented Jul 29, 2017 at 21:35
  • Thank you for the prompt response. I am trying to get "501 to 1000 employees" and "Biotech & Pharmaceuticals" from the span. Commented Jul 29, 2017 at 21:37
  • If you know you want what comes after the Size label, it's not that hard using bs4's find(). Commented Jul 29, 2017 at 21:48
  • I am trying to use Selenium all the way, since some of the stuff that I want to scrape is loaded via AJAX. Commented Jul 29, 2017 at 21:53
  • You can still get the current page source with html = driver.page_source Commented Jul 29, 2017 at 21:56

1 Answer


It seems you only want the text rather than to interact with an element. One solution is to let Selenium render the page built by JavaScript and then have BeautifulSoup parse the resulting HTML. First get the HTML content with html = driver.page_source, and then you can do something like:

from bs4 import BeautifulSoup

html = '''
<div id="CompanyContainer">
<div id="EmpBasicInfo">
<div class="">
<div class="infoEntity"></div>
<div class="infoEntity">
<label>Industry</label>
<span class="value">Woodcliff</span>
</div>
<div class="infoEntity">
<label>Size</label>
<span class="value">501 to 1000 employees</span>
</div>
</div>
</div>
</div>
'''  # Just a sample, since I don't have the actual page to interact with.
soup = BeautifulSoup(html, 'html.parser')
>>> soup.find("div", {"id":"EmpBasicInfo"}).findAll("div", {"class":"infoEntity"})[2].find("span").text
'501 to 1000 employees'

Or, to avoid the hard-coded index, look for the <label>Size</label> entry directly, which is more readable:

>>> [a.span.text for a in soup.findAll("div", {"class":"infoEntity"}) if (a.label and a.label.text == 'Size')]
['501 to 1000 employees']

Using Selenium alone, you can do:

>>> driver.find_element_by_xpath("//*[@id='EmpBasicInfo']/div[1]/div/div[3]/span").text
'501 to 1000 employees'
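
The positional XPath above depends on the Size entry being the third div. A label-based XPath is less brittle, and the same expression can also fetch the Industry value. The following is only a sketch under assumptions: value_xpath and read_value are hypothetical helper names, the markup is the sample from this answer rather than the real page, and the standard library's ElementTree is used just to check the expression without a browser. read_value uses the find_element(By.XPATH, ...) form, since newer Selenium releases no longer ship the find_element_by_* helpers.

```python
import xml.etree.ElementTree as ET

# Sample markup copied from the answer above; the real page may differ.
SAMPLE = """
<div id="CompanyContainer">
<div id="EmpBasicInfo">
<div class="">
<div class="infoEntity"></div>
<div class="infoEntity">
<label>Industry</label>
<span class="value">Woodcliff</span>
</div>
<div class="infoEntity">
<label>Size</label>
<span class="value">501 to 1000 employees</span>
</div>
</div>
</div>
</div>
"""


def value_xpath(label):
    """Return an XPath for the <span> inside the div whose <label> equals `label`."""
    return f".//div[@id='EmpBasicInfo']//div[label='{label}']/span"


def read_value(driver, label):
    """Hypothetical helper: read one value from the live page with Selenium."""
    # Imported lazily so the browser-free check below runs without selenium.
    from selenium.webdriver.common.by import By
    return driver.find_element(By.XPATH, value_xpath(label)).text


# Browser-free check of the expression against the sample markup.
root = ET.fromstring(SAMPLE)
size = root.find(value_xpath("Size")).text
industry = root.find(value_xpath("Industry")).text
print(size)
print(industry)
```

With a live driver, read_value(driver, "Size") and read_value(driver, "Industry") would be the equivalent calls. If the block is filled in by AJAX, wrapping the lookup in a WebDriverWait may be needed before reading the text.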

3 Comments

I want to use selenium for the whole project instead of using soup. The website has some heavy ajax properties and I need to extract most of my information from that part. Thank you for all your help!
Thank you so much for your prompt reply and your patience. I used your selenium version, it's working.
I'm glad to help! Please remember to accept the answer if it was helpful, it's an overall good for the community.
