Python/selenium webscraping

Question

for link in data_links:
    driver.get(link)

    review_dict = {}
    # get the size of company
    size = driver.find_element_by_xpath('//*[@id="EmpBasicInfo"]//span')

    # location = ??? need to get this part as well.

my concern:

I am trying to scrape a website with Selenium and Python. I want the text "501 to 1000 employees" and "Biotech & Pharmaceuticals" from the span elements, but I am not able to extract the text using XPath. I have tried getText, get_attribute, everything. Please help!

This is the output for each iteration: I am not getting the text value.

Thank you in advance!

1. What text are you expecting to get? 2. Please post the code as text and not an image; it helps everyone who is trying to help.
– Vinícius Figueiredo, Commented Jul 29, 2017 at 21:35
Thank you for the prompt response. I am trying to get "501 to 1000 employees" and "Biotech & Pharmaceuticals" from the span.
– Fun-zin, Commented Jul 29, 2017 at 21:37
If you know you want to get what comes after the Size label, it's not that hard using bs4's find().
– Vinícius Figueiredo, Commented Jul 29, 2017 at 21:48
I am trying to use selenium all the way, since some of the stuff that I want to scrape is loaded via ajax.
– Fun-zin, Commented Jul 29, 2017 at 21:53
You can still get the current page source with html = driver.page_source
– Vinícius Figueiredo, Commented Jul 29, 2017 at 21:56

Vinícius Figueiredo · Accepted Answer · 2017-07-29 23:56:01Z

It seems you only want the text rather than to interact with an element. One solution is to let BeautifulSoup parse the HTML while selenium retrieves the page as built by JavaScript: first get the HTML content with html = driver.page_source, and then you can do something like:

from bs4 import BeautifulSoup

html = '''
<div id="CompanyContainer">
<div id="EmpBasicInfo">
<div class="">
<div class="infoEntity"></div>
<div class="infoEntity">
<label>Industry</label>
<span class="value">Woodcliff</span>
</div>
<div class="infoEntity">
<label>Size</label>
<span class="value">501 to 1000 employees</span>
</div>
</div>
</div>
</div>
'''  # Just a sample, since I don't have the actual page to interact with.
soup = BeautifulSoup(html, 'html.parser')
>>> soup.find("div", {"id":"EmpBasicInfo"}).findAll("div", {"class":"infoEntity"})[2].find("span").text
'501 to 1000 employees'

Or, to avoid positional indexing, look for the <label>Size</label> instead, which should be more readable:

>>> [a.span.text for a in soup.findAll("div", {"class":"infoEntity"}) if (a.label and a.label.text == 'Size')]
['501 to 1000 employees']

Using selenium you can do:

>>> driver.find_element_by_xpath("//*[@id='EmpBasicInfo']/div[1]/div/div[3]/span").text
'501 to 1000 employees'
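
A label-based XPath is one way to avoid the positional div[3] index and to pick up the other field as well. The sketch below is only checked against the sample markup from this answer, using the standard library's ElementTree (which supports a limited XPath subset) so it runs without a browser. The helper name value_for_label is hypothetical, and it assumes the live page keeps the same label/span structure as the sample.

```python
import xml.etree.ElementTree as ET

# Sample markup copied from the answer above; the real page may differ.
SAMPLE = """
<div id="CompanyContainer">
  <div id="EmpBasicInfo">
    <div class="">
      <div class="infoEntity"></div>
      <div class="infoEntity">
        <label>Industry</label>
        <span class="value">Woodcliff</span>
      </div>
      <div class="infoEntity">
        <label>Size</label>
        <span class="value">501 to 1000 employees</span>
      </div>
    </div>
  </div>
</div>
"""


def value_for_label(root, label):
    """Return the span text of the block whose <label> equals `label`, or None.

    Hypothetical helper: `label` is interpolated into the XPath as-is,
    so it must not contain quote characters.
    """
    span = root.find(f".//*[@id='EmpBasicInfo']//div[label='{label}']/span")
    return span.text if span is not None else None


root = ET.fromstring(SAMPLE)
size = value_for_label(root, "Size")
industry = value_for_label(root, "Industry")
missing = value_for_label(root, "Founded")  # no such label in the sample
```

With selenium, the same predicate could be passed to the XPath lookup used above, for example "//*[@id='EmpBasicInfo']//div[label='Size']/span", and then read through .text. Whether it matches the live page is untested here. Note that Selenium 4 removed find_element_by_xpath in favour of driver.find_element(By.XPATH, ...).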

edited Jul 29, 2017 at 23:56

answered Jul 29, 2017 at 23:32

Vinícius Figueiredo


3 Comments

Fun-zin Over a year ago

I want to use selenium for the whole project instead of using soup. The website has some heavy ajax properties and I need to extract most of my information from that part. Thank you for all your help!

Fun-zin Over a year ago

Thank you so much for your prompt reply and your patience. I used your selenium version, it's working.

Vinícius Figueiredo Over a year ago

I'm glad to help! Please remember to accept the answer if it was helpful, it's an overall good for the community.
