
I am trying to scrape the following website: https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0

The text I want to get is:

Showing 114,877 results

The HTML code:

<div class="jobs-search-results__count-sort pt3">
            <div class="jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4">
                Showing 114,877 results
            </div>

My Python code is:

from bs4 import BeautifulSoup
from selenium import webdriver

index_url = 'https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0'

java = '!function(i,n){void 0!==i.addEventListener&&void 0!==i.hidden&&(n.liVisibilityChangeListener=function(){i.hidden&&(n.liHasWindowHidden=!0)},i.addEventListener("visibilitychange",n.liVisibilityChangeListener))}(document,window);'
browser = webdriver.PhantomJS()
browser.get(index_url)
browser.execute_script(java)
soup = BeautifulSoup(browser.page_source, "html.parser")
link = "jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4"
div = soup.find('div', {"class":link})
text = div.text

So far my code is not working. I think it has something to do with the execution of the JavaScript.

I get the following error:


AttributeError                            Traceback (most recent call last)
<ipython-input-33-7cdc1c4e0894> in <module>()
      6 link = "jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4"
      7 div = soup.find('div', {"class":link})
----> 8 text = div.text

AttributeError: 'NoneType' object has no attribute 'text'
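
The traceback means `soup.find` returned `None`: no matching `div` was present in the HTML that was parsed. Below is a minimal sketch of a guard against that, assuming BeautifulSoup 4 is installed. The helper name `extract_count_text` and the choice to match on the single class `results-count-string` (taken from the HTML snippet in the question) are illustrative, not part of the original code.

```python
from bs4 import BeautifulSoup

def extract_count_text(html):
    """Return the results-count text, or None if the element is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Match on one class from the snippet rather than the full class string.
    div = soup.select_one("div.results-count-string")
    if div is None:
        return None
    return div.get_text(strip=True)

# HTML modelled on the snippet in the question.
sample = '''<div class="jobs-search-results__count-sort pt3">
  <div class="jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4">
    Showing 114,877 results
  </div>
</div>'''

print(extract_count_text(sample))
print(extract_count_text("<html><head></head><body></body></html>"))
```

With the sample HTML the helper returns the text; with an empty page it returns `None` instead of raising, so the caller can decide what to do when the element is missing.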

soup output:

<html><head>\n<script type="text/javascript">\nwindow.onload = function() {\n  // Parse the tracking code from cookies.\n  var trk = "bf";\n  var trkInfo = "bf";\n  var cookies = document.cookie.split("; ");\n  for (var i = 0; i < cookies.length; ++i) {\n    if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {\n      trk = cookies[i].substring(8);\n    }\n    else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {\n      trkInfo = cookies[i].substring(8);\n    }\n  }\n\n  if (window.location.protocol == "http:") {\n    // If "sl" cookie is set, redirect to https.\n    for (var i = 0; i < cookies.length; ++i) {\n      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {\n        window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);\n        return;\n      }\n    }\n  }\n\n  // Get the new domain. For international domains such as\n  // fr.linkedin.com, we convert it to www.linkedin.com\n  var domain = "www.linkedin.com";\n  if (domain != location.host) {\n    var subdomainIndex = location.host.indexOf(".linkedin");\n    if (subdomainIndex != -1) {\n      domain = "www" + location.host.substring(subdomainIndex);\n    }\n  }\n\n  window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +\n      "&originalReferer=" + document.referrer.substr(0, 200) +\n      "&sessionRedirect=" + encodeURIComponent(window.location.href);\n}\n</script>\n</head><body></body></html>
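
The page source above contains no job results, only a script that redirects to `/authwall`. That suggests the session was sent to a login wall before any results were rendered. Here is a rough heuristic for detecting that case, assuming the redirect script keeps mentioning `/authwall`; the function name `looks_like_authwall` is hypothetical.

```python
def looks_like_authwall(page_source):
    """Heuristic: True if the HTML refers to the /authwall redirect."""
    return "/authwall" in page_source

# Shortened stand-in for the soup output shown in the question.
redirect_page = (
    '<html><head><script type="text/javascript">'
    'window.location.href = "https://" + domain + "/authwall?trk=" + trk;'
    "</script></head><body></body></html>"
)

print(looks_like_authwall(redirect_page))
print(looks_like_authwall("<div>Showing 114,877 results</div>"))
```

Running this check on `browser.page_source` before parsing would separate "the selector is wrong" from "the page never loaded the results".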
  • Curiously, when the page is loaded with the Chrome webdriver, the text in question is inside div = soup.find('div', {"class":"result-context"}). With PhantomJS, the request may be landing on a login modal dialog instead. Commented Aug 2, 2017 at 4:18

1 Answer


My solution uses webdriver.Chrome, because I have never used PhantomJS. There are two cases when getting the results text: either you are logged in to LinkedIn in the driver instance, or you are not.

Suppose you are not logged in. The following code gets the text:

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
url = 'https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
text = soup.find('div',{'class':'results-context'}).text
print(text)

Suppose you are logged in:

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
url = 'https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')

# "class" is a reserved word in Python, so use a different variable name.
class_name = 'jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4'
text = soup.find('div',{'class':class_name}).text.split('\n')[1].lstrip()
print(text)
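
If the number itself is needed rather than the whole sentence, it can be pulled out of the extracted text. This is a small sketch using only the standard library, assuming the text keeps the "Showing 114,877 results" form with comma thousands separators; `parse_result_count` is an illustrative name.

```python
import re

def parse_result_count(text):
    """Return the first comma-grouped integer in text, or None if there is none."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return int(match.group().replace(",", ""))

print(parse_result_count("Showing 114,877 results"))  # 114877
```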