I am trying to get a name from a public LinkedIn URL using the requests library in Python 2.7.

The code used to work fine.

import requests
from bs4 import BeautifulSoup

url = "https://www.linkedin.com/in/linustorvalds"
html = requests.get(url).content

link = BeautifulSoup(html).title.text.split("|")[0].replace(" ","")
print link

The desired output is:

linustorvalds

I am getting the following error message:

AttributeError: 'NoneType' object has no attribute 'text'

The issue seems to be that html does not contain the real content of the page, so BeautifulSoup finds no title element and .title is None. This is the result of printing html:

<html><head>
<script type="text/javascript">
window.onload = function() {
  var newLocation = "";
  if (window.location.protocol == "http:") {
    var cookies = document.cookie.split("; ");
    for (var i = 0; i < cookies.length; ++i) {
      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {
        newLocation = "https:" + window.location.href.substring(window.location.protocol.length);
      }
    }
  }

  if (newLocation.length == 0) {
    var domain = location.host;
    var newDomainIndex = 0;
    if (domain.substr(0, 6) == "touch.") {
      newDomainIndex = 6;
    }
    else if (domain.substr(0, 7) == "tablet.") {
      newDomainIndex = 7;
    }
    if (newDomainIndex) {
      domain = domain.substr(newDomainIndex);
    }
    newLocation = "https://" + domain +  "/uas/login?trk=sentinel_org_block&session_redirect=" + encodeURIComponent(window.location)
  }
  window.location.href = newLocation;
}
</script>
</head></html>

Am I being blocked? How can I make this code work as it did before?

Thanks a lot!

  • The JavaScript there is trying to redirect the user -- window.location.href = newLocation. You probably need to follow that redirect. Commented Apr 19, 2015 at 13:54

1 Answer

Try setting a User-Agent header:

html = requests.get(url, headers={"User-Agent": "Requests"}).content
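
A slightly more defensive version of the same idea, as a rough sketch: it sends the User-Agent header and checks that a title actually came back before calling string methods on it, so a blocked request fails with a readable message instead of the AttributeError. The function names, the timeout value, and the explicit "html.parser" argument are my own choices for illustration, and whether any particular User-Agent keeps working is up to LinkedIn.

```python
import requests
from bs4 import BeautifulSoup


def extract_name(html):
    """Return the name from the page <title>, or None if no usable title exists."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None or not soup.title.string:
        # This is the case from the question: the redirect stub has no <title>.
        return None
    # Same transformation as the question: text before "|" with spaces removed.
    return soup.title.string.split("|")[0].replace(" ", "")


def fetch_name(url, user_agent="Requests"):
    """Fetch url with an explicit User-Agent header and extract the name.

    The default user_agent is the value from the answer above; the timeout
    is an arbitrary illustrative value.
    """
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    response.raise_for_status()
    name = extract_name(response.text)
    if name is None:
        raise RuntimeError(
            "No <title> in the response; the site probably returned its "
            "login redirect page instead of the profile."
        )
    return name
```

Calling fetch_name("https://www.linkedin.com/in/linustorvalds") would then either return the name or raise with a hint about what went wrong. Keeping extract_name separate makes it possible to test the parsing against saved HTML without touching the network.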

4 Comments

Works like a champ!! Thanks a lot!!
Why do we need to set headers?
@AvinashRaj I do not know. We will probably have to ask LinkedIn :D
@AvinashRaj Because LinkedIn has an API specifically for retrieving data from their service, and they do everything they can to prevent people from scraping the HTML version of their site. You will keep running into this until you start using the API, because they will keep updating their anti-scraping measures and this workaround will only work for so long.
