Requests does not return html anymore - Python

Question

I am trying to get a name from a public LinkedIn URL using the requests library on Python 2.7.

The code used to work fine.

import requests
from bs4 import BeautifulSoup

url = "https://www.linkedin.com/in/linustorvalds"
html = requests.get(url).content

link = BeautifulSoup(html).title.text.split("|")[0].replace(" ","")
print link

The desired output is:

linustorvalds

I am getting the following error message:

AttributeError: 'NoneType' object has no attribute 'text'

The issue seems to be that html does not contain the real content of the page, so BeautifulSoup finds no title element. This is the result of printing html:

<html><head>
<script type="text/javascript">
window.onload = function() {
  var newLocation = "";
  if (window.location.protocol == "http:") {
    var cookies = document.cookie.split("; ");
    for (var i = 0; i < cookies.length; ++i) {
      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {
        newLocation = "https:" + window.location.href.substring(window.location.protocol.length);
      }
    }
  }

  if (newLocation.length == 0) {
    var domain = location.host;
    var newDomainIndex = 0;
    if (domain.substr(0, 6) == "touch.") {
      newDomainIndex = 6;
    }
    else if (domain.substr(0, 7) == "tablet.") {
      newDomainIndex = 7;
    }
    if (newDomainIndex) {
      domain = domain.substr(newDomainIndex);
    }
    newLocation = "https://" + domain +  "/uas/login?trk=sentinel_org_block&session_redirect=" + encodeURIComponent(window.location)
  }
  window.location.href = newLocation;
}
</script>
</head></html>
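The AttributeError comes from calling .text on a title that does not exist: the response above has no title element, so the lookup yields None. The following is a minimal sketch of a guard against that case. It is written for Python 3 and swaps in the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. The names TitleParser and extract_name, and the sample title "Linus Torvalds | LinkedIn", are illustrative assumptions, not taken from the real page.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first-level <title> element, if any."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._seen_title = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self._seen_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # None signals "no <title> at all", mirroring the situation in the question.
        return "".join(self._parts) if self._seen_title else None


def extract_name(html):
    """Return the part of the title before '|' with spaces removed, or None."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    if parser.title is None:
        return None  # e.g. a redirect stub with no title
    return parser.title.split("|")[0].replace(" ", "")


# Hypothetical inputs: a page with a title, and a script-only stub without one.
with_title = "<html><head><title>Linus Torvalds | LinkedIn</title></head></html>"
stub = "<html><head><script>window.location.href = 'x';</script></head></html>"
print(extract_name(with_title))
print(extract_name(stub))
```

Returning None instead of raising lets the caller decide how to react, for example by logging that a redirect stub was served rather than the profile page.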

Am I being blocked? What are the possible suggestions to make this code work as before?

Thanks a lot!

The JavaScript there is trying to redirect the user: window.location.href = newLocation. You probably need to follow that redirect.
– jwilner, Commented Apr 19, 2015 at 13:54

madflow · Accepted Answer · 2015-04-19 14:02:10Z

2

Try setting a User-Agent header:

html = requests.get(url, headers={"User-Agent": "Requests"}).content
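A slightly fuller sketch of the same idea follows, for Python 3. It sends the User-Agent header from the one-liner above, adds a timeout, and fails loudly on HTTP error statuses. The function name fetch_html, the session parameter, and the timeout value are assumptions made for illustration. Whether a given User-Agent value is accepted is up to the site and may change at any time.

```python
DEFAULT_USER_AGENT = "Requests"  # the value used in the one-liner above


def fetch_html(url, session=None, user_agent=DEFAULT_USER_AGENT, timeout=10):
    """Fetch a page with an explicit User-Agent header and return its body as bytes.

    `session` is an illustrative hook: any object with a requests-style
    get(url, headers=..., timeout=...) method works, which also makes the
    function easy to exercise without network access.
    """
    if session is None:
        # Third-party dependency; imported here so it is only required
        # when the caller does not supply a session.
        import requests

        session = requests.Session()

    response = session.get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.content
```

Usage would look like html = fetch_html("https://www.linkedin.com/in/linustorvalds"). It is still worth checking that the returned page actually contains a title before calling .text on it.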

4 Comments

Diego Over a year ago

Works like a champ!! Thanks a lot!!

Avinash Raj Over a year ago

Why do we need to set up headers?

madflow Over a year ago

@AvinashRaj I do not know. We will probably have to ask LinkedIn :D

Ian Stapleton Cordasco Over a year ago

@AvinashRaj because LinkedIn has an API specifically for retrieving data from their service, and they do everything they can to prevent people from scraping the HTML version of their site. You will keep running into this until you start using the API: they will keep updating their anti-scraping efforts, so this workaround will only work for so long.
