
Let's say I want to scrape this page: https://twitter.com/nfl

from bs4 import BeautifulSoup
import requests

page = 'https://twitter.com/nfl'
r = requests.get(page)
soup = BeautifulSoup(r.text, 'html.parser')  # specify a parser to avoid a warning
print(soup)

The more I scroll down on the page, the more results show up. But the request above only gives me the initial load. How do I get all the information on the page, as if I had manually scrolled down?

  • Hi, I am in a similar situation to yours; my recommendation is to learn a little bit of JS (that is what I am doing right now). You can actually call the JS file with the appropriate parameters to make it output the data directly to a file (most likely JSON). But since I am still learning, I can't provide a better solution. Correct me if I am wrong. The case I am working on is stocktwits.com/symbol/aapl. I hope it helps you a bit. Commented Mar 28, 2016 at 8:06

4 Answers


First, parse the data-max-id="451819302057164799" value from the HTML source.

Then, using the ID 451819302057164799, construct a URL like the one below:

https://twitter.com/i/profiles/show/nfl/timeline?include_available_features=1&include_entities=1&max_id=451819302057164799

Now fetch that link and parse the response with the json module, simplejson, or any other JSON library.

Remember, the next page load (what you get when you scroll down) is available from the "max_id":"451369755908530175" value in that JSON.
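A minimal sketch of the steps above, assuming the data-max-id attribute appears in the profile HTML as described (the sample snippet below is hypothetical, and the endpoint and its parameters are taken from this answer and may have changed since):

```python
import re

# Hypothetical snippet of the profile HTML containing the data-max-id attribute
html = '<div class="stream-container" data-max-id="451819302057164799"></div>'

# Step 1: parse the data-max-id value out of the HTML source
match = re.search(r'data-max-id="(\d+)"', html)
max_id = match.group(1)

# Step 2: construct the timeline URL using that id
url = ('https://twitter.com/i/profiles/show/nfl/timeline'
       '?include_available_features=1&include_entities=1'
       '&max_id=' + max_id)
print(url)
```

Fetching that URL, decoding the body with a JSON library, and repeating the request with the new "max_id" value from each response would page through the timeline the same way scrolling does.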


4 Comments

Is https://twitter.com/i/profiles/show/nfl/timeline?include_available_features=1&include_entities=1&max_id=451819302057164799 a generic solution for all Twitter pages? How do you know how to construct that specific URL?
@jason_cant_code It can be. I didn't check. Maybe nfl is the key for different pages.
I don't think this works. I'm getting a much shorter page than expected.
Use a requests Session to ensure that you keep your session alive across every GET.

A better solution is to use the Twitter API.

There are several Python Twitter API client libraries, for example tweepy or python-twitter.

3 Comments

Good suggestion. I'll probably give that a shot if there is no good generic solution.
@jason_cant_code Well, this is definitely the way to go if you need the data from Twitter.
Definitely. If you can use an API, forget scraping. This is your solution.

If the content is dynamically added with JavaScript, your best chance is to use Selenium to control a headless browser such as PhantomJS: use the Selenium WebDriver to simulate the scroll-down, wait for the new content to load, and only then extract the HTML and feed it to your BeautifulSoup parser.
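A sketch of that scroll-and-wait loop, assuming Selenium and a compatible WebDriver are installed; the `scroll_until_stable` helper name is my own, and only `execute_script` and `page_source` from the WebDriver interface are used:

```python
import time

def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any Selenium WebDriver instance.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Simulate the user scrolling to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # wait for the new content to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # no new content appeared; we are done
            break
        last_height = new_height
    return driver.page_source
```

After it returns, the accumulated HTML can go straight into the parser, e.g. `soup = BeautifulSoup(scroll_until_stable(driver), 'html.parser')`.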

3 Comments

I'll try this if there is no better solution.
Yes, Selenium is always an option... but IMO it is not the best one. I prefer to figure out the HTTP traffic between the browser and the server and simulate it using requests, urllib, or whatever... much faster than Selenium.

For dynamically generated content, the data is usually in JSON format. So we have to inspect the page, go to the Network tab, and find the request that returns the data on the fly. For example, on the page https://techolution.app.param.ai/jobs/ the data is generated dynamically; for that I found this link: https://techolution.app.param.ai/api/career/get_job/?query=&locations=&category=&job_types=
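A sketch of that approach, assuming the endpoint above still returns JSON; the "jobs" and "title" keys below are hypothetical, so inspect the real response to find the actual field names:

```python
import json
from urllib.request import urlopen

def fetch_json(url):
    """Fetch a URL discovered in the Network tab and decode its JSON body."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def extract_titles(payload, key="jobs", title_field="title"):
    """Pull a list of titles out of the decoded payload.

    The "jobs"/"title" keys are hypothetical; adjust them to the real response.
    """
    return [item[title_field] for item in payload.get(key, [])]
```

Usage would look like `extract_titles(fetch_json("https://techolution.app.param.ai/api/career/get_job/?query=&locations=&category=&job_types="))` once the real keys are known.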

After that, the web scraping becomes a bit easier; I have done it in Python using Anaconda Navigator. Here is the GitHub link for that: https://github.com/piperaprince01/Webscraping_python/blob/master/WebScraping.ipynb

If you can make any changes to improve it, then feel free to do so. Thank you.

