Scrape JS generated content with Scrapy and Python

Question

There is a web page which is partially generated with JS: https://www.ncbi.nlm.nih.gov/genome/genomes/971

I want to scrape the links in the FTP column. All of them are generated by JavaScript.

By default, Scrapy fetches only the raw HTML and does not execute JavaScript. How can I change that?

Try using Scrapy with Splash, or Scrapy with Selenium.

– Abhijeetk431, Commented Jan 16, 2018 at 13:42

Tomáš Linhart · Accepted Answer · 2018-01-16 13:47:27Z


If you are about to scrape a page that generates its content dynamically, the first thing to do is to look for an API that the page calls. In your browser's developer tools, look for XHR requests in the Network tab. For the page you refer to, I can see a request to

https://www.ncbi.nlm.nih.gov/genomes/Genome2BE/genome2srv.cgi?action=GetGenomes4Grid&genome_id=971&genome_assembly_id=&king=Bacteria&mode=2&flags=1&page=1&pageSize=100

If you look in the response, you'll see that it contains the links that are under the FTP column on the page. You can simply use this API to get the information you need.
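As a rough sketch of that approach, the helpers below build the API URL for a given page and pull FTP links out of a response body. The endpoint and query parameters are copied from the request above. The helper names `build_api_url` and `extract_ftp_links` are hypothetical, and the regular expression assumes the links appear in the response as plain `ftp://` URLs, which you should check against the real response before relying on it.

```python
import re
from urllib.parse import urlencode

# Endpoint and query parameters copied from the XHR request quoted above.
API_ENDPOINT = "https://www.ncbi.nlm.nih.gov/genomes/Genome2BE/genome2srv.cgi"
BASE_PARAMS = {
    "action": "GetGenomes4Grid",
    "genome_id": "971",
    "genome_assembly_id": "",
    "king": "Bacteria",
    "mode": "2",
    "flags": "1",
    "pageSize": "100",
}

# Assumption: FTP links show up in the response body as plain ftp:// URLs.
FTP_LINK_RE = re.compile(r"ftp://[^\s\"'<>\\]+")


def build_api_url(page=1):
    """Return the API URL for one page of results (hypothetical helper)."""
    params = dict(BASE_PARAMS, page=str(page))
    return API_ENDPOINT + "?" + urlencode(params)


def extract_ftp_links(body):
    """Return unique ftp:// URLs found in body, in order of first appearance."""
    return list(dict.fromkeys(FTP_LINK_RE.findall(body)))
```

In a spider, you could yield `scrapy.Request(build_api_url(page), callback=self.parse_api)` and call `extract_ftp_links(response.text)` inside that callback, so no JavaScript rendering is needed.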

If you really want to render the page and scrape it, I suggest you use Splash. The best way to integrate it with Scrapy is the scrapy-splash library.
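If you go the Splash route, the project's `settings.py` needs a few additions. The fragment below follows the configuration described in the scrapy-splash README. It assumes a Splash instance is already running and reachable at `http://localhost:8050`; that address is an assumption, so adjust it to your setup, and check the README for the version you install in case the recommended values have changed.

```python
# settings.py fragment for scrapy-splash.
# Assumption: a Splash instance is listening on localhost:8050.
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep compression handling after it.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Deduplicate Splash arguments to save space in the request queue.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

With these settings in place, a spider can yield `SplashRequest(url, self.parse, args={"wait": 0.5})` (imported from `scrapy_splash`) instead of a plain `scrapy.Request`, and the callback then receives the rendered HTML. The `wait` value here is only an illustrative guess at how long the page needs to finish loading.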
