
There is a web page which is partially generated with JS: https://www.ncbi.nlm.nih.gov/genome/genomes/971

I want to scrape the links in FTP column. All of them are JS generated.

By default, scrapy gets only HTML without executing JS. How can I change it?

  • Try using scrapy with splash or scrapy with selenium... Commented Jan 16, 2018 at 13:42

1 Answer


If you need to scrape a page that generates its content dynamically, first check whether the page calls an API. In your browser's developer tools, look for XHR requests in the Network tab. For the page you refer to, I can see a request to

https://www.ncbi.nlm.nih.gov/genomes/Genome2BE/genome2srv.cgi?action=GetGenomes4Grid&genome_id=971&genome_assembly_id=&king=Bacteria&mode=2&flags=1&page=1&pageSize=100

If you look in the response, you'll see that it contains the links that are under the FTP column on the page. You can simply use this API to get the information you need.
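As a rough illustration of calling that endpoint directly, here is a minimal sketch using only the Python standard library. The helper names (`build_url`, `extract_ftp_links`, `fetch_ftp_links`) are hypothetical, and the regular expression assumes the links appear in the response body as plain `ftp://` URLs; I have not verified the exact response format, so adjust the pattern after inspecting a real response.

```python
import re
import urllib.request
from urllib.parse import urlencode

# Endpoint taken from the XHR request seen in the browser's Network tab.
ENDPOINT = "https://www.ncbi.nlm.nih.gov/genomes/Genome2BE/genome2srv.cgi"

# Assumption: FTP links show up in the response as plain ftp:// URLs.
FTP_LINK = re.compile(r"""ftp://[^\s"'<>]+""")


def build_url(genome_id, page=1, page_size=100):
    """Rebuild the query string observed in the browser for one result page."""
    params = {
        "action": "GetGenomes4Grid",
        "genome_id": genome_id,
        "genome_assembly_id": "",
        "king": "Bacteria",
        "mode": 2,
        "flags": 1,
        "page": page,
        "pageSize": page_size,
    }
    return ENDPOINT + "?" + urlencode(params)


def extract_ftp_links(text):
    """Return the unique ftp:// URLs found in text, in order of appearance."""
    return list(dict.fromkeys(FTP_LINK.findall(text)))


def fetch_ftp_links(genome_id, page=1, page_size=100):
    """Download one page of results and pull the FTP links out of it."""
    with urllib.request.urlopen(build_url(genome_id, page, page_size)) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return extract_ftp_links(text)
```

In a Scrapy spider the same idea applies: put `build_url(971)` in `start_urls` and call `extract_ftp_links(response.text)` in the `parse` callback. To collect more than the first 100 rows, increment `page` until a page yields no links.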

If you really want to render the page and scrape it, I suggest you use Splash. The best way to integrate it with Scrapy is the scrapy-splash library.
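For reference, this is roughly what the Scrapy `settings.py` additions for scrapy-splash look like, following the library's README as I remember it. Treat it as a sketch: the `SPLASH_URL` value assumes a Splash instance running locally on port 8050, and the middleware names and priorities may differ between versions.

```python
# settings.py additions for scrapy-splash (values as given in its README).

# Assumption: a Splash instance is running locally on its default port.
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep cookies working across renders.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoid storing duplicate Splash arguments in the request queue.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

With these settings in place, a spider yields `SplashRequest(url, self.parse, args={"wait": 0.5})` (imported from `scrapy_splash`) instead of a plain `scrapy.Request`, and the callback receives the HTML after the JavaScript has run. The `wait` value is a guess; the page may need longer to finish loading its table.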
