Python Scrapy - Scraping data from multiple website URLs

Question

For one of my web projects I need to scrape data from different web sources. To keep it simple, I'll explain with an example.

Let's say I want to scrape data about the mobile phones listed on their manufacturers' sites.

http://www.somebrand1.com/mobiles/ . . http://www.somebrand3.com/phones/

I have a huge list of URLs. Each brand's page has its own HTML layout.

How can I write a single normalized script that traverses the HTML of those listing pages and scrapes the data regardless of the format it is in?

Or do I need to write a separate script for every pattern?

Community · Accepted Answer · 2017-05-23 12:27:26Z

This is called broad crawling and, generally speaking, it is not easy to implement because websites differ in nature, representation, and loading mechanisms.

The general idea would be to have a generic spider and some sort of site-specific configuration that maps item fields to the XPath expressions or CSS selectors used to retrieve the field values from the page. In real life, things are not as simple as they seem: some fields would require post-processing, other fields would need to be extracted after sending a separate request, and so on. In other words, it would be very difficult to keep the spider generic and reliable at the same time.

The generic spider should receive a target site as a parameter, read the site-specific configuration and crawl the site according to it.

Also see:

Broad Crawls
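
For reference, the Broad Crawls page of the Scrapy documentation suggests tuning a handful of settings when crawling many domains. The fragment below is a sketch of a `settings.py` along those lines; the setting names are real Scrapy settings, but the values are only starting points to adjust for your own hardware and target sites.

```python
# settings.py (fragment): starting points for a broad crawl, to be tuned.

CONCURRENT_REQUESTS = 100          # raise global concurrency
REACTOR_THREADPOOL_MAXSIZE = 20    # more threads for DNS resolution
LOG_LEVEL = "INFO"                 # avoid the cost of DEBUG logging
COOKIES_ENABLED = False            # cookies are rarely needed for listings
RETRY_ENABLED = False              # do not spend time retrying failures
DOWNLOAD_TIMEOUT = 15              # give up on slow sites sooner (seconds)
```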

edited May 23, 2017 at 12:27

answered Nov 18, 2014 at 6:31

alecxe

5 Comments

bubi Over a year ago

Thanks for the quick reply. Is there any way I could minimize the processing time? Given a huge list, what would be an effective way?

alecxe Over a year ago

@Bubi you are welcome. Do you mean you have a huge list of web-sites you want to crawl?

bubi Over a year ago

Yes. I meant that. I have a big list of domains to crawl.

alecxe Over a year ago

@Bubi yeah, got it. I'd keep your domains in the database. Then, separately, go through your URLs one by one, create field annotations, and save them in the database per domain. In the spider, read the domains from the database and start requests. It's difficult to help without knowing the specifics. Hope things are at least clearer now.

bubi Over a year ago

Yes, this seems to be a good one. Will give it a try. Thank you for the suggestion!
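
The database approach from the comments could look roughly like the sketch below. It assumes SQLite via the standard-library `sqlite3` module and a hypothetical `site_config` table with the field annotations stored as JSON; the table layout, function names, and selectors are illustrative only.

```python
import json
import sqlite3


def create_schema(conn):
    # One row per domain; field annotations are stored as a JSON object
    # mapping item field names to selectors.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS site_config (
            domain TEXT PRIMARY KEY,
            start_url TEXT NOT NULL,
            fields_json TEXT NOT NULL
        )
        """
    )


def save_site(conn, domain, start_url, fields):
    # Insert the annotations for a domain, replacing any earlier version.
    conn.execute(
        "INSERT OR REPLACE INTO site_config VALUES (?, ?, ?)",
        (domain, start_url, json.dumps(fields)),
    )
    conn.commit()


def load_sites(conn):
    # Return {domain: {"start_url": ..., "fields": {...}}} for all domains.
    rows = conn.execute(
        "SELECT domain, start_url, fields_json FROM site_config ORDER BY domain"
    )
    return {
        domain: {"start_url": start_url, "fields": json.loads(fields_json)}
        for domain, start_url, fields_json in rows
    }


# Example with an in-memory database and made-up selectors.
conn = sqlite3.connect(":memory:")
create_schema(conn)
save_site(
    conn,
    "somebrand1.com",
    "http://www.somebrand1.com/mobiles/",
    {"name": "h2::text", "price": "span.price::text"},
)
configs = load_sites(conn)
```

In the spider, the result of `load_sites()` would be iterated when generating the initial requests, yielding one `scrapy.Request` per `start_url` and passing the matching `fields` mapping along to the callback (for instance through the request's `cb_kwargs`).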