For one of my web projects I need to scrape data from different web sources. To keep it simple, I will explain with an example.

Let's say I want to scrape data about the mobile phones listed on their manufacturers' sites.

http://www.somebrand1.com/mobiles/ . . http://www.somebrand3.com/phones/

I have a huge list of URLs. Each brand's page has its own HTML structure.

How can I write a single, normalized script that traverses the HTML of those listing pages and scrapes the data regardless of the format it is in?

Or do I need to write a separate script for every pattern?

1 Answer

This is called broad crawling and, generally speaking, it is not easy to implement, because web sites differ in nature, in how they present data, and in how they load it.

The general idea would be to have a generic spider and some sort of site-specific configuration that maps item fields to the XPath expressions or CSS selectors used to retrieve the field values from the page. In real life, things are not as simple as they seem: some fields require post-processing, others can only be extracted after sending a separate request, and so on. In other words, it would be very difficult to keep the spider generic and reliable at the same time.

The generic spider should receive a target site as a parameter, read the site-specific configuration and crawl the site according to it.
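As a rough illustration of that idea, here is a minimal sketch of a configuration-driven extractor. It is not a full spider: it skips downloading and only shows the mapping from item fields to path expressions. The `SITE_CONFIGS` structure, the selectors, and the sample markup are all made up for the example. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and requires well-formed markup, so real pages would likely need a tolerant HTML parser such as lxml or parsel instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site configuration: which nodes are items, and which
# path expression yields each field inside an item.
SITE_CONFIGS = {
    "somebrand1.com": {
        "item": ".//div[@class='phone']",
        "fields": {"name": "./h2", "price": "./span[@class='price']"},
    },
    "somebrand3.com": {
        "item": ".//li[@class='product']",
        "fields": {"name": "./a", "price": "./em"},
    },
}


def extract_items(site, markup):
    """Return one dict per listed item, using the configuration for `site`."""
    config = SITE_CONFIGS[site]
    root = ET.fromstring(markup)  # assumes well-formed markup
    items = []
    for node in root.findall(config["item"]):
        item = {}
        for field, path in config["fields"].items():
            found = node.find(path)
            # A missing or empty field becomes None instead of raising.
            has_text = found is not None and found.text
            item[field] = found.text.strip() if has_text else None
        items.append(item)
    return items


# Made-up sample pages standing in for two differently structured sites.
PAGE_1 = (
    "<html><body><div class='phone'><h2>Model A</h2>"
    "<span class='price'>199</span></div></body></html>"
)
PAGE_3 = (
    "<html><body><ul><li class='product'><a>Model B</a>"
    "<em>299</em></li></ul></body></html>"
)

print(extract_items("somebrand1.com", PAGE_1))
print(extract_items("somebrand3.com", PAGE_3))
```

The extraction function never changes; supporting a new site means adding one more configuration entry. Fields that need post-processing or a separate request would not fit this simple mapping and would need extra hooks.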

5 Comments

Thanks for the quick reply. Is there any way I could minimize the processing time? Given a huge list, what would be an effective approach?
@Bubi you are welcome. Do you mean you have a huge list of web-sites you want to crawl?
Yes. I meant that. I have a big list of domains to crawl.
@Bubi yeah, got it. I'd keep your domains in a database. Then, separately, go through your URLs one by one, create field annotations, and save them in the database per domain. In the spider, read the domains from the database and start the requests. It's difficult to help without knowing the specifics. I hope things are at least clearer now.
Yes, this seems like a good approach. I will give it a try. Thank you for the suggestion!
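The database-backed setup suggested in the comments could look roughly like the sketch below. It assumes SQLite (via the standard `sqlite3` module) and an in-memory database for brevity. The table layout, the function names, and the selectors are hypothetical; only the two start URLs come from the question.

```python
import sqlite3


def create_store():
    """Create an in-memory database with domains and per-domain annotations."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE domains (
            id INTEGER PRIMARY KEY,
            domain TEXT UNIQUE,
            start_url TEXT
        );
        CREATE TABLE annotations (
            domain_id INTEGER REFERENCES domains(id),
            field TEXT,
            selector TEXT
        );
        """
    )
    return conn


def add_domain(conn, domain, start_url, annotations):
    """Store a domain together with its field -> selector annotations."""
    cur = conn.execute(
        "INSERT INTO domains (domain, start_url) VALUES (?, ?)",
        (domain, start_url),
    )
    conn.executemany(
        "INSERT INTO annotations (domain_id, field, selector) VALUES (?, ?, ?)",
        [(cur.lastrowid, field, sel) for field, sel in annotations.items()],
    )
    conn.commit()


def load_crawl_jobs(conn):
    """Return what a spider would need at start-up: one job per domain."""
    domains = conn.execute(
        "SELECT id, domain, start_url FROM domains ORDER BY id"
    ).fetchall()
    jobs = []
    for domain_id, domain, start_url in domains:
        rows = conn.execute(
            "SELECT field, selector FROM annotations WHERE domain_id = ?",
            (domain_id,),
        ).fetchall()
        jobs.append({"domain": domain, "start_url": start_url, "fields": dict(rows)})
    return jobs


conn = create_store()
# The selectors below are placeholders, not taken from any real site.
add_domain(conn, "somebrand1.com", "http://www.somebrand1.com/mobiles/",
           {"name": "//h2/text()", "price": "//span[@class='price']/text()"})
add_domain(conn, "somebrand3.com", "http://www.somebrand3.com/phones/",
           {"name": "//li/a/text()"})

for job in load_crawl_jobs(conn):
    print(job["start_url"], job["fields"])
```

Keeping annotations in their own table means a new domain can be added, or a broken selector fixed, without touching the spider code.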
