Is it the right way to scrape other websites' content into my website using simple_html_dom? If it is wrong, please suggest a method to display news on my website.
Hmm... RSS feed? API? – odedta, May 14, 2015 at 5:51
I didn't know about that. Please guide me on how it works. – Vignesh Bala, May 14, 2015 at 5:53
I have actually never tried RSS feeds before, so I can't be of much help there, but it shouldn't be complicated, as the w3schools tutorial is rather short and straightforward: w3schools.com/webservices/rss_intro.asp As for an API (application programming interface), what I mean is that you can check whether that website already provides some kind of interface for developers like yourself to pull news from their site through documented functions. – odedta, May 14, 2015 at 5:54
It depends. An API is best, RSS/XML is second best, and scraping is third best. Scraping is the least stable since it is not a recognised mechanism for copying content, and you may find yourself blocked. To aid your long-term scraping, you should add a few seconds' delay between each scrape, read/parse/obey robots.txt, use a unique user agent string, and be willing to be blocked if that's what the site owner chooses. – halfer, May 14, 2015 at 6:03
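For illustration, here is a minimal PHP sketch of the kind of "polite" scraping the comment above describes: an identifiable user agent, a naive robots.txt check, and a delay between requests. The base URL, paths, and contact address are hypothetical placeholders, and the robots.txt handling is deliberately simplistic.

```php
<?php
// Minimal sketch of a "polite" scraper (hypothetical URLs throughout).

function fetch(string $url, string $userAgent): ?string {
    // Send an identifiable User-Agent header with the request.
    $context = stream_context_create([
        'http' => ['header' => "User-Agent: $userAgent\r\n"],
    ]);
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}

function isDisallowed(string $robotsTxt, string $path): bool {
    // Naive check: treat every "Disallow:" rule as applying to all agents.
    foreach (explode("\n", $robotsTxt) as $line) {
        if (preg_match('/^Disallow:\s*(\S+)/i', trim($line), $m)
            && strpos($path, $m[1]) === 0) {
            return true;
        }
    }
    return false;
}

$userAgent = 'MyNewsBot/1.0 (+https://example.com/contact)'; // example identifiable UA
$base      = 'https://news.example.com';                      // hypothetical site
$paths     = ['/headlines', '/world'];                        // hypothetical pages

$robots = fetch("$base/robots.txt", $userAgent) ?? '';

foreach ($paths as $path) {
    if (isDisallowed($robots, $path)) {
        continue; // obey robots.txt
    }
    $html = fetch($base . $path, $userAgent);
    if ($html !== null) {
        // ... parse $html here ...
    }
    sleep(3); // a few seconds between requests so the site is not overloaded
}
```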
1 Answer
simple_html_dom is a third-party library, I am guessing. If you are looking for something in core PHP (the built-in DOM extension), use DOMDocument.
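As a rough illustration of the DOMDocument approach, here is a minimal sketch that loads a page and pulls out headline elements with DOMXPath. The URL and the `h2.headline` markup are assumptions and would need to match the actual page.

```php
<?php
// Minimal sketch: parse HTML with PHP's built-in DOMDocument / DOMXPath.

$html = file_get_contents('https://news.example.com'); // hypothetical source

$doc = new DOMDocument();
// Real-world markup is rarely valid; suppress the resulting libxml warnings.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Grab every <h2 class="headline"> element (assumed structure).
foreach ($xpath->query('//h2[@class="headline"]') as $node) {
    echo trim($node->textContent), PHP_EOL;
}
```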
Basically, by scraping you are taking the site's content. If you are doing it with the consent of the site's team, then it's okay; otherwise it may not be legal (depending on their T&C). Sites also have mechanisms to block such acts.
Better to ask the site team for the content; they may be able to provide the data in a much better and simpler way, such as an API, an RSS feed, or direct database access.
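If the site does offer an RSS feed, displaying news from it is straightforward with PHP's built-in SimpleXML. A minimal sketch, assuming a hypothetical RSS 2.0 feed URL:

```php
<?php
// Minimal sketch: render news items from an RSS 2.0 feed (hypothetical URL).

$feed = simplexml_load_file('https://news.example.com/rss.xml');

if ($feed === false) {
    exit('Could not load the feed.');
}

// A standard RSS 2.0 feed exposes its entries under channel->item.
foreach ($feed->channel->item as $item) {
    printf(
        "<article><h3><a href=\"%s\">%s</a></h3><p>%s</p></article>\n",
        htmlspecialchars((string) $item->link),
        htmlspecialchars((string) $item->title),
        htmlspecialchars((string) $item->description)
    );
}
```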
2 Comments
halfer
If you crawl in the open (i.e. without a proxy) and have an identifiable user agent string, and you don't overload the scrape target, that is what search engines do and is fine in most jurisdictions. However, republishing the data can sometimes be seen as a breach of copyright, depending on the attitude of the author (e.g. search engines OK, price comparison sites not).
halfer
Note that law is not created by terms and conditions, thankfully - law is created by legislators. T&Cs attempt to bind users into a contract they haven't signed, and how binding that is probably depends on the country in question. Usually sites opposed to scraping (e.g. large consumer auction sites) will send a stiff legal letter, which will be far too expensive to challenge in court. Building a scraper service that is not dependent on the scraping of one site is thus very good advice!