Can Javascript read the source of any web page?

Question

I am working on screen scraping and want to retrieve the source code of a particular page.

How can I achieve this with JavaScript? Please help me.

Here is a similar page where you may find your answer; it solved my problem of getting the source of an HTML page: stackoverflow.com/questions/1367587/javascript-page-source-code
– Asim Sajjad, Commented May 11, 2012 at 10:11
@mikenvck Why did you even mention PHP when the question was about JavaScript? The answers below show how to do this with JavaScript.
– corgrath, Commented Jul 21, 2012 at 15:13
To get the source of a link, you may need to use $.ajax for external links. Here is the solution: stackoverflow.com/a/18447625/2657601
– otaxige_aol, Commented Aug 26, 2013 at 15:36
Not a single answer was native JavaScript; all of them were jQuery-based.
– ILikeTacos, Commented Feb 3, 2014 at 19:57
jQuery is native JavaScript. It's just JavaScript you can copy from jquery.com instead of from stackoverflow.com.
– Quentin, Commented Mar 6, 2015 at 17:31

CodeTalker · Accepted Answer · 2019-12-09 00:42:22Z

Simple way to start, try jQuery

$("#links").load("/Main_Page #jq-p-Getting-Started li");

More at jQuery Docs

A more structured way to do screen scraping is to use YQL (Yahoo Query Language). It returns the scraped data structured as JSON or XML.
For example, let's scrape stackoverflow.com:

select * from html where url="http://stackoverflow.com"

will give you a JSON array (I chose that option) like this

 "results": {
   "body": {
    "noscript": [
     {
      "div": {
       "id": "noscript-padding"
      }
     },
     {
      "div": {
       "id": "noscript-warning",
       "p": "Stack Overflow works best with JavaScript enabled"
      }
     }
    ],
    "div": [
     {
      "id": "notify-container"
     },
     {
      "div": [
       {
        "id": "header",
        "div": [
         {
          "id": "hlogo",
          "a": {
           "href": "/",
           "img": {
            "alt": "logo homepage",
            "height": "70",
            "src": "http://i.stackoverflow.com/Content/Img/stackoverflow-logo-250.png",
            "width": "250"
           }
……..

The beauty of this is that you can use projections and where clauses, which ultimately gets you structured scraped data containing only what you need (and much less bandwidth over the wire).
For example:

select * from html where url="http://stackoverflow.com" and
      xpath='//div/h3/a'

will get you

 "results": {
   "a": [
    {
     "href": "/questions/414690/iphone-simulator-port-for-windows-closed",
     "title": "Duplicate: Is any Windows simulator available to test iPhone application? as a hobbyist who cannot afford a mac, i set up a toolchain kit locally on cygwin to compile objecti … ",
     "content": "iphone\n                simulator port for windows [closed]"
    },
    {
     "href": "/questions/680867/how-to-redirect-the-web-page-in-flex-application",
     "title": "I have a button control ....i need another web page to be redirected while clicking that button .... how to do that ? Thanks ",
     "content": "How\n                to redirect the web page in flex application ?"
    },
…..

Now to get only the questions we do a

select title from html where url="http://stackoverflow.com" and
      xpath='//div/h3/a'

Note the title in projections

 "results": {
   "a": [
    {
     "title": "I don't want the function to be entered simultaneously by multiple threads, neither do I want it to be entered again when it has not returned yet. Is there any approach to achieve … "
    },
    {
     "title": "I'm certain I'm doing something really obviously stupid, but I've been trying to figure it out for a few hours now and nothing is jumping out at me. I'm using a ModelForm so I can … "
    },
    {
     "title": "when i am going through my project in IE only its showing errors A runtime error has occurred Do you wish to debug? Line 768 Error:Expected')' Is this is regarding any script er … "
    },
    {
     "title": "I have a java batch file consisting of 4 execution steps written for analyzing any Java application. In one of the steps, I'm adding few libs in classpath that are needed for my co … "
    },
    {
……

Once you write your query it generates a url for you

http://query.yahooapis.com/v1/public/yql?q=select%20title%20from%20html%20where%20url%3D%22http%3A%2F%2Fstackoverflow.com%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fdiv%2Fh3%2Fa'%0A%20%20%20%20&format=json&callback=cbfunc

in our case.

So ultimately you end up doing something like this

$.getJSON(theAboveUrl, function (titleList) { /* titleList is the parsed response */ });

and play with it.

Beautiful, isn’t it?

Brilliant, especially for hinting to the poor-man's solution at yahoo that eliminates the need for a proxy to fetch the data. Thank you!! I took the liberty to fix the last demo-link to query.yahooapis.com: it was missing a % sign in the url-encoding. Cool that this still works!!
Any idea how to scrape image and meta description from amazon.in/Xiaomi-Redmi-4A-Grey-16GB/dp/… ?
query.yahooapis has been retired as of Jan. 2019. Looks really neat, too bad we can't use it now. See tweet here: twitter.com/ydn/status/1079785891558653952?ref_src=twsrc%5Etfw

karim79 · Accepted Answer · 2009-03-27 00:41:11Z

33

Javascript can be used, as long as you grab whatever page you're after via a proxy on your domain:

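<html>
<head>
<script src="/js/jquery-1.3.2.js"></script>
</head>
<body>
<script>
// The proxy on your own domain fetches the remote page and returns its source.
$.get("http://www.mydomain.com/?url=" + encodeURIComponent("http://www.google.com"), function(response) {
    alert(response);
});
</script>
</body>
</html>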
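
For anyone wondering what the server side of such a proxy might look like, here is a minimal sketch. It is not from the answer itself: it assumes Node.js 18 or later (for the built-in fetch), and the names extractTargetUrl and createProxyServer, the port, and the plain-text response type are all made up for illustration.

```javascript
// Sketch of a same-domain proxy, assuming Node.js 18+ (global fetch available).
const http = require('node:http');

// Pull the target out of "/?url=..." and accept only absolute http(s) addresses.
function extractTargetUrl(requestUrl) {
  const target = new URL(requestUrl, 'http://localhost').searchParams.get('url');
  if (!target) return null;
  try {
    const parsed = new URL(target);
    const allowed = parsed.protocol === 'http:' || parsed.protocol === 'https:';
    return allowed ? parsed.href : null;
  } catch (err) {
    return null; // not an absolute URL
  }
}

// Build (but do not start) a server that relays the remote page source as text.
function createProxyServer() {
  return http.createServer(async (req, res) => {
    const target = extractTargetUrl(req.url);
    if (!target) {
      res.writeHead(400, { 'Content-Type': 'text/plain' });
      res.end('Missing or invalid "url" parameter');
      return;
    }
    try {
      const upstream = await fetch(target);
      res.writeHead(upstream.status, { 'Content-Type': 'text/plain; charset=utf-8' });
      res.end(await upstream.text());
    } catch (err) {
      res.writeHead(502, { 'Content-Type': 'text/plain' });
      res.end('Upstream request failed');
    }
  });
}

// To run it (hypothetical port): createProxyServer().listen(8080);
```

Because the page and the proxy share an origin, the browser lets the page read the response. A real deployment would probably also restrict which hosts may be fetched, so the proxy cannot be abused as an open relay.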

edited Mar 27, 2009 at 0:41

answered Mar 25, 2009 at 8:06

karim79


5 Comments

Ravindranath Akila Over a year ago

Why is a domain based proxy required?

Ferdi265 Over a year ago

because of the Same Origin Policy

S Meaden Over a year ago

that's really interesting. presumably there is some code to install on the server to make that happen?

S Meaden Over a year ago

@ejbytes: actually I think node.js has some modules. I'm presuming OP wants to web scrape.

Gerrit B Over a year ago

You will get a 'from origin 'null' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.' if you are not on the same domain though

Emma Marcier · Accepted Answer · 2020-11-20 01:44:30Z

8

You can use fetch:

const URL = 'https://www.sap.com/belgique/index.html';
fetch(URL)
.then(res => res.text())
.then(text => {
    console.log(text);
})
.catch(err => console.log(err));
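
A slightly more defensive variant of the same idea, as a sketch only: fetch does not reject on HTTP error statuses such as 404, so this version checks res.ok before reading the body. The function name fetchPageSource and the optional fetchImpl parameter (handy for testing) are my own additions, not part of the answer. Keep in mind that a cross-origin request still fails in the browser unless the target server sends CORS headers.

```javascript
// Sketch: fetch a page's source and fail loudly on HTTP errors.
// fetchImpl defaults to the global fetch; passing a stub makes this testable offline.
async function fetchPageSource(url, fetchImpl = fetch) {
  const res = await fetchImpl(url);
  if (!res.ok) {
    // fetch resolves for 4xx/5xx responses, so the status has to be checked by hand
    throw new Error('Request failed with status ' + res.status);
  }
  return res.text();
}

// Usage (in a browser, same origin or CORS-enabled target):
// fetchPageSource('/index.html').then(console.log).catch(console.error);
```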

edited Nov 20, 2020 at 1:44

Emma Marcier


answered Nov 20, 2020 at 1:40

Sarah


Cerebrus · Accepted Answer · 2009-03-25 07:40:46Z

7

You could simply use XmlHttp (AJAX) to hit the required URL and the HTML response from the URL will be available in the responseText property. If it's not the same domain, your users will receive a browser alert saying something like "This page is trying to access a different domain. Do you want to allow this?"

answered Mar 25, 2009 at 7:40

Cerebrus


1 Comment

Alex from Jitbit Over a year ago

Unfortunately, you won't receive any alert, it will just block the request

nickf · Accepted Answer · 2009-03-25 07:37:25Z

5

As a security measure, Javascript can't read files from different domains. Though there might be some strange workaround for it, I'd consider a different language for this task.

answered Mar 25, 2009 at 7:37

nickf


kkyy · Accepted Answer · 2009-03-25 07:39:20Z

4

If you absolutely need to use JavaScript, you could load the page source with an Ajax request.

Note that with JavaScript, you can only retrieve pages that are located on the same domain as the requesting page.
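
To make the "same domain" rule concrete: browsers actually compare origins, meaning scheme, host, and port together. The helper below is a rough sketch of that comparison using the standard URL class; the name isSameOrigin and the example addresses are hypothetical.

```javascript
// Sketch: would a request from pageUrl to targetUrl count as same-origin?
function isSameOrigin(targetUrl, pageUrl) {
  // Relative targets are resolved against the page, as a browser would do.
  return new URL(targetUrl, pageUrl).origin === new URL(pageUrl).origin;
}

console.log(isSameOrigin('/about.html', 'https://example.com/index.html'));             // true
console.log(isSameOrigin('https://www.google.com/', 'https://example.com/index.html')); // false
console.log(isSameOrigin('http://example.com/', 'https://example.com/index.html'));     // false (scheme differs)
```

Note that a target written without a scheme, such as 'www.google.com', is treated as a relative path on the current site rather than as another domain.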

answered Mar 25, 2009 at 7:39

kkyy


Sergej Andrejev · Accepted Answer · 2009-03-25 07:49:44Z

3

Using jquery

<html>
<head>
<script src="http://jqueryjs.googlecode.com/files/jquery-1.3.2.js" ></script>
</head>
<body>
<script>
// Without the scheme, "www.google.com" would be treated as a relative path.
$.get("http://www.google.com", function(response) { alert(response); });
</script>
</body>
</html>

answered Mar 25, 2009 at 7:49

Sergej Andrejev


1 Comment

karim79 Over a year ago

You can't request a page outside of your domain in this way, you have to do it via proxy, e.g. $.get('mydomain.com/?url=www.google.com')

David Hudman · Accepted Answer · 2017-05-07 20:40:12Z

I used ImportIO. They let you request the HTML from any website if you set up an account with them (which is free). They let you make up to 50k requests per year. I didn't take the time to find an alternative, but I'm sure there are some.

In your Javascript, you'll basically just make a GET request like this:

var request = new XMLHttpRequest();

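request.onreadystatechange = function() {
  // readyState 4 means the response has fully arrived
  if (request.readyState === 4 && request.status === 200) {
    var jsontext = request.responseText;
    alert(jsontext);
  }
};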

request.open("GET", "https://extraction.import.io/query/extractor/THE_PUBLIC_LINK_THEY_GIVE_YOU?_apikey=YOUR_KEY&url=YOUR_URL", true);

request.send();

Sidenote: I found this question while researching what I felt like was the same question, so others might find my solution helpful.

UPDATE: I created a new one which they just allowed me to use for less than 48 hours before they said I had to pay for the service. It seems that they shut down your project pretty quick now if you aren't paying. I made my own similar service with NodeJS and a library called NightmareJS. You can see their tutorial here and create your own web scraping tool. It's relatively easy. I haven't tried to set it up as an API that I could make requests to or anything.

Jonathan Gray · Accepted Answer · 2014-10-26 20:58:22Z

1

You can bypass the same-origin-policy by either creating a browser extension or even saving the file as .hta in Windows (HTML Application).

answered Oct 26, 2014 at 20:58

Jonathan Gray


Neville Hillyer · Accepted Answer · 2015-03-06 13:29:30Z

1

Despite many comments to the contrary I believe that it is possible to overcome the same origin requirement with simple JavaScript.

I am not claiming that the following is original because I believe I saw something similar elsewhere a while ago.

I have only tested this with Safari on a Mac.

The following demonstration fetches the page in the base tag and moves its innerHTML to a new window. My script adds html tags, but with most modern browsers this could be avoided by using outerHTML.

<html>
<head>
<base href='http://apod.nasa.gov/apod/'>
<title>test</title>
<style>
body { margin: 0 }
textarea { outline: none; padding: 2em; width: 100%; height: 100% }
</style>
</head>
<body onload="w=window.open('#'); x=document.getElementById('t'); a='<html>\n'; b='\n</html>'; setTimeout('x.innerHTML=a+w.document.documentElement.innerHTML+b; w.close()',2000)">
<textarea id=t></textarea>
</body>
</html>

answered Mar 6, 2015 at 13:29

Neville Hillyer


8 Comments

Neville Hillyer Over a year ago

I use Safari 5.0.6 with webkit patches to update it to the equivalent of more recent versions. Which version of Safari did you use and what happened?

Quentin Over a year ago

8.0.3. Nothing happened other than some errors (which I didn't memorise) appeared in the console.

Neville Hillyer Over a year ago

Which Safari are you using and what exactly were the errors?

Quentin Over a year ago

Still 8.0.3 and if you really want me to reproduce the test case: TypeError: undefined is not an object (evaluating 'w.document')

Quentin Over a year ago

The most likely explanation for what you've managed to do is that you've found a security hole that exists thanks to some combination of your positively ancient browser and the unofficial patches to it. That isn't something of practical use in most cases.


inputforcolor · Accepted Answer · 2019-11-21 02:18:42Z

1

javascript:(function () {
  alert("Inspect Element On");
  // make the whole page editable in place
  document.body.contentEditable = 'true';
  document.designMode = 'on';
  alert(document.documentElement.innerHTML);
})();

Highlight this, drag it to your bookmarks bar, and click it whenever you want to edit and view the current site's source code.

edited Nov 21, 2019 at 2:18

inputforcolor


answered Nov 20, 2019 at 21:27

Roger Keene


Vatsal Juneja · Accepted Answer · 2012-06-22 16:34:44Z

0

You can create an XMLHttpRequest, request the page, and then read the responseText property to get the content.
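
A minimal sketch of that approach, wrapped in a Promise. The function name requestPageSource and the optional XHR parameter (which lets a stand-in constructor be passed for testing) are assumptions for illustration. As with the other approaches here, the browser will block the response unless the page is on the same origin or the server allows cross-origin access.

```javascript
// Sketch: load a page's source with XMLHttpRequest and read responseText.
// XHR defaults to the browser's XMLHttpRequest; a stub can be injected instead.
function requestPageSource(url, XHR = XMLHttpRequest) {
  return new Promise(function (resolve, reject) {
    var request = new XHR();
    request.open('GET', url, true); // true = asynchronous
    request.onload = function () {
      if (request.status >= 200 && request.status < 300) {
        resolve(request.responseText);
      } else {
        reject(new Error('HTTP status ' + request.status));
      }
    };
    request.onerror = function () {
      reject(new Error('Network error or blocked cross-origin request'));
    };
    request.send();
  });
}

// Usage in a browser:
// requestPageSource('/index.html').then(function (html) { console.log(html); });
```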

answered Jun 22, 2012 at 16:34

Vatsal Juneja


aidanjacobson · Accepted Answer · 2014-10-26 20:14:00Z

0

You can use the FileReader API to get a file, and when selecting a file, put the url of your web page into the selection box. Use this code:

function readFile() {
    var f = document.getElementById("yourfileinput").files[0];
    if (f) {
        var r = new FileReader();
        // onload fires once the file has been read completely
        r.onload = function(e) {
            alert(r.result);
        };
        r.readAsText(f);
    } else {
        alert("file could not be found");
    }
}

answered Oct 26, 2014 at 20:14

aidanjacobson


Alejandro · Accepted Answer · 2018-06-11 12:47:15Z

0

jQuery is not the way to do things. Do it in pure JavaScript:

var r = new XMLHttpRequest();
// the third argument (false) makes the request synchronous
r.open('GET', 'https://yahoo.com', false);
r.send(null);
if (r.status == 200) { alert(r.responseText); }

answered Jun 11, 2018 at 12:47

Alejandro


Steev James · Accepted Answer · 2019-07-31 05:13:27Z

0

<script>
    $.getJSON('http://www.whateverorigin.org/get?url=' + encodeURIComponent('https://example.com/') + '&callback=?', function (data) {
        alert(data.contents);
    });

</script>

Include jQuery and use this code to get the HTML of another website. Replace example.com with your website.

This method involves an external server fetching the site's HTML and sending it to you. :)
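
One detail worth spelling out is why the target address goes through encodeURIComponent: it is nested inside another URL's query string, so its own ?, & and = characters have to be escaped. Here is a small sketch; the helper name buildProxyRequestUrl is made up, and the proxy base is simply the one used in the snippet.

```javascript
// Sketch: nest a target URL inside a proxy URL's query string.
function buildProxyRequestUrl(proxyBase, targetUrl) {
  // Without encoding, "&b=2" in the target would be read as a parameter of the proxy URL.
  return proxyBase + '?url=' + encodeURIComponent(targetUrl);
}

console.log(buildProxyRequestUrl('http://www.whateverorigin.org/get', 'https://example.com/?a=1&b=2'));
// prints "http://www.whateverorigin.org/get?url=https%3A%2F%2Fexample.com%2F%3Fa%3D1%26b%3D2"
```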

answered Jul 31, 2019 at 5:13

Steev James


Henrik Schmid · Accepted Answer · 2020-12-21 11:47:55Z

0

On Linux:

Download SlimerJS (slimerjs.org).
Download Firefox version 59.
Add this environment variable: export SLIMERJSLAUNCHER=/home/en/Letöltések/firefox59/firefox/firefox

From the SlimerJS download directory, run this .js program (./slimerjs program.js):

 var page = require('webpage').create();
 page.open(
  'http://www.google.com/search?q=görény',
   function() 
   {
     page.render('goo2.pdf');
     phantom.exit();
   }
 );

Use pdftotext to get text on the page.

answered Dec 21, 2020 at 11:47

Henrik Schmid


Zezo Android · Accepted Answer · 2022-05-19 09:24:33Z

0



    const URL = 'https://www.w3schools.com';
    fetch(URL)
    .then(res => res.text())
    .then(text => {
        console.log(text);
    })
    .catch(err => console.log(err));

answered May 19, 2022 at 9:24

Zezo Android
