
I currently have a simple webpage that consists only of a .js, a .css, and an .html file. I do not want to use any Node.js stuff.

Given these constraints, is it possible to search the content of external webpages using JavaScript (e.g. by running a web worker in the background)?

For example, I would like to get the first URL of a Google image search.

Edit:

I tried it and it worked fine; however, after two weeks I now get this error:

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at .... (Reason: CORS header ‘Access-Control-Allow-Origin’ missing).

Any ideas how to solve this?

Here is the error as described by Firefox: https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS/Errors/CORSMissingAllowOrigin

If the website you're trying to scrape doesn't support CORS, you can't circumvent the issue without a server to proxy the request. Commented Apr 25, 2019 at 20:00

3 Answers


Yes, this is possible. Just use the XMLHttpRequest API:

var request = new XMLHttpRequest();
// The proxy prefix works around CORS by relaying the request server-side;
// run your own proxy instead of relying on a third-party one.
request.open("GET", "https://bypasscors.herokuapp.com/api/?url=" + encodeURIComponent("https://duckduckgo.com/html/?q=stack+overflow"), true);  // last parameter must be true (asynchronous)
request.responseType = "document";  // the response is parsed into a Document for us
request.onload = function (e) {
  if (request.readyState === 4) {
    if (request.status === 200) {
      // Select the anchor of the first DuckDuckGo result
      var a = request.responseXML.querySelector("div.result:nth-child(1) > div:nth-child(1) > h2:nth-child(1) > a:nth-child(1)");
      console.log(a.href);
      document.body.appendChild(a);
    } else {
      console.error(request.status, request.statusText);
    }
  }
};
request.onerror = function (e) {
  console.error(request.status, request.statusText);
};
request.send(null);  // not a POST request, so don't send extra data

Note that I had to use a proxy to bypass CORS issues; if you want to do this, run your own proxy on your own server.
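
Such a proxy can be quite small. Below is a minimal sketch, assuming Node.js on the server; the port, the url query parameter name, and the HTTPS-only target handling are my own illustrative choices, not the proxy used above:

// A minimal CORS proxy sketch: fetches ?url=<target> and relays it
// with a permissive Access-Control-Allow-Origin header.
const http = require("http");
const https = require("https");

http.createServer(function (req, res) {
  // Expect requests of the form /?url=<encoded target URL>
  var target = new URL(req.url, "http://localhost").searchParams.get("url");
  if (!target) {
    res.writeHead(400);
    return res.end("missing url parameter");
  }
  https.get(target, function (upstream) {  // assumes an https:// target
    res.writeHead(upstream.statusCode, {
      "Content-Type": upstream.headers["content-type"] || "text/html",
      "Access-Control-Allow-Origin": "*"  // the header the browser reported missing
    });
    upstream.pipe(res);  // stream the target page back to the browser
  }).on("error", function (err) {
    res.writeHead(502);
    res.end(err.message);
  });
}).listen(8080);

The Access-Control-Allow-Origin: * response header is exactly what the browser said was missing in the error above.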


6 Comments

How can I use this now to get vocabulary data from dict.leo.org/englisch-deutsch/hallo? I tried with that URL, but all I get is <a class="result__a" rel="nofollow" href="dict.leo.org/englisch-deutsch/hallo"> and not the result in German (Deutsch), which is what I want.
@sqp_125 Just read the URL (a.href) and then request that page using the same method. Also, make sure you set up your own CORS proxy for the actual code; it's extremely impolite to use other people's servers in this way without their permission.
Sorry, I need more explanation about that. a.href gives me dict.leo.org/englisch-deutsch/hallo. Do you have a link for how to set up such a CORS proxy?
@sqp_125 That's good! Now just run the same code, but with encodeURIComponent(a.href) instead. To set up a CORS proxy, set up a normal proxy but ensure that it returns the header Access-Control-Allow-Origin: *. Here's a reference implementation in Node.js.
Wow, thanks so much, this works! Do you know if I can also transform your JS code to Brython using an Ajax request? brython.info/static_doc/en/ajax.html I would like to use this because I would like to code only in Python and then transform my code with Brython to JS (which is done automatically). Thank you so much!

Yes, it is theoretically possible to do “web scraping” (i.e. parsing webpages) on the client. There are several restrictions, however, and I would question why you wouldn’t choose a program that runs on a server or desktop instead.

Web workers are able to request HTML content using XMLHttpRequest or fetch; parsing the result into a DOM, however, has to happen on the main thread, since workers have no DOM access (and thus no DOMParser). Note that the target webpage must send the appropriate CORS headers if it belongs to a foreign domain. You could then pick out content from the resulting HTML; see the sketch below.
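
To make that split concrete, here is a minimal sketch; the worker.js file name, the placeholder URL, and the selector are illustrative assumptions:

// worker.js -- runs in the background; workers have no DOM, so we only fetch here
self.onmessage = function (e) {
  fetch(e.data)  // e.data is the target URL (it must allow CORS, or go through a proxy)
    .then(function (response) { return response.text(); })
    .then(function (html) { self.postMessage(html); })
    .catch(function (err) { self.postMessage("fetch failed: " + err.message); });
};

// main.js -- the main thread has DOMParser, so parsing happens here
var worker = new Worker("worker.js");
worker.onmessage = function (e) {
  var doc = new DOMParser().parseFromString(e.data, "text/html");
  var link = doc.querySelector("a");  // illustrative: grab the first link on the page
  console.log(link ? link.href : "no link found");
};
worker.postMessage("https://example.com/");  // placeholder URL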

Parsing content generated with CSS and JavaScript will be harder. You will either have to construct sandboxed content on your host page from the input stream, or run some kind of parser, which doesn’t seem very feasible.

In short, the answer to your question is yes, because you have the tools to do a network request and a Turing-complete language with which to build any kind of parsing and scraping you want. So technically anything is possible.

But the real question is: would it be wise? Would you ever choose this approach when other technologies are at hand? Well, no. For most cases I don’t see why you wouldn’t just write a server-side program using e.g. headless Chrome.

If you don’t want to use Node - or aren’t able to deploy Node for some reason - there are many web scraping packages and prior art in languages such as Go, C, Java and Python. Search the package manager of your preferred programming language and you will likely find several.

4 Comments

Nice reply, thanks. Do you have a minimal example or tutorial to get started using just JS (even if it is not wise)? Yep, I saw nice packages for Python. However, I would then have to write a Python program which cannot be launched directly in the browser (e.g. for Brython no Selenium or urllib2/urllib package is available yet).
Why do you need to run the program in the browser?
Well, because I would like to access it from everywhere (my smartphone, etc.), and everyone should be able to use it without downloading anything.
It sounds like you need a server that runs the scraper, and provides a web-based interface for starting the scraper process and sending the results back to the user asynchronously (e.g. by email).

I have heard about Python for scraping too, but Node.js + Puppeteer kicks ass... and it is pretty easy to learn.
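
For anyone curious, a minimal sketch of that approach might look like this, assuming Puppeteer is installed (npm install puppeteer); the search URL and selector mirror the first answer and are illustrative:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();  // starts headless Chrome
  const page = await browser.newPage();
  await page.goto("https://duckduckgo.com/html/?q=stack+overflow");
  // Read the href of the first result link; no CORS proxy is needed server-side
  const href = await page.$eval("div.result a.result__a", function (a) { return a.href; });
  console.log(href);
  await browser.close();
})();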

Comments
