
I currently have a simple webpage that consists only of a .js, a .css, and an .html file. I do not want to use any Node.js stuff.

Given these constraints, is it possible to search the content of external webpages using JavaScript (e.g. by running a web worker in the background)?

For example, I would like to get the first URL of a Google image search.

Edit:

I tried it and it worked fine; however, after two weeks I now get this error:

Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at .... (Reason: CORS header ‘Access-Control-Allow-Origin’ missing).

Any ideas how to solve this?

Here is the error as described by Firefox: https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS/Errors/CORSMissingAllowOrigin

If the website you're trying to scrape doesn't support CORS, you can't circumvent the issue without a server to proxy the request. Commented Apr 25, 2019 at 20:00

3 Answers


Yes, this is possible. Just use the XMLHttpRequest API:

var request = new XMLHttpRequest();
// The proxy prefix works around CORS by relaying the request server-side;
// run your own proxy instead of relying on a third-party one.
request.open("GET", "https://bypasscors.herokuapp.com/api/?url=" + encodeURIComponent("https://duckduckgo.com/html/?q=stack+overflow"), true);  // last parameter must be true (asynchronous)
request.responseType = "document";  // the response is parsed into a Document for us
request.onload = function (e) {
  if (request.readyState === 4) {
    if (request.status === 200) {
      // Select the anchor of the first DuckDuckGo result
      var a = request.responseXML.querySelector("div.result:nth-child(1) > div:nth-child(1) > h2:nth-child(1) > a:nth-child(1)");
      console.log(a.href);
      document.body.appendChild(a);
    } else {
      console.error(request.status, request.statusText);
    }
  }
};
request.onerror = function (e) {
  console.error(request.status, request.statusText);
};
request.send(null);  // not a POST request, so don't send extra data

Note that I had to use a proxy to bypass CORS issues; if you want to do this, run your own proxy on your own server.
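
Such a proxy can be quite small. Below is a minimal sketch, assuming Node.js on the server; the port, the url query parameter name, and the HTTPS-only target handling are my own illustrative choices, not the proxy used above:

// A minimal CORS proxy sketch: fetches ?url=<target> and relays it
// with a permissive Access-Control-Allow-Origin header.
const http = require("http");
const https = require("https");

http.createServer(function (req, res) {
  // Expect requests of the form /?url=<encoded target URL>
  var target = new URL(req.url, "http://localhost").searchParams.get("url");
  if (!target) {
    res.writeHead(400);
    return res.end("missing url parameter");
  }
  https.get(target, function (upstream) {  // assumes an https:// target
    res.writeHead(upstream.statusCode, {
      "Content-Type": upstream.headers["content-type"] || "text/html",
      "Access-Control-Allow-Origin": "*"  // the header the browser reported missing
    });
    upstream.pipe(res);  // stream the target page back to the browser
  }).on("error", function (err) {
    res.writeHead(502);
    res.end(err.message);
  });
}).listen(8080);

The Access-Control-Allow-Origin: * response header is exactly what the browser said was missing in the error above.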


6 Comments

How can I use this now to get vocabulary data from dict.leo.org/englisch-deutsch/hallo? I tried with that URL, but all I get is <a class="result__a" rel="nofollow" href="dict.leo.org/englisch-deutsch/hallo"> and not the result in German (Deutsch), which is what I want.
@sqp_125 Just read the URL (a.href) and then request that page using the same method. Also, make sure you set up your own CORS proxy for the actual code; it's extremely impolite to use other people's servers in this way without their permission.
Sorry, I need more explanation about that. a.href gives me dict.leo.org/englisch-deutsch/hallo. Do you have a link for how to set up such a CORS proxy?
@sqp_125 That's good! Now just run the same code, but with encodeURIComponent(a.href) instead. To set up a CORS proxy, set up a normal proxy but ensure that it returns the header Access-Control-Allow-Origin: *. Here's a reference implementation in Node.js.
Wow, thanks so much, this works! Do you know if I can also transform your JS code to Brython using an Ajax request? brython.info/static_doc/en/ajax.html I would like to use this because I would like to code only in Python and then transform my code with Brython to JS (which is done automatically). Thank you so much!

Yes, it is theoretically possible to do “web scraping” (i.e. parsing webpages) on the client. There are several restrictions, however, and I would question why you wouldn’t choose a program that runs on a server or desktop instead.

Web workers are able to request HTML content using XMLHttpRequest or fetch; parsing the result into a DOM, however, has to happen on the main thread, since workers have no DOM access (and thus no DOMParser). Note that the target webpage must send the appropriate CORS headers if it belongs to a foreign domain. You could then pick out content from the resulting HTML; see the sketch below.
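
To make that split concrete, here is a minimal sketch; the worker.js file name, the placeholder URL, and the selector are illustrative assumptions:

// worker.js -- runs in the background; workers have no DOM, so we only fetch here
self.onmessage = function (e) {
  fetch(e.data)  // e.data is the target URL (it must allow CORS, or go through a proxy)
    .then(function (response) { return response.text(); })
    .then(function (html) { self.postMessage(html); })
    .catch(function (err) { self.postMessage("fetch failed: " + err.message); });
};

// main.js -- the main thread has DOMParser, so parsing happens here
var worker = new Worker("worker.js");
worker.onmessage = function (e) {
  var doc = new DOMParser().parseFromString(e.data, "text/html");
  var link = doc.querySelector("a");  // illustrative: grab the first link on the page
  console.log(link ? link.href : "no link found");
};
worker.postMessage("https://example.com/");  // placeholder URL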

Parsing content generated with CSS and JavaScript will be harder. You will either have to construct sandboxed content on your host page from the input stream, or run some kind of parser, which doesn’t seem very feasible.

In short, the answer to your question is yes, because you have the tools to do a network request and a Turing-complete language with which to build any kind of parsing and scraping you want. So technically anything is possible.

But the real question is: would it be wise? Would you ever choose this approach when other technologies are at hand? Well, no. For most cases I don’t see why you wouldn’t just write a server-side program using e.g. headless Chrome.

If you don’t want to use Node - or aren’t able to deploy Node for some reason - there are many web scraping packages and prior art in languages such as Go, C, Java and Python. Search the package manager of your preferred programming language and you will likely find several.

4 Comments

Nice reply, thanks. Do you have a minimal example or tutorial to get started using just JS (even if it is not wise)? Yep, I saw nice packages for Python. However, I would then have to write a Python program which cannot be launched directly in the browser (e.g. for Brython no Selenium or urllib2/urllib package is available yet).
Why do you need to run the program in the browser?
Well, because I would like to access it from everywhere (my smartphone, etc.), and everyone should be able to use it without downloading anything.
It sounds like you need a server that runs the scraper, and provides a web-based interface for starting the scraper process and sending the results back to the user asynchronously (e.g. by email).

I have heard about Python for scraping too, but Node.js + Puppeteer kicks ass... and it is pretty easy to learn.
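
For anyone curious, a minimal sketch of that approach might look like this, assuming Puppeteer is installed (npm install puppeteer); the search URL and selector mirror the first answer and are illustrative:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();  // starts headless Chrome
  const page = await browser.newPage();
  await page.goto("https://duckduckgo.com/html/?q=stack+overflow");
  // Read the href of the first result link; no CORS proxy is needed server-side
  const href = await page.$eval("div.result a.result__a", function (a) { return a.href; });
  console.log(href);
  await browser.close();
})();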

Comments
