
I want to code a Perl application that would crawl some websites and collect images and links from those webpages. Because most pages use JavaScript to generate their HTML content, I need something like a client browser with JavaScript support, so that I can parse the final HTML code after it has been generated and/or modified by JavaScript. What are my options?

If possible, please publish some implementation code or link to some example(s).


5 Answers


There are several options.


2 Comments

"What are my options?" was the question. A list of modules seems like a good list of options.
It's Perl. The documentation for modules tends to have code in it.

Options that spring to mind:

  • You could have Perl use Selenium and have a full-blown browser do the work for you.

  • You can download and compile V8 or another open source JavaScript engine and have Perl call an external program to evaluate the JavaScript.

  • I don't think Perl's LWP module supports JavaScript, but you might want to check that if you haven't done so already.
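The Selenium route might be sketched roughly as below, using Selenium::Remote::Driver from CPAN against a locally running Selenium server. This is an untested sketch, not a definitive recipe; the URL and element queries are illustrative:

```perl
use strict;
use warnings;
use Selenium::Remote::Driver;

# Assumes a Selenium server is listening on localhost:4444 (the default).
my $driver = Selenium::Remote::Driver->new;

# The real browser executes the page's JavaScript for us.
$driver->get('http://example.com/');

# The final, JS-modified DOM as a string.
my $html = $driver->get_page_source;

# Collect link targets and image sources from the rendered page.
my @links  = map { $_->get_attribute('href') } $driver->find_elements('a',   'tag_name');
my @images = map { $_->get_attribute('src')  } $driver->find_elements('img', 'tag_name');

$driver->quit;
```

The trade-off is weight: you need a running Selenium server and a real browser, but you get a fully faithful rendering.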



WWW::Scripter with the WWW::Scripter::Plugin::JavaScript and WWW::Scripter::Plugin::Ajax plugins seems like the closest you'll get without using an actual browser (the modules WWW::Selenium, Mozilla::Mechanize or Win32::IE::Mechanize use real browsers).
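A minimal WWW::Scripter sketch might look like this (hedged — plugin names are as documented on CPAN, and the URL is illustrative):

```perl
use strict;
use warnings;
use WWW::Scripter;

my $w = WWW::Scripter->new;
$w->use_plugin('JavaScript');   # WWW::Scripter::Plugin::JavaScript
$w->use_plugin('Ajax');         # WWW::Scripter::Plugin::Ajax (XMLHttpRequest support)

$w->get('http://example.com/');

# WWW::Scripter subclasses WWW::Mechanize, so the usual accessors apply.
print $_->url, "\n" for $w->links;
print $w->content;              # HTML after the scripts have run
```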

1 Comment

Does any module like WWW::Scripter support the V8 engine?

Check the complete working example featured in the article Scraping pages full of JavaScript. It uses Web::Scraper for HTML processing and Gtk3::WebKit to process dynamic content. However, the latter is quite a PITA to install. If there are not that many pages you need to scrape (< 1000), fetching the post-processed DOM content through PhantomJS is an interesting option. I've written the following script for that purpose:

// Save the final, JavaScript-processed DOM of a page to a file.
var page = require('webpage').create(),
    system = require('system'),
    fs = require('fs'),
    address, output;

if (system.args.length < 3 || system.args.length > 5) {
    console.log('Usage: phantomjs --load-images=no html.js URL filename');
    phantom.exit(1);
} else {
    address = system.args[1];
    output = system.args[2];
    page.open(address, function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
        } else {
            // page.content holds the DOM after the page's scripts have run
            fs.write(output, page.content, 'w');
        }
        phantom.exit();
    });
}
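From the Perl side, the script above can be driven with a plain system() call; a rough sketch (the script and output file names are illustrative):

```perl
use strict;
use warnings;

my $url  = 'http://example.com/';
my $file = 'page.html';

# html.js is the PhantomJS script shown above.
system('phantomjs', '--load-images=no', 'html.js', $url, $file) == 0
    or die "phantomjs failed: exit status $?";

# Slurp the post-processed DOM that PhantomJS wrote out.
open my $fh, '<', $file or die "Cannot read $file: $!";
my $html = do { local $/; <$fh> };
close $fh;

# $html can now be fed to Web::Scraper, HTML::TreeBuilder, etc.
```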

There's something like that on CPAN already, a module called Wight, but I haven't tested it yet.



WWW::Mechanize::Firefox can be used together with mozrepl, with full JavaScript support.
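Roughly like this (untested sketch; WWW::Mechanize::Firefox talks to a running Firefox instance with the mozrepl extension enabled):

```perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

# Firefox must already be running with the mozrepl extension listening.
my $mech = WWW::Mechanize::Firefox->new;

$mech->get('http://example.com/');   # Firefox runs all the JavaScript

my $html = $mech->content;           # rendered DOM
print $_->url, "\n" for $mech->links;
```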

