I am trying to write a minimal web crawler. The aim is to discover new URLs starting from a seed URL and then crawl those newly discovered URLs as well. The code so far is as follows:

use strict;
use warnings;
use Carp;
use Data::Dumper;
use WWW::Mechanize;

my $url = "http://foobar.com"; # example
my %links;

my $mech = WWW::Mechanize->new(autocheck => 1);
$mech->get($url);
my @cr_fronteir = $mech->find_all_links();

foreach my $links (@cr_fronteir) {
    if ( $links->[0] =~ m/^http/xms ) {
        $links{$links->[0]} = $links->[1];
    }
}

I am stuck here: how do I proceed to crawl the links collected in %links, and how do I add a depth limit so the crawl doesn't run away? Suggestions are appreciated.

3 Answers


The Mojolicious web framework offers some features that are useful for web crawlers (a small sketch of the URL and DOM helpers follows the list):

  • No dependencies except Perl v5.10 or later
  • URL parser
  • DOM tree parser
  • Asynchronous HTTP/HTTPS client (allows concurrent requests with no fork() overhead)
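
To get a feel for the URL and DOM helpers in isolation, here is a tiny sketch of my own (the URL and HTML snippet are made up, not taken from any answer):

#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;

use Mojo::DOM;
use Mojo::URL;

# URL parsing and normalization
my $url = Mojo::URL->new('http://example.com/manual/intro.html?x=1#top');
say $url->host;                        # example.com
say $url->path;                        # /manual/intro.html
say $url->fragment(undef)->to_string;  # same URL with the #top fragment dropped

# DOM parsing with CSS selectors
my $dom = Mojo::DOM->new('<html><title>Hi</title><a href="/a">A</a></html>');
say $dom->at('title')->text;                      # Hi
say $_->{href} for $dom->find('a[href]')->each;   # /a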

Here is an example that recursively crawls a local copy of the Apache documentation and displays page titles and extracted links. It uses 4 parallel connections, doesn't go deeper than 3 path levels, and visits each extracted link only once:

#!/usr/bin/env perl
use 5.010;
use open qw(:locale);
use strict;
use utf8;
use warnings qw(all);

use Mojo::IOLoop;     # used directly below for the recurring timer and event loop
use Mojo::UserAgent;

# FIFO queue
my @urls = (Mojo::URL->new('http://localhost/manual/'));

# User agent following up to 5 redirects
my $ua = Mojo::UserAgent->new(max_redirects => 5);

# Track accessed URLs
my %uniq;

my $active = 0;

sub parse {
    my ($tx) = @_;

    # Request URL
    my $url = $tx->req->url;

    say "\n$url";
    say $tx->res->dom->at('html title')->text;

    # Extract and enqueue URLs
    for my $e ($tx->res->dom('a[href]')->each) {

        # Validate href attribute
        my $link = Mojo::URL->new($e->{href});
        next if 'Mojo::URL' ne ref $link;

        # "normalize" link
        $link = $link->to_abs($tx->req->url)->fragment(undef);
        next unless $link->protocol =~ /^https?$/x;

        # Don't go deeper than /a/b/c
        next if @{$link->path->parts} > 3;

        # Access every link only once
        next if ++$uniq{$link->to_string} > 1;

        # Don't visit other hosts
        next if $link->host ne $url->host;

        push @urls, $link;
        say " -> $link";
    }

    return;
}

sub get_callback {
    my (undef, $tx) = @_;

    # Parse only OK HTML responses
    $tx->res->code == 200
        and
    $tx->res->headers->content_type =~ m{^text/html\b}ix
        and
    parse($tx);

    # Deactivate
    --$active;

    return;
}

Mojo::IOLoop->recurring(
    0 => sub {

        # Keep up to 4 parallel crawlers sharing the same user agent
        for ($active .. 4 - 1) {

            # Dequeue or halt if there are no active crawlers anymore
            return ($active or Mojo::IOLoop->stop)
                unless my $url = shift @urls;

            # Fetch non-blocking just by adding
            # a callback and marking as active
            ++$active;
            $ua->get($url => \&get_callback);
        }
    }
);

# Start event loop if necessary
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;
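
The non-blocking part of the example boils down to the following pattern, pulled out here in isolation as a simplification of the code above (not part of the original script): queue a request with a callback, then run the event loop until something stops it.

#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;

use Mojo::IOLoop;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new(max_redirects => 5);

# The callback fires when the response arrives; nothing blocks in between
$ua->get('http://localhost/manual/' => sub {
    my (undef, $tx) = @_;
    if (($tx->res->code // 0) == 200) {
        my $title = $tx->res->dom->at('html title');
        say $title ? $title->text : '(no title)';
    }
    Mojo::IOLoop->stop;    # nothing else is queued, so stop the loop
});

Mojo::IOLoop->start unless Mojo::IOLoop->is_running;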

For more web scraping tips & tricks, read the article I Don’t Need No Stinking API: Web Scraping For Fun and Profit.



Can't have recursion without making it a function.

use strict;
use warnings;
use Carp; #unused, but I guess yours was a sample
use Data::Dumper;
use WWW::Mechanize;

my %links;
my $mech = WWW::Mechanize->new(autocheck => 1);

sub crawl {
    my $url = shift;
    my $depth = shift || 0;    # "shift or 0" binds too loosely and would leave $depth undef
    #this seems like a good place to assign some form of callback, so you can
    # generalize this function

    return if $depth > 10; #change as needed

    $mech->get($url);
    my @cr_fronteir = $mech->find_all_links();

    #not so sure what you're trying to do; in your original the loop variable
    # $links is easily confused with the global %links
    #perhaps you meant this...?
    foreach my $link (@cr_fronteir) {
        if ($link->[0] =~ m/^http/xms) {
            $links{$link->[0]} = $link->[1];

            #be nice to servers - try not to overload them
            sleep 3;
            #recursion!
            crawl( $link->[0], $depth + 1 );
        }
    }
}

crawl("http://foobar.com", 0);

I don't have Perl installed on this partition, so this is prone to syntax errors and other mischief, but it could serve as a basis.

As said in the first function comment: instead of hard-coding what happens with each discovered link, you can generalize your function for greater glory by passing it a callback and calling that for every link you crawl, along the lines of the sketch below.
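
A rough sketch of that idea, reusing the $mech and %links globals declared above; the crawl_cb name and its argument order are my own invention, not part of the answer:

sub crawl_cb {
    my ($url, $depth, $callback) = @_;
    return if $depth > 10;    #change as needed

    $mech->get($url);
    for my $link ($mech->find_all_links()) {
        next unless $link->url =~ m{^http}xms;

        # Let the caller decide what to do with each discovered link
        $callback->($link->url, $link->text, $depth);

        sleep 3;    # stay polite to the server
        crawl_cb($link->url, $depth + 1, $callback);
    }
}

# Rebuild the original %links mapping through the callback
crawl_cb('http://foobar.com', 0, sub {
    my ($url, $text) = @_;
    $links{$url} = $text;
});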



Some pseudocode:

while ( scalar @links ) {
    my $link = shift @links;
    process_link($link);
}

sub process_link {
    my $link = shift;

    $mech->get($link);
    foreach my $page_link ( $mech->find_all_links() ) {
        next if $links{ $page_link->url };    # already seen this URL
        $links{ $page_link->url } = 1;
        push @links, $page_link->url;
    }
}
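
Fleshed out into something runnable, the queue-based (breadth-first) approach above might look like this; the seed URL and WWW::Mechanize come from the question, while the variable names and the crude 100-page safety limit are my own assumptions:

#!/usr/bin/env perl
use strict;
use warnings;

use WWW::Mechanize;

my $mech  = WWW::Mechanize->new(autocheck => 0);
my %seen;
my @queue = ('http://foobar.com');

while (my $url = shift @queue) {
    next if $seen{$url}++;         # visit each URL only once
    last if keys %seen > 100;      # crude safety limit instead of a depth counter

    $mech->get($url);
    next unless $mech->success;

    for my $link ($mech->find_all_links()) {
        my $target = $link->url_abs->as_string;    # absolute URL as the queue entry
        push @queue, $target if $target =~ m{^http}xms && !$seen{$target};
    }
}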

P.S. The /m and /s modifiers are unnecessary in your code (and so is /x).

1 Comment

Regarding the /m, /s and /x flags: various Perl style guides recommend placing these on each and every regex. The /ms pair changes some newbie-unfriendly regex behaviours, while /x is just ever so useful ;-) I, too, always annotate my regexes with these three flags, whether directly needed or not.
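
As a quick illustration of what these flags change (my own example, not the commenter's):

use strict;
use warnings;

my $text = "first line\nsecond line";

# /m lets ^ and $ match at embedded newlines; /s lets . (and .+) match "\n" as well
my @first_words = $text =~ /^(\w+)/xmsg;         # ('first', 'second')
my ($rest)      = $text =~ /first (.+) \z/xms;   # with /s, ".+" runs across the newline

# /x allows whitespace and comments inside the pattern itself
my $is_date = '2013-01-15' =~ /
    \A \d{4} -   # year
       \d{2} -   # month
       \d{2} \z  # day
/xms;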
