
I want to develop a web crawler that starts from a seed URL, crawls 100 HTML pages belonging to the same domain as the seed URL, and keeps a record of the traversed URLs while avoiding duplicates. I have written the following, but the $url_count value does not seem to be incremented, and the retrieved URLs contain links even from other domains. How do I solve this? Here I have used stackoverflow.com as my starting URL.

use strict;
use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;


##open file to store links
open my $file1,">>", ("extracted_links.txt");
select($file1); 

##starting URL
my @urls = 'http://stackoverflow.com/';

my $browser = LWP::UserAgent->new('IE 6');
$browser->timeout(10);
my %visited;
my $url_count = 0;


while (@urls) 
{
     my $url = shift @urls;
     if (exists $visited{$url}) ##check if URL already exists
     {
         next;
     }
     else
     {
         $url_count++;
     }         

     my $request = HTTP::Request->new(GET => $url);
     my $response = $browser->request($request);

     if ($response->is_error()) 
     {
         printf "%s\n", $response->status_line;
     }
     else
     {
         my $contents = $response->content();
         $visited{$url} = 1;
         @lines = split(/\n/,$contents);
         foreach $line(@lines)
         {
             $line =~ m@(((http\:\/\/)|(www\.))([a-z]|[A-Z]|[0-9]|[/.]|[~]|[-_]|[()])*[^'">])@g;
             print "$1\n";  
             push @urls, $$line[2];
         }

         sleep 60;

         if ($visited{$url} == 100)
         {
            last;
         }
    }
}

close $file1;
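
For comparison, here is a rough sketch of how the two problems described above (the counter and the cross-domain links) might be handled with the modules the question already uses. It assumes URI is available (LWP depends on it), and the regex-based href extraction is only a stand-in for a real HTML parser such as HTML::LinkExtor, which is usually installed alongside LWP:

use strict;
use warnings;

use LWP::UserAgent;
use URI;

my $seed      = 'http://stackoverflow.com/';
my $seed_host = URI->new($seed)->host;

my $ua = LWP::UserAgent->new(agent => 'IE 6', timeout => 10);

open my $file1, '>>', 'extracted_links.txt' or die "Cannot open log: $!";

my @urls = ($seed);
my %visited;
my $url_count = 0;

while (@urls and $url_count < 100) {
    my $url = shift @urls;
    next if $visited{$url}++;              # skip URLs we have already seen

    my $response = $ua->get($url);
    unless ($response->is_success) {
        print $response->status_line, "\n";
        next;                              # failed fetches do not count
    }
    $url_count++;                          # count only successfully fetched pages

    print {$file1} "$url\n";

    # crude href extraction; a proper HTML parser would be more robust,
    # but a regex keeps the sketch short
    my $html = $response->decoded_content;
    while ($html =~ m{href\s*=\s*["']([^"'#]+)["']}gi) {
        my $link = URI->new_abs($1, $url); # resolve relative links
        next unless $link->scheme =~ /^https?$/;
        next unless $link->host eq $seed_host;   # stay on the seed's domain
        push @urls, $link->as_string;
    }
}

close $file1;
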
  • See this link to get the root domain name of the links and compare that to the root domain of your initial URL: stackoverflow.com/questions/15627892/… Commented Mar 29, 2013 at 2:59
  • Since you're going to be extracting URLs and links, start using WWW::Mechanize, which takes care of much of the drudgery for you (see the sketch after these comments). Commented Apr 4, 2013 at 4:21
  • I cannot use that because I have to run the code on a server which does not have that package, and I do not have permission to install it. Commented Apr 4, 2013 at 4:23
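
For reference, a minimal sketch of the WWW::Mechanize approach suggested above (assuming the module were available): find_all_links() and url_abs() take care of the link extraction and relative-link resolution, so only the same-domain check remains.

use strict;
use warnings;

use WWW::Mechanize;
use URI;

my $seed      = 'http://stackoverflow.com/';
my $seed_host = URI->new($seed)->host;

# autocheck => 0 so a failed GET returns an error response instead of dying
my $mech = WWW::Mechanize->new(autocheck => 0);

$mech->get($seed);
die $mech->response->status_line unless $mech->success;

# find_all_links returns WWW::Mechanize::Link objects; url_abs already
# resolves relative hrefs against the fetched page
for my $link ($mech->find_all_links()) {
    my $abs = $link->url_abs;                    # a URI object
    next unless $abs->scheme =~ /^https?$/;
    next unless $abs->host eq $seed_host;        # keep it on the seed's domain
    print $abs->as_string, "\n";
}
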

1 Answer


Several points: your URL parsing is fragile, and you certainly won't get relative links. Also, you don't test for 100 crawled links but for 100 matches of the current URL ($visited{$url} == 100), which almost certainly isn't what you mean. Finally, I'm not too familiar with LWP, so I'm going to show an example using the Mojolicious suite of tools.

This seems to work; perhaps it will give you some ideas.

#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::UserAgent;
use Mojo::URL;

##open file to store links
open my $log, '>', 'extracted_links.txt' or die $!;

##starting URL
my $base = Mojo::URL->new('http://stackoverflow.com/');
my @urls = $base;

my $ua = Mojo::UserAgent->new;
my %visited;
my $url_count = 0;

while (@urls) {
  my $url = shift @urls;
  next if exists $visited{$url};

  print "$url\n";
  print $log "$url\n";

  $visited{$url} = 1;
  $url_count++;         

  # find all <a> tags and act on each
  $ua->get($url)->res->dom('a')->each(sub{
    my $url = Mojo::URL->new($_->{href});
    if ( $url->is_abs ) {
      return unless $url->host eq $base->host;
    }
    push @urls, $url;
  });

  last if $url_count == 100;

  sleep 1;
}

3 Comments

Thanks for the reply, but I could not try out your code because the Mojolicious package is not installed here.
It's very easy to install. The one-liner is this: curl get.mojolicio.us | sh
Hi Joel, thanks for your code snippet. I think it needs a tweak to resolve relative links, though, otherwise the GET on those pages won't work. To fix it I created a variable called $baseURL to hold the starting URL (in your example 'stackoverflow.com') and then changed your code as follows: if ( $url->is_abs ) { return unless $url->host eq $base->host; } else { $url = Mojo::URL->new($baseURL)->path($_); } push @urls, $url;
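
Folding that tweak into the answer's loop might look roughly like the sketch below. It uses Mojo::URL's to_abs to resolve relative hrefs, and it resolves them against the current page rather than the seed, which is an assumption about what the commenter's $baseURL/path change was aiming at:

#!/usr/bin/env perl
use strict;
use warnings;

use Mojo::UserAgent;
use Mojo::URL;

my $base = Mojo::URL->new('http://stackoverflow.com/');
my @urls = ($base);

my $ua = Mojo::UserAgent->new;
my %visited;
my $url_count = 0;

while (@urls) {
    my $url = shift @urls;
    next if $visited{$url}++;
    $url_count++;

    print "$url\n";

    # find all <a> tags and act on each
    $ua->get($url)->res->dom('a')->each(sub {
        my $href = $_->{href} or return;                   # skip <a> without href
        my $link = Mojo::URL->new($href);
        $link = $link->to_abs($url) unless $link->is_abs;  # resolve relative links
        return unless $link->host && $link->host eq $base->host;
        push @urls, $link;
    });

    last if $url_count == 100;

    sleep 1;
}
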
