
I want to develop a web crawler that starts from a seed URL, crawls 100 HTML pages belonging to the same domain as the seed URL, and keeps a record of the traversed URLs while avoiding duplicates. I have written the following, but the $url_count value does not seem to be incremented, and the retrieved URLs contain links even from other domains. How do I solve this? Here I have used stackoverflow.com as my starting URL.

use strict;
use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;


##open file to store links
open my $file1,">>", ("extracted_links.txt");
select($file1); 

##starting URL
my @urls = 'http://stackoverflow.com/';

my $browser = LWP::UserAgent->new('IE 6');
$browser->timeout(10);
my %visited;
my $url_count = 0;


while (@urls) 
{
     my $url = shift @urls;
     if (exists $visited{$url}) ##check if URL already exists
     {
         next;
     }
     else
     {
         $url_count++;
     }         

     my $request = HTTP::Request->new(GET => $url);
     my $response = $browser->request($request);

     if ($response->is_error()) 
     {
         printf "%s\n", $response->status_line;
     }
     else
     {
         my $contents = $response->content();
         $visited{$url} = 1;
         @lines = split(/\n/,$contents);
         foreach $line(@lines)
         {
             $line =~ m@(((http\:\/\/)|(www\.))([a-z]|[A-Z]|[0-9]|[/.]|[~]|[-_]|[()])*[^'">])@g;
             print "$1\n";  
             push @urls, $$line[2];
         }

         sleep 60;

         if ($visited{$url} == 100)
         {
            last;
         }
    }
}

close $file1;
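
For comparison, here is a rough sketch of how the two problems described above (the counter and the cross-domain links) might be handled with the modules the question already uses. It assumes URI is available (LWP depends on it), and the regex-based href extraction is only a stand-in for a real HTML parser such as HTML::LinkExtor, which is usually installed alongside LWP:

use strict;
use warnings;

use LWP::UserAgent;
use URI;

my $seed      = 'http://stackoverflow.com/';
my $seed_host = URI->new($seed)->host;

my $ua = LWP::UserAgent->new(agent => 'IE 6', timeout => 10);

open my $file1, '>>', 'extracted_links.txt' or die "Cannot open log: $!";

my @urls = ($seed);
my %visited;
my $url_count = 0;

while (@urls and $url_count < 100) {
    my $url = shift @urls;
    next if $visited{$url}++;              # skip URLs we have already seen

    my $response = $ua->get($url);
    unless ($response->is_success) {
        print $response->status_line, "\n";
        next;                              # failed fetches do not count
    }
    $url_count++;                          # count only successfully fetched pages

    print {$file1} "$url\n";

    # crude href extraction; a proper HTML parser would be more robust,
    # but a regex keeps the sketch short
    my $html = $response->decoded_content;
    while ($html =~ m{href\s*=\s*["']([^"'#]+)["']}gi) {
        my $link = URI->new_abs($1, $url); # resolve relative links
        next unless $link->scheme =~ /^https?$/;
        next unless $link->host eq $seed_host;   # stay on the seed's domain
        push @urls, $link->as_string;
    }
}

close $file1;
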
  • See this link to get the root domain name of the links and compare that to the root domain of your initial URL: stackoverflow.com/questions/15627892/… Commented Mar 29, 2013 at 2:59
  • Since you're going to be extracting URLs and links, start using WWW::Mechanize, which takes care of much of the drudgery for you (see the sketch after these comments). Commented Apr 4, 2013 at 4:21
  • I cannot use that because I have to run the code on a server which does not have that package, and I do not have permission to install it. Commented Apr 4, 2013 at 4:23
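
For reference, a minimal sketch of the WWW::Mechanize approach suggested above (assuming the module were available): find_all_links() and url_abs() take care of the link extraction and relative-link resolution, so only the same-domain check remains.

use strict;
use warnings;

use WWW::Mechanize;
use URI;

my $seed      = 'http://stackoverflow.com/';
my $seed_host = URI->new($seed)->host;

# autocheck => 0 so a failed GET returns an error response instead of dying
my $mech = WWW::Mechanize->new(autocheck => 0);

$mech->get($seed);
die $mech->response->status_line unless $mech->success;

# find_all_links returns WWW::Mechanize::Link objects; url_abs already
# resolves relative hrefs against the fetched page
for my $link ($mech->find_all_links()) {
    my $abs = $link->url_abs;                    # a URI object
    next unless $abs->scheme =~ /^https?$/;
    next unless $abs->host eq $seed_host;        # keep it on the seed's domain
    print $abs->as_string, "\n";
}
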

1 Answer


Several points: your URL parsing is fragile, and you certainly won't get relative links. Also, you don't test for 100 crawled links but for 100 matches of the current URL ($visited{$url} == 100), which almost certainly isn't what you mean. Finally, I'm not too familiar with LWP, so I'm going to show an example using the Mojolicious suite of tools.

This seems to work; perhaps it will give you some ideas.

#!/usr/bin/env perl

use strict;
use warnings;

use Mojo::UserAgent;
use Mojo::URL;

##open file to store links
open my $log, '>', 'extracted_links.txt' or die $!;

##starting URL
my $base = Mojo::URL->new('http://stackoverflow.com/');
my @urls = $base;

my $ua = Mojo::UserAgent->new;
my %visited;
my $url_count = 0;

while (@urls) {
  my $url = shift @urls;
  next if exists $visited{$url};

  print "$url\n";
  print $log "$url\n";

  $visited{$url} = 1;
  $url_count++;         

  # find all <a> tags and act on each
  $ua->get($url)->res->dom('a')->each(sub{
    my $url = Mojo::URL->new($_->{href});
    if ( $url->is_abs ) {
      return unless $url->host eq $base->host;
    }
    push @urls, $url;
  });

  last if $url_count == 100;

  sleep 1;
}

3 Comments

Thanks for the reply, but I could not try out your code because the Mojolicious package is not installed here.
It's very easy to install. The one-liner is this: curl get.mojolicio.us | sh
Hi Joel, thanks for your code snippet. I think it needs a tweak to resolve relative links, though, otherwise the GET on those pages won't work. To fix it I created a variable called $baseURL to hold the starting URL (in your example 'stackoverflow.com') and then changed your code as follows: if ( $url->is_abs ) { return unless $url->host eq $base->host; } else { $url = Mojo::URL->new($baseURL)->path($_); } push @urls, $url;
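
Folding that tweak into the answer's loop might look roughly like the sketch below. It uses Mojo::URL's to_abs to resolve relative hrefs, and it resolves them against the current page rather than the seed, which is an assumption about what the commenter's $baseURL/path change was aiming at:

#!/usr/bin/env perl
use strict;
use warnings;

use Mojo::UserAgent;
use Mojo::URL;

my $base = Mojo::URL->new('http://stackoverflow.com/');
my @urls = ($base);

my $ua = Mojo::UserAgent->new;
my %visited;
my $url_count = 0;

while (@urls) {
    my $url = shift @urls;
    next if $visited{$url}++;
    $url_count++;

    print "$url\n";

    # find all <a> tags and act on each
    $ua->get($url)->res->dom('a')->each(sub {
        my $href = $_->{href} or return;                   # skip <a> without href
        my $link = Mojo::URL->new($href);
        $link = $link->to_abs($url) unless $link->is_abs;  # resolve relative links
        return unless $link->host && $link->host eq $base->host;
        push @urls, $link;
    });

    last if $url_count == 100;

    sleep 1;
}
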
