1

I just want to say that I understand that you can't parse HTML with regexes. I get that. You cannot parse HTML with regex.

I am just getting a few URLs from a web page.

The output is a little strange - there is a new line after the closing anchor tag.

<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId
=1023"><B>26165</B></A>

<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId
=1023"><B>28722</B></A>

<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId
 =1023"><B>29327</B></A>

<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId
=1023"><B>29450</B></A>

So I wrote this little script to make it neater.

#!/usr/bin/perl
use strict;
use warnings;
my $list = "/tmp/rawurl_list";
open( my $filehandle, "<", $list ) or die $!;
while (<$filehandle>) {
    s/\n//g;
    s/\<\/A\>/\n/g;
    print $_;
    if ( $_ =~ /^<A HREF="(.*)"/ ) {
        print $1;
    }
}

And this is what I get:

<A HREF="tmtrack.dll?  IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId=1023"><B>26165</B>
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId=1023"><B>28722</B>
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId=1023"><B>29327</B>
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId=1023"><B>29450</B>

But I am having trouble stripping off the <A HREF tag.

The HREF regex must be OK - it works in the one-liner.

bash-3.00$ /casper/strip | perl -nle 'print /^<A\sHREF="(.*)"/'
tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId=1023

I must be doing something wrong in the script - I need to learn why this does not strip off the HTML tags. I am posting this because I run into this error all the time and just end up using perl from the command line instead of within a script. I am not learning past this.

6
  • You can parse HTML with regex, but there are other tools that are built to do this task. Commented Nov 21, 2013 at 22:31
  • Are the substitutions OK - should they be written as something like $_ = s/\n//g; or is that implied? How could I make the logic of this script better and make it work? Commented Nov 21, 2013 at 22:31
  • $_ =~ is implied. ($_ = would do something very different.) Commented Nov 21, 2013 at 22:53
  • Is the if statement OK? I think that is the issue - does it still have the ability to read from the while loop, or is it just dangling off of the main program, waiting for input? The first two substitutions strip off the newlines and the end tags - that should leave the leading anchor tag at the front of each line. Commented Nov 21, 2013 at 23:04
  • @casper: Have you read my answer? Have you tried it? Commented Nov 22, 2013 at 2:44

3 Answers

4

Your script is only reading one line at a time; the ending " is only encountered on the following iteration of the while loop. If you want to read one link at a time, try adding:

local $/ = '</A>';

before the while loop. (See $/ in perlvar.)
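A minimal sketch of that suggestion follows. It makes a few assumptions that are not in the original script: the input is read from an in-memory string holding two links copied from the question (instead of /tmp/rawurl_list), the helper name extract_urls is hypothetical, and the capture uses [^"]* in place of (.*) so it stops at the first closing quote.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input copied from the question; each HREF value is split over two lines.
my $sample = <<'END';
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId
=1023"><B>26165</B></A>

<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId
=1023"><B>28722</B></A>
END

# extract_urls is a hypothetical helper name, not part of the original script.
sub extract_urls {
    my ($fh) = @_;
    local $/ = '</A>';       # each read now returns text up to and including </A>
    my @urls;
    while ( my $record = <$fh> ) {
        $record =~ s/\n//g;  # rejoin the HREF value that was split across lines
        push @urls, $1 if $record =~ /<A HREF="([^"]*)"/;
    }
    return @urls;
}

# Assumption: an in-memory filehandle stands in for the real file.
open( my $fh, '<', \$sample ) or die "Cannot open sample: $!";
print "$_\n" for extract_urls($fh);
close $fh;
```

Because the record separator is localized inside the sub, the rest of a program would keep reading line by line as usual.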


1

One solution is to check whether a line begins with <A and, if so, append the next line to it before running the regular expression that extracts the first capture group:

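#!/usr/bin/env perl

use warnings;
use strict;

my $list = "/tmp/rawurl_list";
open( my $filehandle, "<", $list ) or die "Cannot open $list: $!";
while (<$filehandle>) {
    chomp;    # drop the newline that splits the HREF value
    if ( m/^<A/ ) {
        # The closing quote is on the next line, so append it before matching.
        my $next = <$filehandle>;
        $_ .= $next if defined $next;
        if ( $_ =~ /^<A HREF="(.*)"/ ) {
            print "$1\n";
        }
    }
}
close $filehandle;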

It yields:

tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId =1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId=1023


1

In your code, replace s/\<\/A\>/\n/g; with s/\<\/A\>\K/\n/g; or s/(?<=<\/A>)/\n/g;

Since \K keeps everything matched before it out of the replaced text, your closing tag is not removed.
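As a small sketch of the difference, assuming Perl 5.10 or later (where \K is available), the three substitutions can be compared on the second physical line of the first link from the question. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Second physical line of the first link in the question's input.
my $line = '=1023"><B>26165</B></A>';

# Original substitution: the closing tag itself is replaced by the newline.
( my $removed = $line ) =~ s/<\/A>/\n/g;

# \K keeps what was matched before it, so only a newline is added after the tag.
( my $kept = $line ) =~ s/<\/A>\K/\n/g;

# The lookbehind form matches the empty string right after the tag.
( my $lookbehind = $line ) =~ s/(?<=<\/A>)/\n/g;

print "removed:    $removed";
print "kept:       $kept";
print "lookbehind: $lookbehind";
```

The first result loses the </A>, while the other two keep it and only gain the trailing newline.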

Note: As far as I know, you don't need to escape < and >.

Note 2: the href part of your code works only because the dot doesn't match newlines by default (.* matches the whole line, then the regex engine backtracks to find the double quote). A better way is to use a lazy quantifier instead: <A\s+HREF="(.*?)". Better still is \S*: <A\s+HREF="(\S*)" (only one backtracking step for the double quote, since a URL doesn't contain whitespace). Or use <A\s+HREF="([^"]+)", which cannot match across a double quote.
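To illustrate how the four patterns differ, here is a rough sketch run against a hypothetical tag that carries a second quoted attribute. Both the tag and the helper name first_capture are made up for the example and do not come from the question's data.

```perl
use strict;
use warnings;

# Hypothetical tag with a second quoted attribute; not from the question's data.
my $tag = '<A HREF="page.html" TARGET="_blank">link</A>';

# first_capture is an illustrative helper: returns $1 on a match, undef otherwise.
sub first_capture {
    my ( $regex, $text ) = @_;
    return $text =~ $regex ? $1 : undef;
}

my @patterns = (
    [ greedy  => qr/<A\s+HREF="(.*)"/ ],       # runs on to the last double quote
    [ lazy    => qr/<A\s+HREF="(.*?)"/ ],      # stops at the first closing quote
    [ nospace => qr/<A\s+HREF="(\S*)"/ ],      # stops at whitespace, backs up one step
    [ negated => qr/<A\s+HREF="([^"]+)"/ ],    # can never cross a double quote
);

for my $pattern (@patterns) {
    my ( $name, $regex ) = @$pattern;
    printf "%-8s %s\n", $name, first_capture( $regex, $tag );
}
```

On this input only the greedy pattern swallows the second attribute; the other three capture just page.html.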

1 Comment

Yeah - even when I change the regex in the if statement, it does not work. I think that is the problem - the way that the if statement is set up.
