parsing html with regex - using capture and $1 parameter

Question

I just want to say that I understand that you can't parse HTML with regexes. I get that. You cannot parse HTML with regex.

I am just getting a few URLs from a webpage.

The output is a little strange: there is a newline after the closing anchor tag.

<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId
=1023"><B>26165</B></A>

<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId
=1023"><B>28722</B></A>

<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId
 =1023"><B>29327</B></A>

<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId
=1023"><B>29450</B></A>

So I wrote this little script to make it neater.

#!/usr/bin/perl
use strict;
use warnings ;
my $list = "/tmp/rawurl_list";
open( my $filehandle ,"<", "$list") or die $!;
while (<$filehandle>) {
    s/\n//g;
    s/\<\/A\>/\n/g;
    print $_ ;
        if ($_ =~ /^<A HREF="(.*)"/) {
           print $1;
        }
}

And this is what I get:

<A HREF="tmtrack.dll?  IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId=1023"><B>26165</B>
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId=1023"><B>28722</B>
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId=1023"><B>29327</B>
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId=1023"><B>29450</B>

But I am having trouble stripping off the <A HREF tag.

The HREF regex must be OK, because it works in the one-liner.

bash-3.00$ /casper/strip | perl -nle 'print /^<A\sHREF="(.*)"/'
tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId=1023

I must be doing something wrong with the script. I need to learn why this does not strip off the HTML tags. I am posting this because I run into this error all the time and just end up using the Perl one-liner from the command line instead of doing it within a script. I am not learning past this.

You can parse HTML with regex, but there are other tools that are built for this task.
– Casimir et Hippolyte, Commented Nov 21, 2013 at 22:31
Are the substitutions OK? Should they be assigned with something like $_ = s/\n//g; or is that implied? How could I improve the logic of this script and make it work?
– capser, Commented Nov 21, 2013 at 22:31
$_ =~ is implied. ($_ = would do something very different.)
– ysth, Commented Nov 21, 2013 at 22:53
Is the if statement OK? I think that is the issue. Does it still read from the while loop, or is it just dangling off of the main program, waiting for input? The first two substitutions strip off the newlines and the end tags, which should leave the leading anchor tag at the front of each line.
– capser, Commented Nov 21, 2013 at 23:04
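
To illustrate ysth's comment, here is a minimal sketch of the difference between a bare s/// (which binds to $_ implicitly) and assigning the result of s/// to $_. The sub name demo and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns $_ after each of the two forms.
sub demo {
    local $_ = "abc\n";
    s/\n//g;               # same as: $_ =~ s/\n//g;  edits $_ in place
    my $implied = $_;      # the string with the newline removed

    $_ = "abc\n";
    $_ = s/\n//g;          # s/// returns the number of substitutions made,
    my $assigned = $_;     # and that count overwrites the string in $_

    return ( $implied, $assigned );
}

printf "implied: '%s', assigned: '%s'\n", demo();
```

So the substitutions in the question's script are fine as written; adding `$_ =` in front of them would replace each line with a count.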

ysth · Accepted Answer · 2013-11-21 22:23:58Z

4

Your script is only reading one line at a time; the ending " is only encountered on the following iteration of the while loop. If you want to read one link at a time, try adding:

local $/ = '</A>';

before the while(). (See $/ in perlvar.)
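
As a rough sketch of how that change could look in the question's script: the version below sets $/ to the closing anchor tag, so each read returns one whole link. It assumes the sample data from the question and reads it from an in-memory filehandle instead of /tmp/rawurl_list so that it runs on its own; the sub name extract_urls is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two records copied from the question, including the wrapped URLs.
my $raw = <<'END';
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId
=1023"><B>26165</B></A>

<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId
=1023"><B>28722</B></A>
END

# Hypothetical helper: returns the HREF value of every link in $text.
sub extract_urls {
    my ($text) = @_;
    open( my $fh, '<', \$text ) or die $!;   # in-memory file, stands in for the real one
    local $/ = '</A>';                       # a "line" now ends at the closing anchor tag
    my @urls;
    while ( my $record = <$fh> ) {
        $record =~ s/\n//g;                  # rejoin the URL that was wrapped across lines
        if ( $record =~ /^<A HREF="(.*)"/ ) {
            push @urls, $1;                  # $1 is set because the whole tag is in one record
        }
    }
    close $fh;
    return @urls;
}

print "$_\n" for extract_urls($raw);
```

The regex is the one from the question, unchanged. It matches here because the opening tag and the closing double quote now arrive in the same record.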

answered Nov 21, 2013 at 22:23

ysth


Birei · Accepted Answer · 2013-11-21 22:30:52Z

1

One solution: when a line begins with <A, append the next line to it, then apply the regular expression and print the first capture group:

#!/usr/bin/env perl

use warnings;
use strict;

my $list = "/tmp/rawurl_list";
open( my $filehandle ,"<", "$list") or die $!; 
while (<$filehandle>) {
    chomp;
    if ( m/^<A/ ) { 
        $_ .= <$filehandle>;
        if ($_ =~ /^<A HREF="(.*)"/) {
           print "$1\n";
        }       
    }   
}

It yields:

tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId =1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId=1023

edited Nov 21, 2013 at 22:30

answered Nov 21, 2013 at 22:23

Birei


Casimir et Hippolyte · Accepted Answer · 2013-11-21 22:52:30Z

1

In your code, replace s/\<\/A\>/\n/g; with s/\<\/A\>\K/\n/g; or s/(?<=<\/A>)/\n/g;

\K discards everything matched before it from the text being replaced, so your closing tag is kept and the newline is inserted after it.
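
A small sketch comparing the three substitutions on a made-up input string (the HREF values x and y and the sub name split_after_anchor are placeholders, not from the question). Note that \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: applies each substitution to its own copy of $text.
sub split_after_anchor {
    my ($text) = @_;
    ( my $plain = $text ) =~ s/<\/A>/\n/g;        # closing tag is replaced by the newline
    ( my $keep  = $text ) =~ s/<\/A>\K/\n/g;      # closing tag kept, newline added after it
    ( my $look  = $text ) =~ s/(?<=<\/A>)/\n/g;   # lookbehind form of the same idea
    return ( $plain, $keep, $look );
}

my ( $plain, $keep, $look ) =
    split_after_anchor('<A HREF="x"><B>1</B></A><A HREF="y"><B>2</B></A>');

print $plain, "---\n", $keep, "---\n", $look;
```

The \K and lookbehind versions should produce identical output, with each </A> still present at the end of its line.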

Note: As far as I know, you don't need to escape < and >.

Note 2: the href part of your code works only because the dot doesn't match newlines by default (.* matches the rest of the line, then the regex engine backtracks to find the last double quote). A better way is to use a lazy quantifier instead: <A\s+HREF="(.*?)". Better still is \S*: <A\s+HREF="(\S*)" (a URL contains no whitespace, so \S* stops at the first whitespace and then backtracks to the nearest double quote before it). Or use <A\s+HREF="([^"]+)", which cannot run past the closing double quote.
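
To make the difference between these patterns visible, here is a sketch that runs all four against one line. The TITLE attribute in the sample line is an assumption added for illustration (the question's data has no second attribute); without a second double-quoted value on the line, all four patterns capture the same text. The sub name capture_href is also made up.

```perl
use strict;
use warnings;

# Sample line with a hypothetical second attribute after HREF.
my $line = '<A HREF="tmtrack.dll?IssuePage&SolutionId=8" TITLE="issue"><B>26165</B></A>';

# Hypothetical helper: returns the first capture group, or undef if no match.
sub capture_href {
    my ( $pattern, $text ) = @_;
    return $text =~ $pattern ? $1 : undef;
}

my @patterns = (
    [ 'greedy  .*'    => qr/<A\s+HREF="(.*)"/ ],      # runs on to the last " on the line
    [ 'lazy    .*?'   => qr/<A\s+HREF="(.*?)"/ ],     # stops at the first closing "
    [ 'nonspace \S*'  => qr/<A\s+HREF="(\S*)"/ ],     # stops at whitespace, backs up to "
    [ 'negated [^"]+' => qr/<A\s+HREF="([^"]+)"/ ],   # can never cross a "
);

for my $p (@patterns) {
    my ( $name, $re ) = @$p;
    printf "%-14s %s\n", $name, capture_href( $re, $line ) // '(no match)';
}
```

On this line only the greedy pattern should overshoot into the TITLE attribute; the other three stop at the end of the URL.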

edited Nov 21, 2013 at 22:52

answered Nov 21, 2013 at 22:34

Casimir et Hippolyte

1 Comment

capser Over a year ago

Yeah, even when I change the regex in the if statement, it does not work. I think that is the problem: the way that the if statement is set up.
