Extract URLs from a list of URLs in Perl

Question

Using the following code, I am getting all the URLs in a site:

while( $html =~ m/<A HREF=\"(.*?)\"/g ) {    
      print "$1\n";  
  }

This gives me all the URLs, but I want to extract only the URLs that end with

1) .pdf

or

2) .doc

For example:

http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf

Can anyone help me? Thanks.

I assume you understand all the standard caveats about not parsing HTML with regular expressions, and have a good reason for ignoring them :-)
– Dave Cross, Commented Aug 22, 2013 at 9:44
There's a really good explanation in the accepted answer here - stackoverflow.com/questions/590747/…
– Dave Cross, Commented Aug 22, 2013 at 14:07

Toto · Accepted Answer · 2013-08-22 07:31:22Z

1

I guess you need to search case insensitive:

while( $html =~ m/<A HREF="(.*?\.(?:pdf|doc))"/ig ) {    
    print "$1\n";  
}
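
A self-contained way to try this pattern is sketched below. The sample HTML and the helper name extract_doc_links are made up for illustration. The sketch also swaps .*? for [^"]*, a change that is not part of the answer, so that a match cannot run past the closing quote.

```perl
use strict;
use warnings;

# Hypothetical helper: return every HREF value ending in .pdf or .doc.
sub extract_doc_links {
    my ($html) = @_;
    my @links;
    # [^"]* stays inside one attribute value; /i ignores case, /g finds every link.
    while ( $html =~ m/<A HREF="([^"]*\.(?:pdf|doc))"/ig ) {
        push @links, $1;
    }
    return @links;
}

# Made-up sample input with mixed-case tags and extensions.
my $html = <<'HTML';
<A HREF="http://example.com/a.pdf">a</A>
<a href="http://example.com/b.DOC">b</a>
<A HREF="http://example.com/c.xls">c</A>
HTML

print "$_\n" for extract_doc_links($html);
```

On this sample it prints the a.pdf and b.DOC links and skips c.xls. Links written with single quotes, or with other attributes before HREF, are still missed, which is one of the caveats of using a regex here.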

answered Aug 22, 2013 at 7:31


Vijay · Accepted Answer · 2013-08-22 07:32:16Z

1

 m/<A HREF=\"(.*?(.pdf|.doc))\"/g

It's working at my place:

> cat temp
<A HREF="http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf">bwfjbwej</A>
<A HREF="http://nc.casaforchildren.org/files/public/site/jobs/CSO.xls">bwfjbwej</A>
<A HREF="http://nc.casaforchildren.org/files/public/site/jobs/CSO.doc">bwfjbwej</A>

> perl -lne 'print $1 if(/<A HREF=\"(.*?(.pdf|.doc))\"/g)' temp
http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf
http://nc.casaforchildren.org/files/public/site/jobs/CSO.doc
>

edited Aug 22, 2013 at 7:32

answered Aug 22, 2013 at 7:18


2 Comments

EverythingRightPlace Over a year ago

You have to escape the ., otherwise it matches any character and not just a literal dot before pdf|doc.

backtrack Over a year ago

+1, you gave me a correct answer, but your answer is case sensitive.
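
The point about the unescaped dot can be checked with a short sketch. The two test tags are invented for illustration; the first URL has no file extension at all but happens to end in the letters "pdf". The helper name first_capture is hypothetical.

```perl
use strict;
use warnings;

# Invented inputs: one URL without any extension, one real PDF link.
my @tags = (
    '<A HREF="http://example.com/download/xpdf">viewer</A>',
    '<A HREF="http://example.com/files/report.pdf">report</A>',
);

# Return the captured URL, or undef when the tag does not match.
sub first_capture {
    my ( $re, $text ) = @_;
    return $text =~ $re ? $1 : undef;
}

my $unescaped = qr/<A HREF="(.*?(.pdf|.doc))"/;      # "." matches any character
my $escaped   = qr/<A HREF="(.*?(\.pdf|\.doc))"/;    # "\." matches a literal dot

for my $tag (@tags) {
    my $loose  = first_capture( $unescaped, $tag ) // 'no match';
    my $strict = first_capture( $escaped,   $tag ) // 'no match';
    print "unescaped: $loose\nescaped:   $strict\n";
}
```

With the unescaped pattern the xpdf link is reported as if it were a PDF; the escaped pattern skips it and still finds report.pdf.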

EverythingRightPlace · Accepted Answer · 2013-08-22 07:35:47Z

1

If your grouping (.*?) matches all URLs, you should go with:

while( $html =~ m/<A HREF=\"(.*?(\.pdf|\.doc))\"/g ) {    
      print "$1\n";  
  }

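Be aware that this also matches a bare .pdf with nothing before the dot, which might not be what you are looking for. The pattern .*? is non-greedy, but . still matches almost any character, so the match can run past the closing quote; quite dangerous imo.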

/edit

I tried it on http://regexpal.com/

\b(.*(\.pdf|\.doc))\b

for

http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf
http://nc.casaforchildren.org/files/public/site/jobs/CSO.doc
http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdd
.pdf
http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdfawd

It matches just the first two URLs.
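
The same check can be reproduced in Perl. This is a sketch that tests each line separately, which assumes regexpal was treating the input line by line; the helper name matches_doc_url is hypothetical.

```perl
use strict;
use warnings;

# Pattern from the answer, applied to one candidate string at a time.
sub matches_doc_url {
    my ($candidate) = @_;
    return $candidate =~ /\b(.*(\.pdf|\.doc))\b/ ? 1 : 0;
}

my @candidates = qw(
    http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf
    http://nc.casaforchildren.org/files/public/site/jobs/CSO.doc
    http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdd
    .pdf
    http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdfawd
);

for my $candidate (@candidates) {
    printf "%-8s %s\n", ( matches_doc_url($candidate) ? 'match' : 'no match' ), $candidate;
}
```

The trailing \b is what rejects CSO.pdfawd, and the leading \b is what rejects the bare .pdf line, since there is no word character before its dot.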

edited Aug 22, 2013 at 7:35

answered Aug 22, 2013 at 7:21


7 Comments

EverythingRightPlace Over a year ago

What are you observing? It matches all URLs once again?

backtrack Over a year ago

@bashophil I saw a blank black screen.

EverythingRightPlace Over a year ago

You could try (.*\.pdf|.*\.doc). Btw, I would suggest adding boundaries around your pattern: \b

backtrack Over a year ago

+1, you gave me a correct answer, but your answer is case sensitive.

EverythingRightPlace Over a year ago

@Backtrack The case-insensitivity option will just additionally match extensions such as PDF or pDf. I am not sure, but I guess this is irrelevant. The part before pdf will match regardless because .* is greedy. Btw, .*? can match the same range as .* (anything, zero up to unlimited times); it only prefers the shortest match instead of the longest.
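
Whether .* and .*? behave alike is easiest to see on input with two links on one line. A minimal sketch, using an invented one-line snippet:

```perl
use strict;
use warnings;

# Invented input: two links on a single line.
my $html = '<A HREF="http://example.com/a.pdf">a</A> <A HREF="http://example.com/b.pdf">b</A>';

# Greedy: .* grabs as much as possible, so one match spans both tags.
my @greedy = $html =~ m/<A HREF="(.*\.pdf)"/g;

# Lazy: .*? stops at the first place the rest of the pattern fits.
my @lazy = $html =~ m/<A HREF="(.*?\.pdf)"/g;

print scalar(@greedy), " greedy match(es)\n";
print "  $_\n" for @greedy;
print scalar(@lazy), " lazy match(es)\n";
print "  $_\n" for @lazy;
```

The greedy version returns a single capture that runs from the first URL through the end of the second, while the lazy version returns the two URLs separately.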
