0

Using the following code, I'm getting all the URLs in a site:

while( $html =~ m/<A HREF=\"(.*?)\"/g ) {    
      print "$1\n";  
  }

This gives me all the URLs, but I want to extract only the URLs that end with

1) .pdf

or

2) .doc

For example:

http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf

Can anyone help me? Thanks.

6
  • Why are you searching for " ? Commented Aug 22, 2013 at 7:11
  • I'm constructing a spider. Commented Aug 22, 2013 at 7:12
  • I assume you understand all the standard caveats about not parsing HTML with regular expressions, and have a good reason for ignoring them :-) Commented Aug 22, 2013 at 9:44
  • @DaveCross, could you kindly explain? Commented Aug 22, 2013 at 11:27
  • There's a really good explanation in the accepted answer here - stackoverflow.com/questions/590747/… Commented Aug 22, 2013 at 14:07

3 Answers

1

I guess you need a case-insensitive search:

while( $html =~ m/<A HREF="(.*?\.(?:pdf|doc))"/ig ) {    
    print "$1\n";  
}
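
A self-contained sketch of the same idea, assuming the page has already been read into $html (the sample markup below is made up for illustration). It swaps .*? for [^"]* so that a match cannot run past the closing quote; that swap and the doc_links helper are suggestions, not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical sample input; in the spider this would be the fetched page.
my $html = <<'HTML';
<A HREF="http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf">a</A>
<a href="http://nc.casaforchildren.org/files/public/site/jobs/CSO.XLS">b</a>
<a href="http://nc.casaforchildren.org/files/public/site/jobs/CSO.DOC">c</a>
HTML

# Return every HREF value that ends in .pdf or .doc, ignoring case.
# doc_links is an illustrative helper name, not something from the question.
sub doc_links {
    my ($text) = @_;
    my @links;
    while ( $text =~ m/<A\s+HREF="([^"]*\.(?:pdf|doc))"/ig ) {
        push @links, $1;
    }
    return @links;
}

print "$_\n" for doc_links($html);
```

With the sample input, only the .pdf and .DOC links are printed; the .XLS link is skipped.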

1
 m/<A HREF=\"(.*?(.pdf|.doc))\"/g

It's working for me:

> cat temp
<A HREF="http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf">bwfjbwej</A>
<A HREF="http://nc.casaforchildren.org/files/public/site/jobs/CSO.xls">bwfjbwej</A>
<A HREF="http://nc.casaforchildren.org/files/public/site/jobs/CSO.doc">bwfjbwej</A>

> perl -lne 'print $1 if(/<A HREF=\"(.*?(.pdf|.doc))\"/g)' temp
http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf
http://nc.casaforchildren.org/files/public/site/jobs/CSO.doc
>

2 Comments

You have to escape the . otherwise it matches any character and not a literal dot before pdf|doc.
+1, you gave me a correct answer, but your answer is case sensitive.
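
The escaping point in the first comment can be checked with a small sketch. The file names and the ext_matches helper are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical file names: the second one has no dot before "pdf".
my @names = ( 'report.pdf', 'reportxpdf' );

# Report whether a name matches the unescaped and the escaped pattern.
sub ext_matches {
    my ($name) = @_;
    my $unescaped = ( $name =~ /.pdf$/ )  ? 1 : 0;   # . matches any character
    my $escaped   = ( $name =~ /\.pdf$/ ) ? 1 : 0;   # \. matches a literal dot
    return ( $unescaped, $escaped );
}

for my $name (@names) {
    my ( $unescaped, $escaped ) = ext_matches($name);
    print "${name}: unescaped=$unescaped escaped=$escaped\n";
}
```

The unescaped pattern accepts both names, while the escaped one accepts only report.pdf.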
1

If your grouping (.*?) matches all URLs, you should go with:

while( $html =~ m/<A HREF=\"(.*?(\.pdf|\.doc))\"/g ) {    
      print "$1\n";  
  }

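Be aware that this also matches a bare .pdf, which might not be what you are looking for. The pattern .*? is lazy rather than greedy, but it is still quite dangerous imo, since it can run past the closing quote and into a later link.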
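
To illustrate the risk with .*?, here is a small sketch on a made-up line containing two links. The lazy quantifier keeps extending past the first closing quote until it finds .pdf", while [^"]* (a suggested alternative, not taken from the answer) cannot cross a quote.

```perl
use strict;
use warnings;

# Hypothetical input: two links on one line, only the second is a PDF.
my $line = '<A HREF="a.xls">x</A> <A HREF="b.pdf">y</A>';

# Lazy dot: starts at the first HREF and grows until ".pdf" is followed by a quote.
my ($lazy) = $line =~ m/<A HREF="(.*?(?:\.pdf|\.doc))"/;

# Negated class: the capture can never contain a double quote.
my ($strict) = $line =~ m/<A HREF="([^"]*(?:\.pdf|\.doc))"/;

print "lazy:   $lazy\n";
print "strict: $strict\n";
```

Here the lazy version captures everything from a.xls through b.pdf, including the markup in between, whereas the stricter version captures just b.pdf.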

/edit

I tried it on http://regexpal.com/

\b(.*(\.pdf|\.doc))\b

for

http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdf
http://nc.casaforchildren.org/files/public/site/jobs/CSO.doc
http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdd
.pdf
http://nc.casaforchildren.org/files/public/site/jobs/CSO.pdfawd

It matches just the first two URLs.

7 Comments

What are you observing? It matches all URLs once again?
@bashophil, I saw a blank black screen.
You could try (.*\.pdf|.*\.doc). Btw, I would suggest adding word boundaries (\b) around your pattern.
+1, you gave me a correct answer, but your answer is case sensitive.
@Backtrack the case-insensitivity option will just additionally match extensions such as PDF or pDf. I am not sure, but I guess this is irrelevant. The stuff before pdf will match nevertheless because .* is greedy. Btw, .*? allows the same range of repetitions as .* (anything, zero up to unlimited times); it just prefers the shortest match instead of the longest.