PHP: regex search a pattern in a file and pick it up

Question

I am really confused by regular expressions in PHP.

I can't read a whole tutorial right now because I have a bunch of HTML files that I need to find links in as soon as possible. I came up with the idea of automating it with PHP, since that is the language I know.

So I think I can use this script:

$address = "file.txt"; 
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = "??????????"; 
if(preg_match_all("/$regexp/siU", $input, $matches)) { 
    // $matches[2] = array of link addresses 
    // $matches[3] = array of link text - including HTML code 
}

My problem is with $regexp

My required pattern is like this:

href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF

I want to search for and extract /content/r807215r37l86637/fulltext.pdf from lines like the one above; the files contain many of them.

Any help?

==================

edit

The title attributes are important to me: all the links I want have the title

title="Download PDF"

Community · Accepted Answer · 2017-05-23 11:55:43Z

5

Once again: regular expressions are a bad fit for parsing HTML.

Save your sanity and use the built-in DOM libraries.

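$dom = new DOMDocument();
// The @ suppresses the warnings libxml raises on malformed real-world HTML.
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
$data = array();
// Select every <a> whose title attribute is exactly "Download PDF".
foreach ($x->query("//a[@title='Download PDF']") as $node) {
    $data[] = $node->getAttribute("href");
}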

Edit: updated the code based on ircmaxell's comment.
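For anyone who wants to try the DOM approach end to end, here is a minimal sketch. The function name extractPdfLinks and the sample markup are assumptions made for illustration (the markup is modelled on the snippet in the question), and libxml_use_internal_errors() is used in place of the @ operator to keep parser warnings quiet.

```php
<?php
// Hypothetical helper: returns the href of every <a> whose title is exactly "Download PDF".
function extractPdfLinks(string $html): array
{
    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = array();
    foreach ($xpath->query("//a[@title='Download PDF']") as $node) {
        $links[] = $node->getAttribute("href");
    }
    return $links;
}

// Sample markup modelled on the snippet in the question (an assumption, not real data).
$html = '<p><a href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF">PDF</a>'
      . '<a href="/about.html" title="About">About</a></p>';

print_r(extractPdfLinks($html));
```

With real files, $html would come from file_get_contents() as in the question's script.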

edited May 23, 2017 at 11:55


answered Feb 11, 2011 at 20:25

Byron Whitlock



5 Comments

ircmaxell Over a year ago

Uhh, why XPath if you're only going to do a node-name search? Why not just $dom->getElementsByTagName('a');? I could understand XPath if you did $x->query('//a[contains(@title, "Download PDF")]'); which would return exactly the matching links... ;-)

Byron Whitlock Over a year ago

@ircmaxell, you are exactly right. getElementsByTagName() is probably a more efficient way to do it.

Byron Whitlock Over a year ago

@safaali In the query, change @title='Download PDF' to @class='nameOfClass', or use contains(@title, 'Download PDF'). contains() will match the links even if the attribute has extra text in it.

Alireza Over a year ago

Thank you! Should I install the DOMDocument class? I am using XAMPP 1.7.3 on my localhost.

Byron Whitlock Over a year ago

@safaali, all the DOM libraries are built in. No need to install anything.
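The difference between @title='...' and contains(@title, '...') mentioned in the comments can be seen in a small sketch. The three sample links and the helper name hrefsFor are made up for illustration. Note that XPath string comparisons are case-sensitive, so a title of "Download Pdf" matches neither query.

```php
<?php
// Made-up sample: an exact title, a title with extra text, and a differently cased title.
$html = '<a href="/a.pdf" title="Download PDF">A</a>'
      . '<a href="/b.pdf" title="Download PDF (1.2 MB)">B</a>'
      . '<a href="/c.pdf" title="Download Pdf">C</a>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: runs a query and returns the href of each matched node.
function hrefsFor(DOMXPath $xpath, string $query): array
{
    $hrefs = array();
    foreach ($xpath->query($query) as $node) {
        $hrefs[] = $node->getAttribute("href");
    }
    return $hrefs;
}

// Exact comparison: only the first link qualifies.
$exact = hrefsFor($xpath, "//a[@title='Download PDF']");

// Substring test: the first two links qualify.
$contains = hrefsFor($xpath, "//a[contains(@title, 'Download PDF')]");

print_r($exact);
print_r($contains);
```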

mario · Accepted Answer · 2011-02-11 20:32:01Z

1

That's easier with phpQuery or QueryPath:

foreach (qp($html)->find("a") as $a) { 
    if ($a->attr("title") == "Download PDF") {
        print $a->attr("href");
        print $a->innerHTML();
    }
}

With regexps it depends on some consistency of the source:

preg_match_all('#<a[^>]+href="([^>"]+)"[^>]+title="Download PDF"[^>]*>(.*?)</a>#sim', $input, $m);

Looking for a fixed title="..." attribute is doable but more fragile: this pattern only matches when title comes after href and before the closing bracket of the tag.
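One way to loosen the dependence on attribute order, if a regex is still preferred, is to work in two steps: first grab each opening <a ...> tag, then test the title and extract the href from that tag separately. This is only a sketch under stated assumptions: attribute values are double-quoted and contain no > character. The function name pdfLinksByRegex and the sample input are hypothetical.

```php
<?php
// Hypothetical helper: attribute order inside the tag does not matter here.
function pdfLinksByRegex(string $input): array
{
    $links = array();
    // Step 1: grab every opening <a ...> tag.
    preg_match_all('#<a\s[^>]*>#i', $input, $tags);
    foreach ($tags[0] as $tag) {
        // Step 2: keep the tag only if it carries the wanted title,
        // then pull out its href value.
        if (preg_match('#\stitle="Download PDF"#', $tag)
            && preg_match('#\shref="([^"]*)"#i', $tag, $m)) {
            $links[] = $m[1];
        }
    }
    return $links;
}

// Made-up sample: href before title, title before href, and a non-matching link.
$input = '<a href="/one.pdf" title="Download PDF">one</a>'
       . '<a title="Download PDF" href="/two.pdf">two</a>'
       . '<a href="/three.html" title="Other">three</a>';

print_r(pdfLinksByRegex($input));
```

This still breaks on single-quoted or unquoted attributes, which is part of why the DOM-based answers are more robust.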

edited Feb 11, 2011 at 20:32

answered Feb 11, 2011 at 20:26

mario


5 Comments

mario Over a year ago

@Byron: Some people have aversions to needlessly cumbersome APIs.

Byron Whitlock Over a year ago

@mario Have you actually tried DOM parsing with the built-in libs? I'll admit, the PHP site docs on DOM are cumbersome. I resisted at first too, until I saw the light. It really is easy. If you know XPath, DOMXPath::query is all you need.

mario Over a year ago

@Byron: Tried and used. But much like raw JavaScript DOM methods, I'm avoiding it.

Byron Whitlock Over a year ago

@mario fair enough. Just curious, do either of those libs use the built in php dom under the hood?

mario Over a year ago

@Byron: Actually all of them do (phpQuery, QueryPath, FluentDom) as far as I'm aware. Though QP comes with its own alternative parser for more quirky HTML.

Ondrej Skalicka · Accepted Answer · 2011-02-11 20:32:07Z

1

Try something like this. If it does not work, show some examples of the links you want to parse.

<?php
$address = "file.txt"; 
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = '#<a[^>]*href="([^"]*)"[^>]*title="Download PDF"#'; 

if(preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) { 
  foreach ($matches as $match) {
    printf("Url: %s<br/>", $match[1]);
  }
}

Edit: updated so it searches for "Download PDF" entries only.

edited Feb 11, 2011 at 20:32

answered Feb 11, 2011 at 20:25

Ondrej Skalicka


3 Comments

Ondrej Skalicka Over a year ago

If you want just those with "Download PDF" in the title, update $regexp to '#<a[^>]*href="([^"]*)"[^>]*title="Download PDF"#' (in which case you don't need $match[2]).

Byron Whitlock Over a year ago

While this works for this example, with few exceptions you should not parse HTML with a regexp. You will lose your sanity quickly that way (see the link in my post).

Ondrej Skalicka Over a year ago

Yeah, probably, but when you want something parsed quickly, even HTML, it comes in handy.

ircmaxell · Accepted Answer · 2011-02-11 20:37:06Z

1

The best way is to use DOMXPath to do the search in one step:

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$links = array();
foreach($xpath->query('//a[contains(@title, "Download PDF")]') as $node) {
    $links[] = $node->getAttribute("href");
}

Or even:

$links = array();
$query = '//a[contains(@title, "Download PDF")]/@href';
foreach($xpath->evaluate($query) as $attr) {
    $links[] = $attr->value;
}
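Since the question mentions a bunch of HTML files, the same query can be wrapped in a loop over the files. The sketch below is one possible arrangement, not part of the original answer: the function name pdfLinksInFiles and the pages/*.html location are assumptions.

```php
<?php
// Hypothetical helper: maps each file matched by $pattern to the hrefs found in it.
function pdfLinksInFiles(string $pattern): array
{
    $result = array();
    // Keep libxml warnings about sloppy markup out of the output.
    libxml_use_internal_errors(true);
    foreach (glob($pattern) ?: array() as $file) {
        $html = file_get_contents($file);
        if ($html === false || trim($html) === '') {
            continue; // unreadable or empty file
        }
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        $xpath = new DOMXPath($dom);
        // Select the href attribute nodes directly, as in the second variant above.
        foreach ($xpath->query('//a[contains(@title, "Download PDF")]/@href') as $attr) {
            $result[$file][] = $attr->value;
        }
    }
    libxml_clear_errors();
    return $result;
}

// Usage: the directory name is an assumption.
foreach (pdfLinksInFiles(__DIR__ . '/pages/*.html') as $file => $links) {
    echo $file, ":\n  ", implode("\n  ", $links), "\n";
}
```

Files without any matching link simply do not appear in the returned array.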

answered Feb 11, 2011 at 20:37

ircmaxell


Comments

Blindy · Accepted Answer · 2011-02-11 20:22:10Z

0

href="([^"]+)" will get you all the links of that form.

answered Feb 11, 2011 at 20:22

Blindy


1 Comment

Alireza Over a year ago

Thank you, but there are many hrefs in the file; I want only the links that are titled "Download PDF".
