I am really confused by regular expressions in PHP.

I can't read a whole tutorial right now, because I have a bunch of HTML files and need to find the links in them ASAP. I came up with the idea of automating it with PHP, since that is the language I know.

So I think I can use this script:

$address = "file.txt"; 
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = "??????????"; 
if(preg_match_all("/$regexp/siU", $input, $matches)) { 
    // $matches[2] = array of link addresses 
    // $matches[3] = array of link text - including HTML code 
} 

My problem is with $regexp.

The pattern I need to match looks like this:

href="/content/r807215r37l86637/fulltext.pdf" title="Download PDF"

I want to find and extract /content/r807215r37l86637/fulltext.pdf from lines like the one above; there are many of them in the files.

Any help?

==================

Edit

The title attributes matter to me: every link I want is titled

title="Download PDF"

5 Answers

5

Once again, regexps are bad for parsing HTML.

Save your sanity and use the built-in DOM libraries.

$dom = new DOMDocument();
@$dom->loadHTML($html);
$x = new DOMXPath($dom);
$data = array();
foreach($x->query("//a[@title='Download PDF']") as $node)
{
    $data[] = $node->getAttribute("href");
}

Edit: Updated code based on ircmaxell's comment.

5 Comments

Uhhh. Why XPath if you're going to do only a node-name search? Why not just $dom->getElementsByTagName('a');? I could understand XPath if you did $x->query('//a[contains(@title, "Download PDF")]'); which would return the exact match... ;-)
@ircmaxell, you are exactly right. getElementsByTagName() is probably a more efficient way to do it.
@safaali in the query, change @title='Download PDF' to @class='nameOfClass' or use contains(@title, 'Download PDF'). contains() will match the titles even if they have extra text in them.
Thank you! Should I install the DOMDocument class? I am using XAMPP 1.7.3 on my localhost.
@safaali, all the DOM libraries are built in. No need to install anything.
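To make the difference between the variants discussed in these comments concrete, here is a minimal sketch. The sample markup is invented for illustration (it is not from the asker's files), and libxml_use_internal_errors() is used in place of the @ operator to keep parser warnings quiet. It compares a plain getElementsByTagName() filter, an exact @title match, and contains().

```php
<?php
// Hypothetical sample markup, invented for illustration only.
$html = '<html><body>'
      . '<a href="/content/a/fulltext.pdf" title="Download PDF">PDF</a>'
      . '<a href="/content/b/fulltext.pdf" title="Download PDF (1.2 MB)">PDF</a>'
      . '<a href="/about.html" title="About">About</a>'
      . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Variant 1: plain tag-name search, filtering on the title attribute in PHP.
$byTagName = array();
foreach ($dom->getElementsByTagName('a') as $node) {
    if ($node->getAttribute('title') === 'Download PDF') {
        $byTagName[] = $node->getAttribute('href');
    }
}

$xpath = new DOMXPath($dom);

// Variant 2: exact comparison, only titles equal to "Download PDF".
$exact = array();
foreach ($xpath->query("//a[@title='Download PDF']") as $node) {
    $exact[] = $node->getAttribute('href');
}

// Variant 3: contains() also accepts titles with extra text around the phrase.
$loose = array();
foreach ($xpath->query("//a[contains(@title, 'Download PDF')]") as $node) {
    $loose[] = $node->getAttribute('href');
}

print_r($byTagName);
print_r($exact);
print_r($loose);
```

With this sample, the first two variants should return only the first link, while contains() should also pick up the link whose title carries the extra "(1.2 MB)" text.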
1

That's easier with phpQuery or QueryPath:

foreach (qp($html)->find("a") as $a) { 
    if ($a->attr("title") == "Download PDF") {
        print $a->attr("href");
        print $a->innerHTML();
    }
}

With regexps it depends on some consistency of the source:

preg_match_all('#<a[^>]+href="([^>"]+)"[^>]+title="Download PDF"[^>]*>(.*?)</a>#sim', $input, $m);

Matching a fixed title="..." attribute is doable, but more fragile, because the pattern depends on where the attribute sits relative to href and the closing bracket.
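If the attribute order is not consistent, one possible workaround (a sketch, not a robust HTML parser) is to split the work in two: first grab the attribute part of each opening a tag, then test title and href separately. The sample input is made up, and the sketch assumes double-quoted attribute values and no ">" inside them.

```php
<?php
// Hypothetical sample: the second link puts title before href.
$input = '<a href="/content/a/fulltext.pdf" title="Download PDF">one</a>'
       . '<a title="Download PDF" class="x" href="/content/b/fulltext.pdf">two</a>'
       . '<a href="/other.html" title="Other">three</a>';

$links = array();

// Step 1: capture the attribute part of every opening <a ...> tag.
if (preg_match_all('#<a\s+([^>]*)>#i', $input, $tags)) {
    foreach ($tags[1] as $attrs) {
        // Step 2: test each attribute on its own, so their order is irrelevant.
        // (?:^|\s) keeps names such as data-href from matching as href.
        if (preg_match('#(?:^|\s)title="Download PDF"#', $attrs)
            && preg_match('#(?:^|\s)href="([^"]*)"#i', $attrs, $m)) {
            $links[] = $m[1];
        }
    }
}

print_r($links);
```

This still breaks on single-quoted or unquoted attributes, which is the usual argument for a DOM parser.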

5 Comments

@Byron: Some people have aversions to needlessly cumbersome APIs.
@mario Have you actually tried DOM parsing with the built-in libs? I'll admit, the PHP site docs on DOM are cumbersome. I resisted at first too, until I saw the light. It really is easy. If you know XPath, DOMXPath::query is all you need.
@Byron: Tried and used. But much like raw JavaScript DOM methods, I'm avoiding it.
@mario Fair enough. Just curious, do either of those libs use the built-in PHP DOM under the hood?
@Byron: Actually all of them do (phpQuery, QueryPath, FluentDom), as far as I'm aware. Though QP comes with its own alternative parser for more quirky HTML.
1

Try something like this. If it does not work, show some examples of the links you want to parse.

<?php
$address = "file.txt"; 
$input = @file_get_contents($address) or die("Could not access file: $address");
$regexp = '#<a[^>]*href="([^"]*)"[^>]*title="Download PDF"#'; 

if(preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) { 
  foreach ($matches as $match) {
    printf("Url: %s<br/>", $match[1]);
  }
} 

Edit: updated so it searches for "Download PDF" entries only.

3 Comments

If you want just those with "Download PDF" in the title, update the $regexp to '#<a[^>]*href="([^"]*)"[^>]*title="Download PDF"#' (in which case you don't need $match[2]).
While this works for this example, with a few exceptions you should not parse HTML with a regexp. You will lose your sanity quickly that way (see the link in my post).
Yeah, probably yes, but when you want something quickly parsed, even html, it comes in handy.
1

The best way is to use DomXPath to do the search in one step:

$dom = new DomDocument();
$dom->loadHTML($html);
$xpath = new DomXPath($dom);

$links = array();
foreach($xpath->query('//a[contains(@title, "Download PDF")]') as $node) {
    $links[] = $node->getAttribute("href");
}

Or even:

$links = array();
$query = '//a[contains(@title, "Download PDF")]/@href';
foreach($xpath->evaluate($query) as $attr) {
    $links[] = $attr->value;
}
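Since the question mentions a whole bunch of files, the same query can be wrapped in a small helper and run over each of them. This is only a sketch: extractPdfLinks() is a name made up here, and the pages directory with *.html files is an assumption about where the files live.

```php
<?php
// Hypothetical helper: returns the href of every <a> whose title
// contains "Download PDF" in the given HTML string.
function extractPdfLinks($html) {
    libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
    $links = array();
    $dom = new DOMDocument();
    // Skip empty input, which loadHTML() rejects.
    if ($html !== '' && $dom->loadHTML($html)) {
        $xpath = new DOMXPath($dom);
        foreach ($xpath->query('//a[contains(@title, "Download PDF")]/@href') as $attr) {
            $links[] = $attr->value;
        }
    }
    libxml_clear_errors();
    return $links;
}

// "pages" is an assumed directory name; point it at the real folder.
$files = glob('pages/*.html') ?: array();

$all = array();
foreach ($files as $file) {
    $html = file_get_contents($file);
    if ($html === false) {
        continue; // unreadable file, skip it
    }
    $all[$file] = extractPdfLinks($html);
}

print_r($all);
```

The result is an array keyed by file name, so it stays clear which file each link came from.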

0

href="([^"]+)" will get you all the links of that form.

1 Comment

Thank you, but there are many hrefs in the file. I want only the links that are titled "Download PDF".
