I'm trying to extract all the URLs from a string with a script that looks like this:

$file = file_get_contents('something.txt');

function getUrls($string) {
    preg_match_all('~href=("|\')(.*?)\1~', $string, $out);
    print_r($out);
}

getUrls($file);

The URLs in this document may be imperfect, e.g. shortened ones like "/blah/blah.asp?2". The problem is that when I run this script, I get an array that looks something like this:

Array
(
    [0] => Array
        (
            [0] => href="#A"
            [1] => href="#B"
            [2] => href="#C"
        )

    [1] => Array
        (
            [0] => "
            [1] => "
            [2] => "
        )

    [2] => Array
        (
            [0] => #A
            [1] => #B
            [2] => #C

        )

)

Any idea what could be going on here? I don't understand why it returns an alphabetical list prefixed with hash signs instead of the URLs I want. How can I return just the URLs?

  • There are hundreds of questions like that Commented Jun 27, 2013 at 21:41
  • I've been through them, mostly they address situations involving perfect urls like http : //www.example.com not the shortened ones I'm looking for. I've tried numerous solutions - no dice. Commented Jun 27, 2013 at 21:43
  • Print the contents of something.txt Commented Jun 27, 2013 at 21:52

1 Answer

The way of evil:

$file = file_get_contents('something.txt');    

function displayUrls($string) {
    $pattern = '~\bhref\s*+=\s*+["\']?+\K(?!#)[^\s"\'>]++~';
    preg_match_all($pattern, $string, $out);
    print_r($out[0]);
}

displayUrls($file);
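
As a rough check of what that pattern accepts and rejects, here is a small sketch run against an inline string. The sample markup and the name extractUrls are made up for illustration (the real something.txt was never posted); only the pattern itself comes from the code above.

```php
<?php
// Hypothetical sample input standing in for something.txt.
$sample = '<a href="#A">top</a> <a href="/blah/blah.asp?2">rel</a> '
        . "<a href='http://www.example.com/'>abs</a> <a href=plain.html>bare</a>";

// Same pattern as above, but returning the matches instead of printing them.
function extractUrls(string $string): array {
    // \K drops 'href=' and the optional quote from the match;
    // (?!#) rejects values that start with a hash, i.e. in-page anchors.
    $pattern = '~\bhref\s*+=\s*+["\']?+\K(?!#)[^\s"\'>]++~';
    preg_match_all($pattern, $string, $out);
    return $out[0];
}

$urls = extractUrls($sample);
print_r($urls);
```

With this sample, the "#A" link is skipped and the double-quoted, single-quoted and unquoted values are all picked up. A value containing whitespace would be cut at the first space, which is one reason this is the "evil" way.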

The good way:

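$doc = new DOMDocument();
// @ silences the warnings libxml raises on imperfect HTML
@$doc->loadHTMLFile('something.txt');
$links = $doc->getElementsByTagName('a');
$result = []; // initialize so print_r() works even when nothing matches
foreach ($links as $link) {
    $href = $link->getAttribute('href');
    // getAttribute() returns '' when href is missing; skip those and "#..." fragments
    if ($href !== '' && $href[0] !== '#') {
        $result[] = $href;
    }
}
print_r($result);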
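
The same idea can be tried without a file by parsing a string. This is only a sketch: the sample markup and the function name extractLinks are assumptions for illustration, and it swaps the @ operator for libxml_use_internal_errors(), which is another common way to keep parser warnings quiet.

```php
<?php
// Hypothetical sample markup standing in for something.txt.
$html = '<p><a href="#A">top</a> <a href="/blah/blah.asp?2">rel</a> '
      . '<a name="anchor-only">no href</a> '
      . '<a href="http://www.example.com/">abs</a></p>';

function extractLinks(string $html): array {
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        // Skip missing or empty hrefs and in-page fragment links such as "#A".
        if ($href !== '' && $href[0] !== '#') {
            $result[] = $href;
        }
    }
    return $result;
}

$links = extractLinks($html);
print_r($links);
```

Here the "#A" link and the anchor without an href are both dropped, leaving the relative and absolute URLs.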

2 Comments

+1, I love answers like this... instead of preaching about how regex is bad for parsing text like html, this answer is like a Rosetta stone and helps teach a better way.
Wow, yeah this is a fantastic answer. I wish I could vote it up twice.
