I have the following code snippet, which parses my blog site and stores some information in variables:

global $articles;

$items = $html->find('div[class=blogpost]'); 

foreach($items as $post) {
    $articles[] = array($post->children(0)->innertext,
                        $post->children(1)->first_child()->outertext);
}

foreach($articles as $item) {
    echo $item[0]; 
    echo $item[1];
    echo "<br>";
}

The above code outputs the following:

Title of blog post 1 <script type="text/javascript">execute_function(3,'')</script><a href="http://www.example.com/cool_news" id="963"  target="_blank" >Click here for news</a> &nbsp;<img src="/news.gif" width="12" height="12" title="validated" /><span class="title">
Title of blog post 2 <script type="text/javascript">execute_function(3,'')</script><a href="http://www.example.com/neato" id="963"  target="_blank" >Click here for neato</a> &nbsp;<img src="/news.gif" width="12" height="12" title="validated" /><span class="title">
Title of blog post 3 <script type="text/javascript">execute_function(3,'')</script><a href="http://www.example.com/lame" id="963"  target="_blank" >Click here for lame</a> &nbsp;<img src="/news.gif" width="12" height="12" title="validated" /><span class="title">

with $item[0] containing "Title of blog post X" and $item[1] containing the rest.

What I want to do is parse $item[1] and retain only the URL contained within it as a separate variable. Perhaps I am not phrasing my question correctly, but I cannot find anything that can help me figure this out.

Can anyone help me?

  • Use preg_match with something like preg_match('/href="(.*?)"/si', $source, $match); to get the href value in the string. Commented Dec 21, 2012 at 20:00
  • You're already parsing the HTML with a proper parser. You want to continue parsing on the <a> tag. You're doing it the right way now. Don't resort to regular expressions! Commented Dec 21, 2012 at 20:21
  • The thing is, I don't know how to parse it further. The parser I am using doesn't appear to be supported any longer: net.tutsplus.com/tutorials/php/… Commented Dec 21, 2012 at 20:25

1 Answer

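If you parse $item[1] into the same kind of DOM crawler object you are using for $html (called $fragment here), and that object accepts XPath, you could use the following query. Note that XPath positions start at 1, not 0:

$fragment->find('//a[1]/@href');

which should return the attribute node

href="http://www.example.com/cool_news"

Then extract the URL however you want, either in PHP or by refining the XPath query. Where the library can evaluate XPath expressions rather than only select nodes, wrapping the query in XPath's string() function, as in string(//a[1]/@href), yields the attribute value alone.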
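For the XPath route, PHP's built-in DOM extension can stand in for the crawler object. The following is only a minimal sketch under a few assumptions: it uses DOMDocument and DOMXPath instead of the parser from the question, the sample fragment is copied from the question's output, and first_link_url() is a hypothetical helper name, not part of any library.

```php
<?php
// Sample fragment copied from the question's output (the content of $item[1]).
$fragment = '<script type="text/javascript">execute_function(3,\'\')</script>'
    . '<a href="http://www.example.com/cool_news" id="963" target="_blank">Click here for news</a>'
    . ' &nbsp;<img src="/news.gif" width="12" height="12" title="validated" />';

// Hypothetical helper: returns the href of the first <a> in an HTML fragment,
// or null when the fragment is empty or contains no link.
function first_link_url(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();

    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // string() converts the selected attribute node to its value;
    // it yields an empty string when nothing matches.
    $xpath = new DOMXPath($doc);
    $url = $xpath->evaluate('string((//a)[1]/@href)');

    return $url === '' ? null : $url;
}

echo first_link_url($fragment), "\n"; // prints http://www.example.com/cool_news
```

Inside the loop from the question, the same helper could be called as first_link_url($item[1]) to store each URL in its own variable.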

EDIT: Seeing as you are using Simple HTML DOM Parser, try the following:

$blogItemHtml = new simple_html_dom();
$blogItemHtml->load($item[1]);

$anchors = $blogItemHtml->find('a');
echo $anchors[0]->href; // "http://www.example.com/cool_news"

2 Comments

This is the parser I am using: net.tutsplus.com/tutorials/php/…
@Jonathan I've made an edit to my answer, hopefully should help
