Admittedly, I am not the best with regular expressions. I am screen scraping a page and then trying to fix the img src values of the embedded images so that they point back to the original domain. This is the regex I have been trying variations of (too many to list; here is the current one):

preg_match_all('/<img\b[^>]*>/i', $html, $images);  

What this ends up doing is replacing every < with />. What I need is just the (currently) five images on the page returned in an array, so that I can fix their src values and then write them back to $html, which is set at the beginning of the file:

$html = file_get_contents($target_url);

  • It seems like you're just trying to get the src attribute. Would DOMDocument or even SimpleXML not do? Commented Feb 22, 2011 at 22:44
  • stackoverflow.com/questions/1732348/… Commented Feb 22, 2011 at 22:44

1 Answer

Basically, don't do this with regex. You can parse HTML with regex, but it is almost certainly not worth the effort.
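
To see why, here is a small sketch of one way the pattern from the question can go wrong. The sample markup is made up for illustration: a `>` inside a quoted attribute value is enough to cut the match short.

```php
<?php
// Hypothetical sample markup: the alt text contains a literal '>'.
$html = '<p><img alt="a > b" src="/img/pic.png"></p>';

// The pattern from the question: "<img", then anything that is not '>', then '>'.
preg_match_all('/<img\b[^>]*>/i', $html, $images);

// $images[0] holds the full matches. The single match stops at the first '>',
// which sits inside the alt attribute, so the src attribute is never captured.
$firstMatch = $images[0][0];
echo $firstMatch, PHP_EOL; // prints: <img alt="a >
```

A real parser does not have this problem, because it tracks quoting and nesting rather than scanning for the next `>`.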

Do it with genuine DOM parsing instead, using the DOMDocument class:

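// Real-world HTML is often malformed; keep libxml from emitting warnings.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
foreach ($dom->getElementsByTagName('img') as $image) {
    // Prefix the original domain. This assumes every src is a relative path.
    $src = $image->getAttribute('src');
    $image->setAttribute('src', 'http://example.com/' . ltrim($src, '/'));
}
$html = $dom->saveHTML();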
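
The loop above prefixes every src, including ones that are already absolute. Below is a minimal sketch of a more defensive variant, assuming the page uses either absolute URLs or paths relative to the site root. The function name `rebaseImageSources` and the sample markup are hypothetical, and full relative-path resolution (`../`, a `<base>` tag) is deliberately left out.

```php
<?php
/**
 * Point relative <img src> values at $baseUrl and return the rewritten HTML.
 * Hypothetical helper for illustration; not part of any library.
 */
function rebaseImageSources(string $html, string $baseUrl): string
{
    // Scraped pages are often malformed; collect libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('img') as $image) {
        $src = $image->getAttribute('src');

        // Leave empty values, absolute URLs, protocol-relative URLs and data URIs alone.
        if ($src === '' || preg_match('~^(?:[a-z][a-z0-9+.-]*:|//)~i', $src)) {
            continue;
        }

        // Join base and path with exactly one slash between them.
        $image->setAttribute('src', rtrim($baseUrl, '/') . '/' . ltrim($src, '/'));
    }

    return $dom->saveHTML();
}

// Usage with made-up markup: one relative and one absolute image.
$html = '<p><img src="/img/a.png"><img src="http://cdn.example.org/b.png"></p>';
$fixed = rebaseImageSources($html, 'http://example.com');
echo $fixed;
```

Note that `saveHTML()` returns a full document, so a fragment like the one above comes back wrapped in a doctype plus `<html>` and `<body>` elements.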

1 Comment

And if you're familiar with jQuery you can try code.google.com/p/phpquery
