
I have a table in a MySQL-based CMS, one of whose fields contains the text of the articles displayed on the CMS web pages.

Some of the articles contain images embedded in the text, in the form of HTML 'img' tags. There may be one or several images in the text contained in the field.

I want to create a query that extracts a list of all the images in all the articles. So far I have the following:

SELECT nid,
       SUBSTR(body, LOCATE('<img', body), LOCATE('>', body, LOCATE('<img', body)) - LOCATE('<img', body)) AS image,
       body
FROM `node_revisions`
WHERE body LIKE '%<img%'

This seems to work, but it only extracts the first image in each article, and I would like to extract all of them. Normally that would mean using a loop, but that doesn't seem possible in a plain MySQL query.

Just for reference, the CMS in question is Drupal 6, hence the names of the fields and table. However, this is really a question about MySQL, not Drupal, which is why I'm asking here rather than on the Drupal Stack Exchange site.

  • I suggest doing this with something like PHP rather than MySQL. This answer might be informative. Here's another article. Commented Aug 5, 2016 at 18:41

2 Answers


You will drive yourself insane trying to use locate(), substring(), or regular expressions to parse HTML or XML. See https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

I suggest you use PHP's DOMDocument class:

<?php

$bodyHtml = "now is the time for all <img src='good.jpg'> men to come to the <img src='aid.jpg'> of their country";

$dom = new DOMDocument();
$dom->loadHTML($bodyHtml);
$imgs = $dom->getElementsByTagName("img");
foreach ($imgs as $img) {
        print "$img->nodeName\n";
        foreach ($img->attributes as $attr) {
                print "  $attr->name=$attr->value\n";
        }
}

Outputs:

img
  src=good.jpg
img
  src=aid.jpg

1 Comment

That works just fine. For Drupal developers' reference, I was able to use the Views PHP module to generate the appropriate output in a View, as described in this documentation.

Parsing HTML with regex is never 100% reliable; you'll never feel confident that you've caught every image and extracted each one correctly.

The other problem is one you hinted at in your question. One record in node_revisions may contain 1, 2 or 10,000 images. There is no straightforward way in plain SQL to return each image as a new row in your query results, so you'd have to return each image as a new column.

That means you would need to specify each column by hand:

SELECT code_to_return_img_1 as url1
      ,code_to_return_img_2 as url2
      ,code_to_return_img_3 as url3
      ,code_to_return_img_4 as url4
      ,code_to_return_img_5 as url5
      ,code_to_return_img_6 as url6
      ....
      and so on

If you knew there would be fewer than, say, 20 images per article, you didn't have PHP/Java/Python at your disposal, and it was just a one-off hack job, then you could do it with regex and SQL. But your 30-minute job could turn into a 2-day job and a burst vein.

If Java is an option: https://jsoup.org/

If Python is an option: https://docs.python.org/2/library/htmlparser.html
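For the Python route, a minimal sketch using the standard library's html.parser module (the Python 3 name of the module linked above) might look like the following. The names ImgCollector and extract_images are made up for this illustration, and the sample string is the one from the other answer.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the attributes of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []  # one dict of attributes per <img> tag

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased;
        # attrs is a list of (name, value) pairs.
        if tag == "img":
            self.images.append(dict(attrs))


def extract_images(body_html):
    """Return a list of attribute dicts, one per <img> tag in body_html."""
    parser = ImgCollector()
    parser.feed(body_html)
    parser.close()
    return parser.images


body = ("now is the time for all <img src='good.jpg'> men to come "
        "to the <img src='aid.jpg'> of their country")
for img in extract_images(body):
    print(img.get("src"))
```

Run as is, this should print good.jpg and aid.jpg on separate lines. Self-closing tags such as <img ... /> go through the same handler by default.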

If PHP is an option: http://htmlparsing.com/php.html

$dom = new DOMDocument;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
    $imgurl = $image->getAttribute('src');
}
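Once the parsing happens outside SQL, getting one row per image is just a nested loop over the fetched records. Here is a rough Python sketch of that flattening step, assuming the rows have already been fetched as (nid, body) pairs, for example with SELECT nid, body FROM node_revisions WHERE body LIKE '%<img%' through whatever MySQL driver you use. The sample_rows data and the names image_rows and SrcCollector are hypothetical.

```python
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_rows(rows):
    """Yield one (nid, src) pair per image found in each (nid, body) row."""
    for nid, body in rows:
        parser = SrcCollector()
        parser.feed(body or "")  # treat a NULL body as empty text
        parser.close()
        for src in parser.sources:
            yield nid, src


# Hypothetical stand-in for rows fetched from node_revisions.
sample_rows = [
    (1, "<p>intro <img src='a.jpg'> text <img src='b.jpg'></p>"),
    (2, "<p>no images here</p>"),
    (3, '<img alt="logo" src="c.png">'),
]
for nid, src in image_rows(sample_rows):
    print(nid, src)
```

An article with several images produces several output rows, and an article with none produces no rows, which is the shape the question asks for.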
