30

I have found several similar questions, but so far, none have been able to help me.

I am trying to output the 'src' of all images in a block of HTML, so I'm using DOMDocument(). This method is actually working, but I'm getting a warning on some pages, and I can't figure out why. Some posts suggested suppressing the warning, but I'd much rather find out why the warning is being generated.

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 10

One example of post->post_content that is generating the error is -

On Wednesday 21st November specialist rights of way solicitor Jonathan Cheal of Dyne Drewett will be speaking at the Annual Briefing for Rural Practice Surveyors and Agricultural Valuers in Petersfield.
<br>
Jonathan is one of many speakers during the day and he is specifically addressing issues of public rights of way and village greens.
<br>
Other speakers include:-
<br>
<ul>
<li>James Atrrill, Chairman of the Agricultural Valuers Associates of Hants, Wilts and Dorset;</li>
<li>Martin Lowry, Chairman of the RICS Countryside Policies Panel;</li>
<li>Angus Burnett, Director at Martin & Company;</li>
<li>Esther Smith, Partner at Thomas Eggar;</li>
<li>Jeremy Barrell, Barrell Tree Consultancy;</li>
<li>Robin Satow, Chairman of the RICS Surrey Local Association;</li>
<li>James Cooper, Stnsted Oark Foundation;</li>
<li>Fenella Collins, Head of Planning at the CLA; and</li>
<li>Tom Bodley, Partner at Batcheller Monkhouse</li>
</ul>

I can post some more examples of what post->post_content contains if that would be helpful?

I have allowed access to a development site temporarily, so you can see some examples [Note - links no longer accessible as question has been answered] -

Any tips on how to resolve this? Thanks.

$dom = new DOMDocument();
$dom->loadHTML(apply_filters('the_content', $post->post_content)); // Have tried stripping all tags but <img>, still generates warning
$nodes = $dom->getElementsByTagName('img');
foreach($nodes as $img) :
    $images[] = $img->getAttribute('src');
endforeach;
10
  • 1
    Showing the line that caused the error would definitely make debugging it easier. Commented Feb 1, 2013 at 14:27
  • ??? The warning is on DOMDocument::loadHTML(), so the line causing the error is $dom->loadHTML(apply_filters('the_content', $post->post_content)); Commented Feb 1, 2013 at 14:29
  • 1
    Line 10 of the content you're parsing... Commented Feb 1, 2013 at 14:40
  • Ok, with you. In one case, it's James Cooper, Stnsted Oark Foundation;. I did think it could be the ; causing the issue, but removing them all (there were several before) didn't help. Commented Feb 1, 2013 at 14:43
  • 13
    @DavidGard My best guess then is that there is an unescaped ampersand (&) somewhere in the HTML. This will make the parser think we're in an entity reference (e.g. &copy;). When it gets to ;, it thinks the entity is over. It then realises what it has doesn't conform to an entity, so it sends out a warning and returns the content as plain text. Commented Feb 1, 2013 at 14:49

9 Answers

46

This correct answer comes from a comment from @lonesomeday.

My best guess then is that there is an unescaped ampersand (&) somewhere in the HTML. This will make the parser think we're in an entity reference (e.g. &copy;). When it gets to ;, it thinks the entity is over. It then realises what it has doesn't conform to an entity, so it sends out a warning and returns the content as plain text.
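
To locate the offending ampersand, something like the following sketch may help. The function name findBareAmpersandLines and the sample string are hypothetical, made up for illustration. The line numbers it reports are relative to the string you pass in, which may not match the line libxml reports after apply_filters() has rewritten the content.

```php
<?php
/**
 * Hypothetical helper: return the 1-based line numbers that contain an "&"
 * which does not start a named (&copy;) or numeric (&#169;, &#xA9;) entity.
 */
function findBareAmpersandLines(string $html): array
{
    // "&" NOT followed by "name;", "#digits;" or "#xhex;"
    $pattern = '/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/';

    $lines = array();
    foreach (explode("\n", $html) as $index => $line) {
        if (preg_match($pattern, $line)) {
            $lines[] = $index + 1;
        }
    }
    return $lines;
}

// Sample content modelled on the question (assumed, shortened).
$sample = "<ul>\n"
    . "<li>Martin Lowry, Chairman of the RICS Countryside Policies Panel;</li>\n"
    . "<li>Angus Burnett, Director at Martin & Company;</li>\n"
    . "<li>&copy; Thomas Eggar</li>\n"
    . "</ul>";

print_r(findBareAmpersandLines($sample)); // only the "Martin & Company" line is reported
```

In the question's content, the line with "Martin & Company" is the likely culprit.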


4 Comments

So how do I fix it? I can't call htmlentities on the whole HTML string.
@MavWolverine I know this is many years later, but I just stumbled into this same issue. The simplest option I found was just to do a string replace, str_replace(' & ', ' &amp; ', $string), as htmlentities and htmlspecialchars caused the < and > of the HTML tags to be converted. Now I am 100% sure there is a better way to do this, but that sorted what I needed on a simple one-off parse job.
@PanPipes a little more restrictive: preg_replace("/&(?!\S+;)/", "&amp;", $string).
This saved my day. I was struggling and later found that the user-generated content included & in a name, and that was the source of all the errors. Thanks
28

As mentioned here

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity,

you can use :

libxml_use_internal_errors(true);

see http://php.net/manual/en/function.libxml-use-internal-errors.php
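
A minimal sketch of how this could be used so that the errors are collected rather than silently discarded. The sample HTML is assumed, and which messages appear (if any) depends on the libxml2 version PHP is built against. This hides the warning but does not fix the underlying markup.

```php
<?php
// Assumed sample: a bare "&" in text content.
$html = '<ul><li>Angus Burnett, Director at Martin & Company;</li></ul>';

// Collect libxml errors internally instead of raising PHP warnings.
// The call returns the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Copy the collected parser messages so they can be logged or inspected.
$messages = array();
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
}

// Empty the error buffer and restore the previous error handling.
libxml_clear_errors();
libxml_use_internal_errors($previous);

print_r($messages);

// The document is still usable; the ampersand is kept as literal text.
echo $dom->getElementsByTagName('li')->item(0)->textContent, "\n";
```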

3 Comments

Loading the HTML like this also helped me: @$dom->loadHTML($html);
This fixed my problem
Great, again stackoverflow saved me ;)
5

There is an unescaped "&" somewhere in the HTML; replace "&" with "&amp;". Here is my solution:

 $html = preg_replace('/&(?!amp)/', '&amp;', $html);

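It replaces each lone ampersand with "&amp;", while any existing "&amp;" stays the same. Note that other entities, such as &copy;, do not match the lookahead and are rewritten as well (&copy; becomes &amp;copy;).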
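
A stricter variant that leaves every well-formed entity alone, not only &amp;, might look like the sketch below. The function name escapeBareAmpersands is hypothetical, and the pattern only checks that an entity looks syntactically valid, not that its name is actually defined in HTML.

```php
<?php
/**
 * Hypothetical helper: escape only those "&" characters that do not
 * already start a named (&copy;) or numeric (&#169;, &#xA9;) entity.
 */
function escapeBareAmpersands(string $html): string
{
    return preg_replace(
        '/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/',
        '&amp;',
        $html
    );
}

$html = '<li>Martin & Company &amp; Sons &copy; 2013 &#169;</li>';

echo escapeBareAmpersands($html), "\n";
// <li>Martin &amp; Company &amp; Sons &copy; 2013 &#169;</li>
```

Because existing entities are skipped, running the function twice on the same string gives the same result as running it once.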

3

Check for a "&" character anywhere in your HTML code. I had this issue because of exactly that.

1 Comment

And replace & with &amp;
1

I don't have the reputation required to leave a comment above, but using htmlspecialchars solved this problem in my case:

$inputHTML = htmlspecialchars($post->post_content);
$dom = new DOMDocument();
$dom->loadHTML(apply_filters('the_content', $inputHTML)); // Have tried stripping all tags but <img>, still generates warning
$nodes = $dom->getElementsByTagName('img');
foreach($nodes as $img) :
    $images[] = $img->getAttribute('src');
endforeach;

For my purposes, I'm also using strip_tags($inputHTML, "<strong><em><br>"), so all image tags are stripped out as well - I'm not sure if this would be a problem otherwise.
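
For the question's use case it does matter: escaping the whole string also escapes the tags, so DOMDocument sees only text and finds no images. A small sketch with an assumed sample string:

```php
<?php
// Assumed sample containing both a bare "&" and an <img> tag.
$raw = '<p>Martin & Company <img src="logo.png"></p>';

// Keep parser warnings out of the output for this demonstration.
$previous = libxml_use_internal_errors(true);

// Whole string escaped: "<img ...>" becomes "&lt;img ...&gt;", i.e. plain text.
$escapedDom = new DOMDocument();
$escapedDom->loadHTML(htmlspecialchars($raw));

// Unescaped string: the <img> element survives parsing.
$rawDom = new DOMDocument();
$rawDom->loadHTML($raw);

libxml_clear_errors();
libxml_use_internal_errors($previous);

echo $escapedDom->getElementsByTagName('img')->length, "\n"; // 0
echo $rawDom->getElementsByTagName('img')->length, "\n";     // 1
```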

0

I eventually solved this problem the right way, using tidy

// Configuration
$config = array(
    'indent'         => true,
    'output-xhtml'   => true,
    'wrap'           => 200);

// Tidy to avoid errors during load html
$tidy = new tidy;
$tidy->parseString($bill->bill_text, $config, 'utf8');
$tidy->cleanRepair();

$domDocument = new DOMDocument();
$domDocument->loadHTML(mb_convert_encoding($tidy, 'HTML-ENTITIES', 'UTF-8'));

2 Comments

Welcome to Stack Overflow. Please explain how your code solves the problem.
I believe that the loadHTML method has trouble dealing with malformed HTML. Using tidy helped me solve this issue.
0

For Laravel,

Use {{ }} instead of {!! !!}

I faced this and managed to solve it that way.
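
Blade's {{ }} sends the value through htmlspecialchars-style escaping, while {!! !!} prints it raw. In plain PHP the effect on a text value is roughly the following (a sketch with an assumed value; it only makes sense when the value is text, not markup you want to keep):

```php
<?php
// Assumed text value that contains a bare ampersand.
$name = 'Martin & Company';

// Roughly what {{ $name }} outputs: special characters become entities.
$escaped = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n"; // Martin &amp; Company
```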

0

I found there was an error in my table tags. There was an extra </td> that I removed and bingo.

-8

Just replace "&" with "and" in your string. Do that for all the other symbols.

1 Comment

No, that's a terrible suggestion. The use of & is for a specific purpose, and simply replacing it with and doesn't conform in most cases. Company names are one obvious example.
