I'm using the XHTML Transitional doctype for displaying content in a browser. Before the content is displayed, however, it is passed through an XML parser (DOMDocument) that applies final touches before it is output to the browser.

I use a custom-designed CMS for my website that allows me to make changes to the site. It has a module that lets me display HTML snippets on my website, in a way similar to WordPress widgets.

The problem I am facing right now is that any code provided through this module must be valid XHTML, or else the module needs to convert it to valid XHTML. Currently, if any portion of the input code is not XHTML compliant, my XML parser breaks and throws warnings.

What I am looking for is a solution that encodes the entities present in the URLs and text portions of the input provided via a TextArea control. For example, the following string breaks the parser with an entity reference error:

<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>

The following line causes the same error:

<a href="http://www.somesite.com">Books & Cool stuff</a>

P.S. If I use htmlentities or htmlspecialchars, they also convert the angle brackets of the tags, which I do not want. I only need the URLs and text portions of the string to be escaped/encoded.

Any help would be greatly appreciated.

Thanks and regards, Waqar Mushtaq

  • Normalize it with Tidy first. Commented Aug 7, 2011 at 16:56
  • A proper solution to XML escaping woes would be facebook.com/notes/facebook-engineering/… - but that's fairly non-standard. So the way to keep the output XHTML syntax compliant (even if you are actually sending it with the wrong MIME type) is in fact to pipe it through libtidy. Commented Aug 7, 2011 at 17:12
  • Thanks, hakre, you saved my day. Does Tidy perform well? Commented Aug 7, 2011 at 22:11

3 Answers

What you'd need to do is generate valid XHTML in the first place. All your attributes must be passed through htmlentities.

<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>

should be

<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&amp;sumthing"></script>

and

<a href="http://www.somesite.com">Books & Cool stuff</a>

should be

<a href="http://www.somesite.com">Books &amp; Cool stuff</a>

It's not easy to always generate valid XHTML. If at all possible I would recommend you find some other way of doing the post processing.
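If the snippets cannot be corrected by hand, one partial workaround is to escape only the ampersands that do not already start an entity reference. The sketch below is an illustration under assumptions, not a complete fix: `escape_bare_ampersands` is a hypothetical helper name, and the regular expression only handles bare `&` characters. It does not repair unclosed tags, does not define entities such as `&nbsp;` that XML does not know, and would also rewrite `&` inside inline script bodies.

```php
<?php
// Hypothetical helper: replace "&" with "&amp;" unless it already begins
// a named (&amp;), decimal (&#160;) or hexadecimal (&#xA0;) entity reference.
function escape_bare_ampersands(string $markup): string
{
    $pattern = '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/';
    // preg_replace() returns null on a PCRE error; fall back to the input then.
    return preg_replace($pattern, '&amp;', $markup) ?? $markup;
}

// The two examples from the question.
$script = '<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>';
$link   = '<a href="http://www.somesite.com">Books & Cool stuff</a>';

echo escape_bare_ampersands($script), "\n";
echo escape_bare_ampersands($link), "\n";
```

Angle brackets are left untouched, so the tags survive, and already-escaped input is not double-escaped.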

2 Comments

Thanks for sharing this, but it would not be feasible to modify each and every code snippet before pasting it into the textarea control. I would like my backend to take care of it.
Then you can't do it, or there will be cases where the output will be malformed.
As already suggested in a quick comment, you can solve the problem quite comfortably with the PHP tidy extension.

To convert an HTML fragment - even a good tag soup - into something DomDocument or SimpleXML can deal with, you can use something like the following:

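// Tidy options: emit XHTML, and return only the contents of <body>
// so that a fragment stays a fragment.
$config = array(
    'output-xhtml' => 1,
    'show-body-only' => 1
);
// $html holds the raw input (tag soup is fine).
$fragment = tidy_repair_string($html, $config);
// Wrap the repaired fragment in a single root element for the XML parser.
$xhtml = sprintf("<body>%s</body>", $fragment);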

Example: Format tag soup HTML as valid XHTML with tidy_repair_string.

Tidy has many options; the two used here are needed for fragment output and XHTML compatibility.

The only problem left now is that this XHTML fragment can contain entities that DomDocument or SimpleXML do not understand, for example &nbsp;. This and others are undefined in XML.
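If you would rather keep feeding an XML parser, one option that may help is Tidy's `numeric-entities` setting, which asks Tidy to write entities such as `&nbsp;` in numeric form (`&#160;`), and XML parsers accept numeric references. The following is a minimal sketch under assumptions: `fragment_to_xhtml` is a hypothetical helper name, the input is assumed to be UTF-8, and the function returns null when the tidy extension is not available.

```php
<?php
// Hypothetical helper: repair an HTML fragment with Tidy and wrap it in
// a single root element so that an XML parser can load it.
function fragment_to_xhtml(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension is not installed
    }
    $config = array(
        'output-xhtml'     => true,
        'show-body-only'   => true,
        'numeric-entities' => true, // numeric references instead of named entities
    );
    // Assumption: the input is UTF-8 encoded.
    $fragment = tidy_repair_string($html, $config, 'utf8');
    if ($fragment === false) {
        return null;
    }
    return sprintf('<body>%s</body>', $fragment);
}

$xhtml = fragment_to_xhtml('<a href="http://www.somesite.com">Books & Cool&nbsp;stuff</a>');
$loaded = null;
if ($xhtml !== null) {
    $dom = new DOMDocument();
    $loaded = $dom->loadXML($xhtml); // true when the repaired markup is well-formed
}
```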

As far as DomDocument is concerned (you wrote that you use it), it also supports loading HTML instead of XML, and the HTML loader deals with those entities:

$dom = new DomDocument;
$dom->loadHTML($xhtml);

Example: Loading HTML with DomDocument
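Since the question mentions warnings thrown by the parser, here is a hedged sketch of loading the two problematic strings with loadHTML while collecting libxml's parse errors instead of letting them surface as PHP warnings. The `<body>` wrapper and the variable names are illustrative assumptions, not part of the original setup.

```php
<?php
// Input that is not well-formed XML: bare ampersands in a URL and in text.
$html = '<script type="text/javascript" src="http://www.abcxyz.com/foo?bar=1&sumthing"></script>'
      . '<a href="http://www.somesite.com">Books & Cool stuff</a>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML('<body>' . $html . '</body>');

// Inspect or log the collected errors, then restore the previous setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The lenient HTML parser recovers the link and its text.
$link = $dom->getElementsByTagName('a')->item(0);
$linkText = $link !== null ? $link->textContent : null;

// Serialising the node as XML writes the ampersand in escaped form.
$linkXml = $link !== null ? $dom->saveXML($link) : '';
echo $linkXml, "\n";
```

The `$errors` array holds LibXMLError objects, so the CMS can log what was wrong with a snippet without breaking the page.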

2 Comments

Thanks, tidy really helped. Just one more question: is it an overhead to first parse the text using tidy and then load it into the XML parser?
@Waqar Mushtaq: If it actually allows your system to run as intended, I would not consider it overhead but a necessity. However, you can check whether $dom->loadHTML($fragment); already does the job on its own.

HTML Tidy is a computer program and a library whose purpose is to fix invalid HTML and to improve the layout and indent style of the resulting markup.

http://tidy.sourceforge.net/

Examples of bad HTML it is able to fix:

  • Missing or mismatched end tags, mixed up tags
  • Adding missing items (some tags, quotes, ...)
  • Reporting proprietary HTML extensions
  • Change layout of markup to predefined style
  • Transform characters from some encodings into HTML entities
