PHP parse xml with html content

Question

Is it possible in PHP, using the default XML classes, to parse an XML file in such a way that only elements from one namespace are treated as XML? I want to parse XML files in which some elements contain HTML code, and preferably I don't want to wrap every such element in CDATA tags or escape all special characters. Since HTML has a syntax quite similar to XML, most parsers will try to parse the HTML as well and won't handle it correctly.

Example:

<ns:root>
    <ns:date>
        06-12-2011
    </ns:date>
    <ns:content>
        <html>
        <head>
        <title>Sometitle</title>
        </head>
        <body>
        --a lot of stuff here
        </body>
        </html>
    </ns:content>
</ns:root>

In this example I want all the HTML inside ns:content to be the content of that element, and it shouldn't be parsed itself. Is this possible with the default parsers like SimpleXML, or should I write my own parser?

Edit: Let me explain my situation a little better: I want to create a small personal PHP framework in which code is separated from the HTML (similar to MVC, but not quite the same). However, much of the HTML will be the same across multiple pages, but not all of it, and some data from e.g. a database should be inserted in some pages, nothing different from normal websites. So I came up with the idea of using separate HTML component files, which can be parsed by an HTML script. This would look something like this:

main.fw:

<html>
<head>
    <title>
        <fw:placeholder name="title" />
    </title>
</head>
<body>
    <div id="menubar">
        <ul>
            <li>page1</li>
            <li>page2</li>
        </ul>
    </div>
    <div id="content">
        <fw:placeholder name="maincontent" />
    </div>
</body>
</html>

page1.fw

<fw:component file="main.fw">
    <fw:content name="title">
        page1
    </fw:content>
    <fw:content name="maincontent">
        some content with html
    </fw:content>
</fw:component>

Result after parsing: page1

page1
page2

some content with html

This question is mainly about that second type of file, in which html is nested inside xml elements.

This has been done a million times before. Take a look at how other PHP CMS systems do it; I guess they have found a way that proved to be good.
– CodeZombie, Commented Dec 6, 2011 at 23:31
I already thought many had done that before, and that's why I thought it should be possible. Do you happen to know a CMS which uses something similar?
– Tiddo, Commented Dec 6, 2011 at 23:35

Francis Avila · Accepted Answer · 2011-12-07 00:01:36Z

1

An XML file with some parts that are not XML is not an XML file. Thus you can't expect that an XML parser will be able to parse it. For a document to be XML the whole thing must be XML.
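To make this concrete, here is a minimal sketch of what a stock parser does with ordinary HTML nested in an otherwise fine document. The namespace URI `urn:example:ns` and the sample markup are made up for illustration; the only point is that a single unclosed `<br>` makes the whole document fail to load.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical input: well-formed wrapper, but the nested HTML has an unclosed <br>.
$xml = '<ns:root xmlns:ns="urn:example:ns">'
     . '<ns:content><p>line one<br>line two</p></ns:content>'
     . '</ns:root>';

$doc = new DOMDocument();
$loaded = $doc->loadXML($xml);   // false, because the document is not well-formed

$errors = libxml_get_errors();   // array of LibXMLError objects describing the problem
libxml_clear_errors();

foreach ($errors as $error) {
    echo trim($error->message), "\n";
}
```

If the nested markup were written as `<br/>`, the same call would succeed, which is why keeping the HTML bits well-formed is the easiest route.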

What you are asking for is essentially "is there a parser that will parse my made-up angle-bracket language." Maybe DOMDocument->loadHTML() or html5lib will interpret it according to your expectations, but no guarantees.

Is it really a terrible burden for your included "html" bits to be valid XML? This is good HTML hygiene anyway, and if you are willing to do that, you can implement your entire view system with XSL templates very easily. Most of the benefit of a node-aware template system is precisely that you can manipulate nodes directly and have pretty good assurances that the final document will be valid. Why have the burden of node-awareness with none of the benefit? You might as well use a string-based system like every other template system out there. At least it will be faster.

Note that once you have constructed your final DOM, you can output it as something else, like HTML, so just because all your input templates are XML doesn't mean your output has to be.
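As a rough sketch of that last point, assuming the templates are well-formed XML: load the template with DOMDocument, replace the placeholder elements, and serialize with `saveHTML()` instead of `saveXML()`. The namespace URI `urn:example:fw` and the `$values` array are hypothetical; the question never declares a URI for the `fw` prefix, and an XML parser needs one.

```php
<?php
// Hypothetical namespace URI for the fw prefix, chosen only for this sketch.
$fwNs = 'urn:example:fw';

$template = <<<XML
<html xmlns:fw="{$fwNs}">
<head><title><fw:placeholder name="title"/></title></head>
<body><div id="content"><fw:placeholder name="maincontent"/></div></body>
</html>
XML;

// Hypothetical values to fill in, keyed by placeholder name.
$values = ['title' => 'page1', 'maincontent' => 'some content with text'];

$doc = new DOMDocument();
$doc->loadXML($template);

// Copy the live node list into an array first, because replacing nodes changes the list.
$placeholders = iterator_to_array($doc->getElementsByTagNameNS($fwNs, 'placeholder'));
foreach ($placeholders as $node) {
    $name = $node->getAttribute('name');
    $text = $doc->createTextNode($values[$name] ?? '');
    $node->parentNode->replaceChild($text, $node);
}

// The input was XML, but the output is serialized as HTML.
$html = $doc->saveHTML();
echo $html;
```

This sketch inserts the values as plain text, so any markup in them is escaped. To insert markup you could build a `DOMDocumentFragment` with `appendXML()`, which again requires the inserted markup to be well-formed XML.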

answered Dec 7, 2011 at 0:01

Francis Avila


3 Comments

Tiddo Over a year ago

This system isn't meant to provide an interface for a PHP application to parse or edit HTML. This system is meant to separate the GUI, and thus the HTML, entirely from the PHP code, just like MVC does. However, if we're speaking in MVC terms, I want the view to be modular. I've used ASP.NET MVC as an inspiration, but there you only have a master page (like main.fw in my example) and one view for every page (like page1.fw). I want to create a similar system, but without the limit of only having one reusable view component (the master page). (no chars left...)

Tiddo Over a year ago

And it isn't a burden for me to have the HTML bits be valid XML, not at all; however, this still gives me some problems with, for example, the doctype definition of the HTML page. And again, this system is meant to 'merge' components together, not to complain about the XML-ness of the HTML page. I'm sorry if this is a little nitpicky, by the way.

Francis Avila Over a year ago

You might like VTE. I believe it can support what you are doing, and they've taken care of those details.

timing · Accepted Answer · 2011-12-06 12:51:16Z

1

You can use textContent when using DOMDocument: http://www.php.net/manual/en/class.domnode.php
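For reference, a small sketch of what `textContent` gives you once a well-formed document is loaded (the sample markup is made up). It returns the concatenated text of all descendants and drops the tags, so if the nested markup itself is wanted, serializing the child nodes with `saveXML()` is one alternative.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<root><content><p>Hello <b>world</b></p></content></root>');

$content = $doc->getElementsByTagName('content')->item(0);

// textContent: text of all descendants, markup removed.
$text = $content->textContent;

// Serializing each child node keeps the nested markup.
$inner = '';
foreach ($content->childNodes as $child) {
    $inner .= $doc->saveXML($child);
}

echo $text, "\n";   // Hello world
echo $inner, "\n";  // <p>Hello <b>world</b></p>
```

Either way the input has to load as XML first, so this does not help with nested HTML that is not well-formed.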

answered Dec 6, 2011 at 12:51

timing


1 Comment

Tiddo Over a year ago

But how can this parse an xml file? As far as I understand from the documentation this function reads something from a PHP DOMNode. However, I want to load an xml file into some form of php xml representation. So the problem is getting it into the domnode, not getting it from it.

CodeZombie · Accepted Answer · 2011-12-06 23:36:25Z

1

You want the HTML code to be treated as non-XML, and that's exactly what character data (CDATA) sections are designed for.

<ns:root>
    <ns:date>
        06-12-2011
    </ns:date>
    <ns:content>
        <![CDATA[
            <html>
            <head>
            <title>Sometitle</title>
            </head>
            <body>
            --a lot of stuff here
            </body>
            </html>
        ]]>
    </ns:content>
</ns:root>

Better rely on this than write your own parser. Use the XMLWriter::writeCData() method to write the CDATA section.

Important: HTML tags inside the CDATA section do not need to be encoded!

Quote from Wikipedia CDATA:

However, if written like this:

<![CDATA[<sender>John Smith</sender>]]>

then the code is interpreted the same as if it had been written like this:

&lt;sender&gt;John Smith&lt;/sender&gt;
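A minimal round-trip sketch of this, assuming the xmlwriter and dom extensions are available: write the HTML into a CDATA section with XMLWriter, then read it back unchanged through DOMDocument. The namespace URI `urn:example:ns` and the sample HTML are invented for the example.

```php
<?php
// Sample HTML that is not well-formed XML (unclosed <br>, bare "&" and "<").
$html = '<html><body><p>a < b & c<br>done</p></body></html>';

$writer = new XMLWriter();
$writer->openMemory();
$writer->startElementNs('ns', 'root', 'urn:example:ns');  // declares xmlns:ns on the root
$writer->startElementNs('ns', 'content', null);           // null: reuse the prefix, no new declaration
$writer->writeCdata($html);                               // wraps the string in <![CDATA[ ... ]]>
$writer->endElement();
$writer->endElement();
$xml = $writer->outputMemory();

$doc = new DOMDocument();
$doc->loadXML($xml);
$content = $doc->getElementsByTagNameNS('urn:example:ns', 'content')->item(0);

// The parser hands the CDATA content back as plain text, untouched.
$roundTripped = $content->textContent;
echo $roundTripped, "\n";
```

One caveat: a CDATA section ends at the first `]]>`, so content containing that sequence has to be split across two sections.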

edited Dec 6, 2011 at 23:36

answered Dec 6, 2011 at 22:36

CodeZombie


6 Comments

Tiddo Over a year ago

The point is, the XML file should be human friendly (both for reading and writing), and I think the CDATA tags aren't that human friendly, so preferably I want to avoid those. I was hoping I could instruct an xml parser to only parse a certain namespace as xml. As an ugly workaround I could insert those CDATA tags with regex before feeding it to an xml parser, since I know where they should be, but that's a lot of unnecessary processing and very error prone
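For what it's worth, the regex workaround mentioned here could look roughly like the sketch below. The helper name `wrapInCdata` is hypothetical, and it assumes the target element is never nested inside itself and always has a separate closing tag, which is exactly the kind of assumption that makes this approach fragile.

```php
<?php
/**
 * Hypothetical helper: wraps the body of every <$tag>...</$tag> element in a CDATA section.
 * Assumes $tag elements are not nested in each other; self-closing tags are left alone.
 */
function wrapInCdata(string $xmlish, string $tag): string
{
    $quoted = preg_quote($tag, '~');
    // (?<!/) stops the opening-tag match from accepting a self-closing tag.
    $pattern = '~(<' . $quoted . '(?:\s[^>]*)?(?<!/)>)(.*?)(</' . $quoted . '>)~s';

    return preg_replace_callback($pattern, function (array $m): string {
        // "]]>" would end the section early, so split it across two sections.
        $body = str_replace(']]>', ']]]]><![CDATA[>', $m[2]);
        return $m[1] . '<![CDATA[' . $body . ']]>' . $m[3];
    }, $xmlish);
}

$input = '<ns:root xmlns:ns="urn:example:ns">'
       . '<ns:date>06-12-2011</ns:date>'
       . '<ns:content><p>Hi<br>there</p></ns:content>'
       . '</ns:root>';

$doc = new DOMDocument();
$doc->loadXML(wrapInCdata($input, 'ns:content'));
$body = $doc->getElementsByTagNameNS('urn:example:ns', 'content')->item(0)->textContent;
echo $body, "\n";
```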

CodeZombie Over a year ago

Why don't you separate the HTML (the content I assume) and the XML (the metadata I assume) into different files? This way you don't need to mess with an ugly solution and the HTML can even be previewed by opening the file in a browser.

Tiddo Over a year ago

I've added some more detail in my original question. It might give you a better understanding of what I am trying to achieve

CodeZombie Over a year ago

See my edited post. Using CDATA you don't need to escape the HTML, so it remains in a human-readable form!

Tiddo Over a year ago

But preferably I don't want the CDATA tags either, even though that's a little nitpicky. I just want the code I'll use myself; I don't want to insert any code which is only needed for the XML parser, or at least as little as possible. I know I can use CDATA tags, but this question is about how to avoid those.

Tiddo · Accepted Answer · 2011-12-07 12:01:30Z

I decided to create a simple parser to see what the results would be. Since I don't parse valid XML, I will call it XMLIsh from now on.

The parser works quite well actually, and the performance isn't that bad either: I did some testing and found that it's only ~10 times slower than SimpleXMLElement on valid XML documents, even though SimpleXMLElement is built-in PHP functionality and my function is written in pure PHP. And this parser also works on 'XMLIsh' documents, as described a few times before. So as long as raw speed is not required, this might be a valid solution.

In my situation these documents are only parsed once in a while, since the output is cached, so I think this will work for me.

Anyway, this is my code:

/**
 * This function parses a string as an XMLIsh document. An XMLIsh document is very similar to xml, but only one namespace should be parsed. 
 * 
 * parseXMLish walks through the document and creates a tree while doing so. 
 * Each element will be represented as an array, with the following content:
 * -index = 0: An array with as first element (index = 0) the type of the element. All following elements are its arguments with index=name and value=value.
 * -index = 1: Optional:an array with the content of this element. If the content is a string, this array will only have one element, namely the content of the string.
 * 
 * @param &$string The XMLIsh string to be parsed
 * @param $namespace The namespace which should be parsed.
 * @param &$offset The starting point of parsing. Default = 0
 * @param $openingTag The name of the currently open tag. This argument shouldn't be set manually; the function passes it along during recursion to check that a closing tag is valid.
 */
function parseXMLish(&$string,$namespace,&$offset=0,$openingTag = ""){
    //Whitespace doesn't matter, so trim it:)
    $string = trim($string);
    $result = array();
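    //We need to find our namespaced elements. These elements use XML syntax and carry the prefix given in $namespace.
    //Opening, closing and self-closing tags are found.
    //preg_quote() keeps special characters in the prefix from being read as regex syntax.
    $prefix = preg_quote($namespace, "/");
    while(preg_match("/<(\/)?{$prefix}:(\w*)(.*?)(\/)?>/",$string,$matches,PREG_OFFSET_CAPTURE,$offset)){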
        //Before this element, other text might have been found (e.g. HTML code). 
        //This should be added to our result array first. Again, strip the whitespace.
        $preText = substr($string,$offset,$matches[0][1]-$offset);
        $trimmedPreText = trim($preText);
        //Compare against '' instead of using empty(), so that a text of "0" is kept.
        if ($trimmedPreText !== '')
            $result[] = $trimmedPreText;
        //We could have found 2 types of tags: closing and opening (including self-closing) tags.
        //We need to distinguish between those two.
        if ($matches[1][0] == ''){
            //This tag was an opening tag. This means we should add this to the result array.
            //We add the name of this tag to the element first.
            $result[][0][0] = $matches[2][0];
            //Tags can also have arguments. We will find them here, and store them in the result array.
            preg_match_all("/\s*(\w+=[\"']?\S+[\"'])/",$matches[0][0],$arguments);
            foreach($arguments[1] as $argument){
                //Limit the split to 2 parts, so a value that contains "=" stays intact.
                list($name,$value)=explode("=",$argument,2);
                $value = str_replace("\"","",$value);
                $value = str_replace("'","",$value);
                $result[count($result)-1][0][$name]=$value;
            }
            //We need to recalculate our offset. So lets do that. 
            $offset +=  strlen($preText) + strlen($matches[0][0]);
            //Now we will have to fill our element with content. 
            //This is only necessary if this is a regular opening tag, and not a self-closing tag.
            //Reset $content first, so a self-closing tag doesn't reuse the content of an earlier element.
            $content = array();
            if (!(isset($matches[4]) && $matches[4][0] == "/")){
                $content = parseXMLish($string, $namespace, $offset,$matches[2][0]);                
            }
            //Only add content when there is any. 
            if (!empty($content))
                $result[count($result)-1][] = $content;
        }else{
            //This tag is a closing tag. It means that we only have to update the offset, and that we can go one level up
            //That is: return what we have so far back to the previous level. 
            //Note: the closing tag is the closing tag of the previous level, not of the current level. 
            if ($matches[2][0] != $openingTag)
                throw new Exception("Closing tag doesn't match the opening tag. Opening tag: $openingTag. Closing tag: {$matches[2][0]}");
            $offset +=  strlen($preText) + strlen($matches[0][0]);
            return $result;
        }
    }
    //If we have any text left after our last element, we should add that to the array too.
    //Trim it like the text before elements, and keep a text of "0".
    $postText = trim(substr($string,$offset));
    if ($postText !== '')
        $result[] = $postText;

    //We're done!
    return $result;     
}
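The core of the function above is the `preg_match` call with `PREG_OFFSET_CAPTURE`, which finds the next namespaced tag and reports where it starts, so everything between two matches can be kept as raw text. A stripped-down sketch of just that step, without the recursion or attribute handling (the `$tokens` format is invented for this illustration and is not what `parseXMLish` returns):

```php
<?php
$namespace = 'fw';
$input = '<fw:content name="title">page <b>1</b></fw:content>';

// Same idea as the pattern above: optional "/", the prefix, the tag name, the rest of the tag.
$pattern = '/<(\/)?' . preg_quote($namespace, '/') . ':(\w+)([^>]*?)(\/)?>/';

$offset = 0;
$tokens = [];
while (preg_match($pattern, $input, $m, PREG_OFFSET_CAPTURE, $offset)) {
    [$tag, $pos] = $m[0];                       // matched tag and its byte offset in $input

    // Everything between the previous tag and this one is raw (HTML) text.
    $text = substr($input, $offset, $pos - $offset);
    if ($text !== '') {
        $tokens[] = ['text', $text];
    }

    if ($m[1][0] === '/') {
        $kind = 'close';
    } elseif (isset($m[4]) && $m[4][0] === '/') {
        $kind = 'selfclose';                    // trailing "/" before ">"
    } else {
        $kind = 'open';
    }
    $tokens[] = [$kind, $m[2][0]];

    $offset = $pos + strlen($tag);              // continue right after this tag
}

print_r($tokens);
```

Note that `<b>` is never matched, because it doesn't carry the `fw:` prefix; it simply ends up inside a text token.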
