3

I'm currently working on the sitemaps for a website, and I'm using SimpleXML to import the original XML file and run some checks on it. After this I use dom_import_simplexml() to convert it to a DOMDocument, which makes it easier to add and manipulate XML elements precisely. Below is the test XML sitemap that I'm working from:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:52:32-Orouke.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:53:23-castle technology.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:53:38-banana split.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:53:42-Waveney.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:55:12-pure orange.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:57:54-tau press.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:59:21-E.f.m.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:59:31-apple.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:59:45-townhouse communications.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
</urlset>

Here is the test code I'm using to modify it:

<?php

$root = simplexml_load_file("small.xml");

$domRoot = dom_import_simplexml($root);

$dom = $domRoot->ownerDocument;

$urlElement = $dom->createElement("url");

    $locElement = $dom->createElement("loc");

        $locElement->appendChild($dom->createTextNode("www.google.co.uk"));

    $urlElement->appendChild($locElement);

    $lastmodElement = $dom->createElement("lastmod");

        $lastmodElement->appendChild($dom->createTextNode("2011-08-02"));

    $urlElement->appendChild($lastmodElement);

$domRoot->appendChild($urlElement);

$dom->formatOutput = true;
echo $dom->saveXML();

?>

The main problem is that no matter where I place $dom->formatOutput = true; the existing XML that was imported from SimpleXML is formatted correctly, but anything new is output in the "all on one line" style, as follows:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:52:32-Orouke.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:53:23-castle technology.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:53:38-banana split.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:53:42-Waveney.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:55:12-pure orange.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:57:54-tau press.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:59:21-E.f.m.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:59:31-apple.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
  <url>
    <loc>http://www.companycheck.co.uk/searches/2011/08/22/23:59:45-townhouse communications.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
<url><loc>www.google.co.uk</loc><lastmod>2011-08-02</lastmod></url></urlset>

If anyone has an idea why this is happening and how to fix it I would be very grateful.

4
  • Out of curiosity, is the whitespace causing problems in your sitemap? Commented Sep 2, 2011 at 14:02
  • I'm not sure whether it is actually causing problems, but I'd rather solve the issue just in case. We have a number 1 Google search ranking for particular terms at the moment, and I don't want to jeopardise that. (I realise it's still valid XML; I'd just rather it was laid out properly in case of any parsing issues.) Commented Sep 2, 2011 at 14:30
  • Sitemap XML is meant for machines; I do not think whitespace will matter to Google. It would be better to ask this question on webmaster.stackexchange.com Commented Sep 2, 2011 at 14:32
  • 1
    I now know what part of the problem is. The formatOutput and preserveWhiteSpace flags need to be set before the file is loaded. The problem is that I'm converting a pre-loaded SimpleXML object into a DOMDocument, so it inherits all the preserved whitespace from that object. I'm trying to find out whether it's possible to tell SimpleXML not to preserve whitespace when it loads the document, so that I can pass a "clean" set of XML nodes to the DOMDocument once I've converted it. Commented Sep 2, 2011 at 16:35

3 Answers

5

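There is a workaround. You can force reformatting by saving your new XML to a string first, then loading it again after setting the formatOutput property. Setting preserveWhiteSpace to false before the reload makes the parser discard the old whitespace-only nodes, which otherwise stop the formatter from re-indenting, e.g.:

$strXml = $dom->saveXML();
$dom->preserveWhiteSpace = false; // drop the existing whitespace-only text nodes on reload
$dom->formatOutput = true;
$dom->loadXML($strXml);
echo $dom->saveXML();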
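
The round trip described above can be sketched end to end as follows. This is only an illustration under some assumptions: `reformatXml()` is a hypothetical helper name, the sitemap is an inline string with made-up example.com URLs rather than small.xml, and a fresh DOMDocument is used for the reload so that the original document is left untouched.

```php
<?php
// Hypothetical helper (not from the original post): re-parses the serialised
// document without whitespace-only nodes so libxml can re-indent every element.
function reformatXml(DOMDocument $dom): string
{
    $fresh = new DOMDocument();
    $fresh->preserveWhiteSpace = false; // has to be set before loadXML()
    $fresh->formatOutput = true;        // read when the document is saved
    $fresh->loadXML($dom->saveXML());
    return $fresh->saveXML();
}

// Stand-in for small.xml; the URLs are placeholders.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/a.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
</urlset>
XML;

// Same steps as in the question: SimpleXML first, then convert to DOM.
$root = simplexml_load_string($xml);
$domRoot = dom_import_simplexml($root);
$dom = $domRoot->ownerDocument;

// Append a new <url> entry.
$url = $dom->createElement('url');
$url->appendChild($dom->createElement('loc', 'http://www.example.com/b.html'));
$url->appendChild($dom->createElement('lastmod', '2011-08-02'));
$domRoot->appendChild($url);

$unformatted = $dom->saveXML();   // new entry still sits on one line here
$output = reformatXml($dom);      // new entry indented like the others
echo $output;
```

The cost is a second parse of the whole document, which is probably negligible for a sitemap of this size but worth keeping in mind for very large files.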

2

To format the output nicely, you need to set the preserveWhiteSpace property to false before loading, as stated in the documentation.

Example:

$Xhtml = "<div><span></span></div>";
$doc = new DOMDocument('1.0','UTF-8');
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$doc->loadXML($Xhtml);
$formattedXhtml = $doc->saveXML($doc->documentElement, LIBXML_NOXMLDECL);
$expectedFormatting =<<<EOF
<div>
  <span/>
</div>
EOF;
$this->assertEquals($expectedFormatting,$formattedXhtml,"The XHTML is formatted");   

This is just for visitors who land here, as this was the first result on Google Search.
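
For the asker's SimpleXML-first workflow, the same "strip whitespace before loading" idea can probably be applied at the SimpleXML stage by passing the LIBXML_NOBLANKS parser option, so the converted DOMDocument never contains the whitespace-only nodes. The following is a minimal sketch under that assumption; the inline XML and example.com URLs are placeholders standing in for small.xml.

```php
<?php
// Stand-in for small.xml; the URLs are placeholders.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/a.html</loc>
    <lastmod>2011-08-23</lastmod>
  </url>
</urlset>
XML;

// LIBXML_NOBLANKS asks libxml to drop whitespace-only text nodes while parsing.
// simplexml_load_file() accepts the same third argument.
$root = simplexml_load_string($xml, SimpleXMLElement::class, LIBXML_NOBLANKS);

$domRoot = dom_import_simplexml($root);
$dom = $domRoot->ownerDocument;

// Append a new <url> entry, as in the question.
$url = $dom->createElement('url');
$url->appendChild($dom->createElement('loc', 'http://www.example.com/b.html'));
$url->appendChild($dom->createElement('lastmod', '2011-08-02'));
$domRoot->appendChild($url);

// With no blank text nodes in the tree, formatOutput can indent old and new
// elements alike without a save/reload round trip.
$dom->formatOutput = true;
$output = $dom->saveXML();
echo $output;
```

Compared with the save-and-reload workaround, this parses the file only once, at the price of discarding the original file's whitespace as soon as it is loaded.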


0

I had this same problem using code like Simon's.

It turns out that when you disable errors (either with $doc->loadHTML(..., LIBXML_NOERROR) or libxml_use_internal_errors(true);), the output is no longer formatted (example: https://3v4l.org/ur76E).

The solution is to leave libxml errors enabled and suppress them on the PHP side instead (with @).

Ugly, but it works: https://3v4l.org/BSJVu

The final silver bullet function looks like:

function beautifyDoc(DOMDocument $doc): void
{
    $previousLibXmlState = libxml_use_internal_errors(false);
    $previousErrorHandler = set_error_handler(null);
    try {
        $html = $doc->saveHTML();
        $doc->preserveWhiteSpace = false;
        $doc->formatOutput = true;
        @$doc->loadHTML($html);
    } finally {
        libxml_use_internal_errors($previousLibXmlState);
        set_error_handler($previousErrorHandler);
    }
}

// usage
$doc = new DOMDocument();
// ...load html and do stuff...
beautifyDoc($doc);
echo $doc->saveHTML(); // done

(It also takes care of the PHP error handler, if one is already set.)
