2

I'm trying to parse an XML file, but when loading it SimpleXML prints the following warning:

Warning: simplexml_load_file() [function.simplexml-load-file]: gpr_545.xml:55: parser error : Entity 'Oslash' not defined in import.php on line 35

This is that line:

<forenames>B&Oslash;IE</forenames><x> </x>

As it is a warning, I might ignore it, but I'd like to understand what is happening.

5 Answers

3

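HTML entities like &Oslash; are not the same as XML entities: XML predefines only &amp;, &lt;, &gt;, &quot; and &apos;, so any other named entity has to be declared in a DTD or replaced. Here's a table for replacing HTML entities with XML entities.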

As I can tell from one of your comments on another answer, you're having trouble with the entity &sol;. I don't know if this is even a valid HTML entity; my Firefox won't show the character and only outputs the entity name. But I found another table with most entities and their character reference numbers. Try adding them to your replacement table and you should be safe. &sol;'s reference number is &#47;, by the way.
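The replacement-table idea can be sketched in PHP without maintaining the table by hand. This is a minimal sketch, not the method from the linked tables: the function name `htmlEntitiesToNumeric` is hypothetical, and it assumes PHP 7 or later with the `iconv` extension, relying on the HTML5 entity table built into `html_entity_decode()`.

```php
<?php
// Hypothetical helper (not from the answer): rewrite named HTML entities as
// numeric character references, which any XML parser accepts.
function htmlEntitiesToNumeric(string $xml): string
{
    // The five entities XML predefines; these must stay as they are.
    $xmlPredefined = ['amp', 'lt', 'gt', 'quot', 'apos'];

    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($xmlPredefined): string {
            if (in_array($m[1], $xmlPredefined, true)) {
                return $m[0];
            }
            // Look the name up in PHP's built-in HTML5 entity table.
            $char = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            if ($char === $m[0]) {
                return $m[0]; // unknown name: leave it untouched
            }
            // Emit one numeric reference per Unicode code point.
            $out = '';
            foreach (unpack('N*', iconv('UTF-8', 'UCS-4BE', $char)) as $cp) {
                $out .= '&#' . $cp . ';';
            }
            return $out;
        },
        $xml
    );
}
```

Passing the file contents through this function before `simplexml_load_string()` turns `B&Oslash;IE` into `B&#216;IE`, which the parser accepts. Names that PHP's table does not know are left alone, so the parser will still report them.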


2 Comments

The first link is not available anymore, but the second one is working fine.
Both links are dead.
2

HTML encoding of Latin-1 characters (like &Oslash;, which stands for Ø) is what has broken the XML parser. If you're in control of the data, you need to escape it using an XML-style numeric character reference (Ø just happens to be &#216;).

3 Comments

Yes, unforgiving XML parsers break when they are expecting XML-style encoding of non-ASCII characters and are given HTML-style encoding instead.
ok. So I'm just parsing this. I looked at the table from Björn's answer, and it works for my first example, but the next problem is this entity which is not in that table: &sol; . Is there a more stable solution?
XSLT transforming the document before you pass it off to an XML parser would be one solution.
2

I think this is an encoding problem. PHP, SimpleXML in this particular case, does not like the Danish Ø you've got in that forenames tag. You could try encoding the whole file as UTF-8 and replacing the escaped version in the tag with the literal character. Afterwards you can read a file free of escaped characters into SimpleXML.
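The re-encoding step could look roughly like the sketch below. It assumes the input is ISO-8859-1 and that the `iconv` extension is available; the function name `toUtf8Xml` is hypothetical. Note that transcoding alone does not resolve `&Oslash;`: the named entity still has to be replaced by the literal character or a numeric reference before parsing.

```php
<?php
// Hypothetical helper: transcode an ISO-8859-1 XML document to UTF-8 and
// update the encoding named in its XML declaration to match.
function toUtf8Xml(string $latin1Xml): string
{
    // Convert the bytes themselves.
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1Xml);

    // Rewrite encoding="iso-8859-1" in the declaration (first match only),
    // so the parser does not misread the converted bytes.
    return preg_replace(
        '/(<\?xml[^>]*encoding=["\'])iso-8859-1(["\'])/i',
        '$1utf-8$2',
        $utf8,
        1
    );
}
```

The declaration has to change together with the bytes; otherwise the parser would decode the UTF-8 output as ISO-8859-1 and produce mojibake.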


5 Comments

not sure what you mean. This xml file is encoded as ISO-8859-1 (<?xml version="1.0" encoding="iso-8859-1"?>).
Right: use utf-8 instead of iso-8859-1
yepp, and make use of utf8_encode() for the actual encoding of the text.
that'd make sense if I were the author, but I'm on the parsing end so to say ;-)
You got the file, so you can read it line by line and encode it, can't you? I happened to write an XML filter application once for a Japanese customer. And believe me, doing this extra step before the actual parsing paid off... ;)
1

Just had a very similar problem and solved it in the following way. The main idea was to load the file into a string, replace all bad entities with something like "[[entity]]Oslash;", and carry out the reverse replacement before displaying an XML node.

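function readXML($filename) {
    // Read the whole file into one string.
    $xml_string = file_get_contents($filename);
    // Mask every ampersand so the parser never sees an entity reference.
    $xml_string = str_replace("&", "[[entity]]", $xml_string);
    return simplexml_load_string($xml_string);
}

function xml2str($xml) {
    // Restore the masked ampersands in the node's text.
    $str = str_replace("[[entity]]", "&", (string)$xml);
    // SimpleXML returns UTF-8; convert to the page's encoding.
    $str = iconv("UTF-8", "WINDOWS-1251", $str);
    return $str;
}

$filename = "gpr_545.xml"; // example: the file named in the question's warning
$xml = readXML($filename);
echo xml2str($xml->forenames);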

The iconv("UTF-8", "WINDOWS-1251", $str) call is there because my page uses the WINDOWS-1251 encoding.


0

Try to use this line:

<forenames><![CDATA[B&Oslash;IE]]></forenames><x> </x>

and read this about CDATA
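If the file cannot be edited by hand, the CDATA wrapping could be automated along these lines. This is only a sketch: the function name `wrapUnknownEntitiesInCdata` is hypothetical, and it assumes the undefined entities occur in element text, not in attribute values, where a CDATA section is not allowed.

```php
<?php
// Hypothetical helper: wrap every named entity that XML does not predefine
// in its own CDATA section so the parser treats it as plain text.
function wrapUnknownEntitiesInCdata(string $xml): string
{
    return preg_replace(
        // Any &name; except the five entities XML predefines.
        '/&(?!(?:amp|lt|gt|quot|apos);)[A-Za-z][A-Za-z0-9]*;/',
        '<![CDATA[$0]]>',
        $xml
    );
}
```

Keep in mind that the parsed node then contains the literal text `&Oslash;`, not the character Ø, so the entity still has to be decoded before display.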

2 Comments

Before parsing you should insert CDATA tag for every entity with "strange" characters.
if it's got this error in it, then it's not valid xml to begin with. up to you to tell the original authors to fix it or do this sort of check prior to parsing and wrap the invalid chunks
