3

I'm parsing very large XML files, so I need to use PHP's XMLReader. The files cannot be modified at the source, so they have to be parsed as they are. The problem is that the documents contain numeric character references ("&#...;") that the reader rejects as invalid.


        $reader = new XMLReader();
    
        if (!$reader->open($fileNamePath))//File xml
            {
            echo "Error opening file: $fileNamePath".PHP_EOL;
            continue;
            }
        echo "Processing file: $file".PHP_EOL;
       
           
        while($reader->read()) 
            {
            
            if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'AIUTO') 
                {
                
                try {
                    $input =$reader->readOuterXML();
                    $nodeAiuto = new SimpleXMLElement($input);
                    }
                catch(Exception $e)
                    {
                    echo "Error Node AIUTO ".$e->getMessage().PHP_EOL;
                    continue;
                    }
                //Do stuff here
                }
         }
    
         $reader->close();

I get a lot of messages like this:

PHP Warning: XMLReader::readOuterXml(): myfile.xml:162: parser error : xmlParseCharRef: invalid xmlChar value 2... Error Node AIUTO String could not be parsed as XML

Obviously the file contains the sequence &#2;.

Here is some XML from the file that causes the error:

<AIUTO><BASE_GIURIDICA_NAZIONALE>Quadro riepilogativo delle misure a sostegno delle imprese attive nei settori agricolo, forestale, della pesca 
e acquacoltura ai sensi della Comunicazione della Commissione europea C (2020) 1863 final – “Quadro 
temporaneo per le misure di aiuto di Stato a sostegno dell’economia nell’attuale emergenza del COVID&#2;19” e successive modifiche e integrazioni</BASE_GIURIDICA_NAZIONALE></AIUTO>

I thought about reading every file as text, line by line, and replacing the invalid sequences.

But that's a little tricky. Does anyone have a better solution?

3 Answers

0

What you can do is build a custom stream filter that applies all the fixes you need. This way you can keep reading the file as a stream with XMLReader, without loading the full content at once.

class fix_entities_filter extends php_user_filter
{
    function filter($in, $out, &$consumed, $closing): int
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $bucket->data = $this->fix($bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
    
    function fix($data)
    {
        return strtr($data, ['&#2;' => '&#x202f;']);
    }
}

stream_filter_register("fix_entities", "fix_entities_filter")
    or die("Failed to register filter");

$file = 'file.xml';
$fileNamePath = "/path/to/your/$file";
$path = "php://filter/read=fix_entities/resource=$fileNamePath";

$reader = new XMLReader();
    
if (!$reader->open($path)) {
    echo "Error opening file: $fileNamePath", PHP_EOL;
}

demo

You can find more information about stream filters in the PHP manual and also in the book "Modern PHP" by Josh Lockhart (O'Reilly).
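One caveat with a per-bucket strtr(): the file arrives in chunks, so a reference such as &#2; can be split across two buckets and slip through unreplaced. Below is a minimal sketch of a variant that holds back a trailing partial reference until the next chunk arrives. The class name boundary_safe_entities_filter and the filter name fix_entities_safe are made up for this example, and the replacement mapping is the same one used above. As with the filter above, drop the ": int" return type on PHP versions below 8.

```php
<?php
// Hypothetical variant of the filter above (names are assumptions, not from the answer).
class boundary_safe_entities_filter extends php_user_filter
{
    // Tail of the previous chunk that may be the start of a split reference.
    private $carry = '';

    public function filter($in, $out, &$consumed, $closing): int
    {
        // Gather everything available in this call.
        $data = '';
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            $data .= $bucket->data;
        }
        $data = $this->carry . $data;
        $this->carry = '';

        // Unless the stream is ending, hold back a trailing "&", "&#" or "&#<digits>"
        // so it can be completed by the next chunk.
        if (!$closing && preg_match('/&(?:#[0-9]{0,7})?\z/', $data, $m)) {
            $this->carry = $m[0];
            $data = substr($data, 0, -strlen($m[0]));
        }

        $data = strtr($data, ['&#2;' => '&#x202f;']);

        if ($data === '') {
            // Nothing to emit yet: ask for more input unless the stream is closing.
            return $closing ? PSFS_PASS_ON : PSFS_FEED_ME;
        }

        stream_bucket_append($out, stream_bucket_new($this->stream, $data));
        return PSFS_PASS_ON;
    }
}

stream_filter_register('fix_entities_safe', 'boundary_safe_entities_filter')
    or die('Failed to register filter');
```

It is used the same way as the first filter, with read=fix_entities_safe in the php://filter path. The held-back tail is at most a few bytes, so memory use stays flat.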

5 Comments

On PHP versions below 8 this throws a declaration warning. But the real problem with the solution is that, as it is, it outputs the whole file: gigabytes of text, and a lot of time to process the file.
@Jenemj: to avoid the warning with old PHP versions, remove ": int" (which is mandatory with PHP 8, to avoid a warning there too). This solution, contrary to what you seem to think, doesn't load the whole file but only "buckets" of 8192 bytes, so the memory used will be negligible.
I didn't say this solution loads the whole file, just that it outputs it...
@Jenemj: It doesn't "output the file". I added a var_dump() only in the demo, for the newly created SimpleXML object, to show what happens (strange that you didn't notice that). Also, this solution is about 2x faster.
@Jenemj: Are you able to compress the file with gzip before processing it?
-1

I've been there with an XML file, and I found that the best workaround is to replace the offending string with nothing:

$xml = str_replace('YOUR STRING', '', $xml);

If you can't delete the data in the XML, you can try to parse the XML and then loop over each element:

$xml = simplexml_load_file('file.xml');
foreach ($xml as $object) {
    // your code...
}

7 Comments

I cannot use SimpleXML for the whole file... It's too big.
And what about str_replace()?
The point is that the error is raised on the $reader->readOuterXML() line, before I can take any action on the bad string. The only way to use str_replace() is to process the file as text first.
Of course, you need to modify the string before passing it to the reader. Load the text into a variable with $string=file_get_contents($file), then do the str_replace() and pass the result to the reader.
Yes, but file_get_contents() cannot be used for the same reason as SimpleXML: my XML is huge. Your idea is right, but I had to do the same thing line by line.
-1

While waiting for a cleaner working solution, I went with my "dirty" idea for now.

I created a temporary XML file, removing the sequences that cause errors line by line.

This works:

$fileNamePath = "/path/to/your/file.xml";
$fileNamePathTmp = "/path/to/your/tmp.xml";

$handle = fopen($fileNamePath, "r");
$handle2 = fopen($fileNamePathTmp, "w");
if ($handle && $handle2) {
    // Copy the file line by line, dropping the invalid character references
    while (($line = fgets($handle)) !== false) {
        $line2 = str_replace(array("&#2;", "&#11;", "&#16;", "&#26;"), "", $line);
        fputs($handle2, $line2);
    }

    fclose($handle);
    fclose($handle2);
}

$reader = new XMLReader();

if (!$reader->open($fileNamePathTmp))//File xml tmp
    {
    echo "Error opening file: $fileNamePath".PHP_EOL;
    continue;
    }
echo "Processing file: $file".PHP_EOL;

   
while($reader->read()) 
    {
    
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'AIUTO') 
        {
        
        try {
            $input =$reader->readOuterXML();
            $nodeAiuto = new SimpleXMLElement($input);
            }
        catch(Exception $e)
            {
            echo "Error Node AIUTO ".$e->getMessage().PHP_EOL;
            continue;
            }
        //Do stuff here
        }
 }

 $reader->close();
 unlink($fileNamePathTmp);//Remove the temp xml
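The fixed list of four sequences only covers the references found so far. As a rough sketch, the same line-by-line cleanup could instead drop every numeric character reference whose code point falls outside the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). The function names is_valid_xml10_codepoint() and strip_invalid_char_refs() are made up for this example. References padded with many leading zeros (more than 7 decimal or 6 hex digits) are left untouched.

```php
<?php
// Hypothetical helpers; the ranges follow the XML 1.0 "Char" production.
function is_valid_xml10_codepoint(int $cp): bool
{
    return $cp === 0x9 || $cp === 0xA || $cp === 0xD
        || ($cp >= 0x20 && $cp <= 0xD7FF)
        || ($cp >= 0xE000 && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
}

// Removes decimal (&#2;) and hexadecimal (&#x1A;) references to characters
// that XML 1.0 does not allow; valid references are kept unchanged.
function strip_invalid_char_refs(string $text): string
{
    return preg_replace_callback(
        '/&#(?:x([0-9A-Fa-f]{1,6})|([0-9]{1,7}));/',
        function (array $m): string {
            // Group 1 is the hex form; it is an empty string when the decimal form matched.
            $cp = ($m[1] !== '') ? (int) hexdec($m[1]) : (int) $m[2];
            return is_valid_xml10_codepoint($cp) ? $m[0] : '';
        },
        $text
    );
}
```

In the copy loop, $line2 = strip_invalid_char_refs($line); would then take the place of the str_replace() call.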
