I'm loading a remote file with PHP and then trying to parse it with DOMDocument. The file contains HTML, CSS (inside a style tag), and JavaScript (inside a script tag). I load each part separately by passing html, css, or js into the function that parses it. The idea is that I can use core WordPress methods to display these in the proper locations.
This is the closest I've managed to get:
libxml_use_internal_errors( true );
$document = wp_remote_retrieve_body( $response ); // this is the remote HTML file
// create a new DomDocument object
$html = new DOMDocument( '1.0', 'UTF-8' );
// load the HTML into the DomDocument object (this would be your source HTML)
$html->loadHTML( $document, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
if ( 'html' === $part ) {
    $xpath  = new DOMXPath( $html );
    $remove = $xpath->query( "//*[style or script]" );
    foreach ( $remove as $node ) {
        $node->parentNode->removeChild( $node );
    }
} elseif ( 'css' === $part ) {
    $xpath  = new DOMXPath( $html );
    $remove = $xpath->query( "//*[not(self::style)]" );
    foreach ( $remove as $node ) {
        $node->parentNode->removeChild( $node );
    }
} elseif ( 'js' === $part ) {
    $xpath  = new DOMXPath( $html );
    $remove = $xpath->query( "//*[not(self::script)]" );
    foreach ( $remove as $node ) {
        $node->parentNode->removeChild( $node );
    }
}
// saveHTML() already returns the markup as a string, so no output buffering is needed
$output = $html->saveHTML();
This results in a few problems:
- On the CSS and JavaScript output, it keeps the style or script tag, and I'm trying to figure out how to remove it.
- On the HTML output, it keeps the <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><html><head></head><body> and I'd like to remove that as well.
I'm not sure if I need to take this in another direction, or if I just need a small change to remove these wrapping elements. I had a lot of trouble getting XPath to select the elements I want to keep, rather than the ones I want to remove, and that's how I've ended up where I am.
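One direction that might work is to select what to keep instead of what to remove. What follows is only a minimal sketch under stated assumptions: extract_part() is a hypothetical helper name that is not part of the code above, it drops the LIBXML_HTML_* flags so that libxml builds its usual html/head/body tree, and it treats everything inside body (minus style and script) as the HTML part. For CSS and JS it returns the text inside the tags, so the tags themselves never appear in the output.

```php
<?php
/**
 * Hypothetical helper (name and signature are assumptions for illustration).
 * Returns one part of a mixed document: 'html', 'css', or 'js'.
 */
function extract_part( string $document, string $part ): string {
    if ( '' === trim( $document ) ) {
        return '';
    }

    // Collect parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors( true );

    $dom = new DOMDocument();
    // No LIBXML_HTML_* flags here: libxml adds html/head/body itself, which
    // gives the tree a predictable shape. The wrappers are never serialized.
    $loaded = $dom->loadHTML( $document );

    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    if ( ! $loaded ) {
        return '';
    }

    $xpath = new DOMXPath( $dom );

    if ( 'css' === $part || 'js' === $part ) {
        // Keep only the text inside each style/script element.
        $tag    = ( 'css' === $part ) ? 'style' : 'script';
        $chunks = array();
        foreach ( $xpath->query( '//' . $tag ) as $node ) {
            $chunks[] = $node->textContent;
        }
        return implode( "\n", $chunks );
    }

    if ( 'html' !== $part ) {
        return '';
    }

    // Remove the style and script elements themselves (not their parents).
    foreach ( iterator_to_array( $xpath->query( '//style | //script' ) ) as $node ) {
        $node->parentNode->removeChild( $node );
    }

    $body = $dom->getElementsByTagName( 'body' )->item( 0 );
    if ( null === $body ) {
        return '';
    }

    // Serialize each child of body, which skips the doctype and wrappers.
    $output = '';
    foreach ( $body->childNodes as $child ) {
        $output .= $dom->saveHTML( $child );
    }
    return $output;
}
```

With something like this, $output = extract_part( $document, $part ); would stand in for the three branches above. The CSS and JS strings could then presumably be handed to wp_add_inline_style() and wp_add_inline_script(), assuming a suitable handle is already registered. Script tags that only carry a src attribute have no text content, so this sketch would return nothing for them.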