1

I'm loading a remote file with PHP and then trying to parse it with DOMDocument. The file contains HTML, CSS (inside a style tag), and JavaScript (inside a script tag). I then process it separately for each part by passing html, css, or js into the function that parses it. The idea is that I can use core WordPress methods to display each part in the proper location.

This is the closest I've managed to get:

libxml_use_internal_errors( true );
$document = wp_remote_retrieve_body( $response ); // this is the remote HTML file
// create a new DomDocument object
$html = new DOMDocument( '1.0', 'UTF-8' );
// load the HTML into the DomDocument object (this would be your source HTML)
$html->loadHTML( $document, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
if ( 'html' === $part ) {
    $xpath  = new DOMXPath( $html );
    $remove = $xpath->query( "//*[style or script]" );
    foreach ( $remove as $node ) {
        $node->parentNode->removeChild($node);
    }
} elseif ( 'css' === $part ) {
    $xpath  = new DOMXPath( $html );
    $remove = $xpath->query( "//*[not(self::style)]" );
    foreach ( $remove as $node ) {
        $node->parentNode->removeChild($node);
    }
} elseif ( 'js' === $part ) {
    $xpath  = new DOMXPath( $html );
    $remove = $xpath->query( "//*[not(self::script)]" );
    foreach ( $remove as $node ) {
        $node->parentNode->removeChild($node);
    }
}

ob_start();
echo $html->saveHTML();
$output = ob_get_contents();
ob_end_clean();

This results in a few problems:

  1. On the CSS and JavaScript output, it keeps the style or script tag, and I'm trying to figure out how to remove it.
  2. On the HTML output, it keeps the <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><html><head></head><body> and I'd like to remove that as well.

I'm not sure if I need to take this in another direction, or if I just need a small thing to remove these wrapping elements. But I had a lot of trouble getting xpath to relate to the elements I want to keep, rather than the ones I want to remove, and that's how I've ended up where I am.

2 Answers

2

For your html case, instead of saving the whole DOMDocument, you can save just the <body> element.

libxml_use_internal_errors( true );
$document = wp_remote_retrieve_body( $response ); // this is the remote HTML file
// create a new DomDocument object
$html = new DOMDocument( '1.0', 'UTF-8' );
// load the HTML into the DomDocument object (this would be your source HTML)
$html->loadHTML( $document, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
if ( 'html' === $part ) {
    // get all <body> elements
    $body_elements = $html->getElementsByTagName( 'body' );
    // we assume there is only one <body> element; item( 0 ) is null if there is none
    $body = $body_elements->item( 0 );
    // serialize the <body> element (the result includes the <body> tag itself)
    $output = $body ? $html->saveHTML( $body ) : '';
} else {
    // ...
}
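
If the <body> wrapper itself is unwanted, one option is to serialize the body's child nodes one at a time instead of the element. This is only a minimal sketch under some assumptions: the document is loaded without LIBXML_HTML_NOIMPLIED so that the parser supplies a <body> when the source has none, and the function name extract_body_inner_html is made up for illustration.

```php
<?php
/**
 * Return the markup inside <body>, without the <body> tag and without
 * any <style> or <script> elements. Hypothetical helper for illustration.
 */
function extract_body_inner_html( string $source ): string {
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if ( '' === trim( $source ) ) {
        return '';
    }

    $previous = libxml_use_internal_errors( true );
    $doc      = new DOMDocument();
    // No LIBXML_HTML_NOIMPLIED here, so <html> and <body> are added if missing.
    $doc->loadHTML( $source );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    // Select the <style> and <script> elements themselves and remove them.
    $xpath = new DOMXPath( $doc );
    foreach ( $xpath->query( '//style | //script' ) as $node ) {
        $node->parentNode->removeChild( $node );
    }

    $body = $doc->getElementsByTagName( 'body' )->item( 0 );
    if ( null === $body ) {
        return '';
    }

    // Serialize each child of <body> so the wrapper tag is never emitted.
    $output = '';
    foreach ( $body->childNodes as $child ) {
        $output .= $doc->saveHTML( $child );
    }
    return $output;
}
```

Called as extract_body_inner_html( $document ), this should give the page markup with no doctype, <html>, <head>, or <body> wrapper, so no str_replace cleanup is needed afterwards.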

For the CSS and JS, I'm not sure why you'd need just their inner contents without the containing tag, but an approach similar to what we just did with $body would work: (1) select the style or script elements, (2) loop over that list of elements with foreach, and (3) read each element's inner contents (I believe, but am not certain, that this is a DOMText child node) and concatenate those strings to build your eventual $output variable.
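
The three steps above could be sketched roughly as follows. This reads each element's textContent property, which holds the text inside the element without its tags, rather than working with the child node directly. The function name extract_tag_contents and the sample markup are assumptions made for this example.

```php
<?php
/**
 * Collect the text inside every element with the given tag name
 * ('style' or 'script'), without the surrounding tags.
 * Hypothetical helper for illustration.
 */
function extract_tag_contents( string $source, string $tag ): string {
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if ( '' === trim( $source ) ) {
        return '';
    }

    $previous = libxml_use_internal_errors( true );
    $doc      = new DOMDocument();
    $doc->loadHTML( $source );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    $parts = array();
    foreach ( $doc->getElementsByTagName( $tag ) as $node ) {
        // textContent is the text inside the element, not its markup.
        $text = trim( $node->textContent );
        if ( '' !== $text ) {
            $parts[] = $text;
        }
    }
    return implode( "\n", $parts );
}

// Sample input standing in for the remote file.
$source = '<style>p { color: red; }</style><p>Hello</p><script>console.log("hi");</script>';
$css    = extract_tag_contents( $source, 'style' );
$js     = extract_tag_contents( $source, 'script' );
```

Because only the matched elements are read, text from other elements such as the p tag never ends up in the CSS or JS output.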

An alternate approach for CSS and JS: take the cluster of <script> or <style> elements that your existing approach leaves behind, insert them into a blank DOMDocument's <head>, save that <head> as an HTML string, and then enqueue that string via an anonymous function on WordPress' wp_enqueue_scripts hook:

/**
 * https://stackoverflow.com/questions/66361476/separate-html-css-and-javascript-from-file-with-domdocument
 */
function wpse_66361476_alert() {
    $output = "<script>alert('hello');</script>"; // demonstration content
    add_action(
        'wp_enqueue_scripts',
        function() use ($output) {
            echo $output;
        }
    );
}
add_action('init', 'wpse_66361476_alert');

That approach is dangerous if you don't control the CSS and JS (and HTML) that you're outputting. It may be better to iframe in whatever you're loading here.

To improve page load speed if your host is not already using a frontend cache, you may want to look into caching the parsed elements using WordPress' caching functions. Here's a short overview; talk to your hosting provider to see if there's specific advice they have.
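
As a rough illustration of that caching idea, the sketch below stores a parsed part with get_transient() and set_transient() from WordPress' Transients API, so it only runs inside WordPress. The helper name get_cached_remote_part, the remote_part_ key prefix, and the one-hour lifetime are assumptions chosen for the example.

```php
<?php
/**
 * Return a cached part if one exists; otherwise build it and cache it.
 * Hypothetical helper: the key prefix and one-hour lifetime are assumptions.
 *
 * @param string   $part  'html', 'css', or 'js'.
 * @param callable $build Callback that fetches and parses the remote file.
 */
function get_cached_remote_part( string $part, callable $build ): string {
    $key    = 'remote_part_' . $part;
    $cached = get_transient( $key );

    // get_transient() returns false when nothing is stored or it has expired.
    if ( false !== $cached ) {
        return $cached;
    }

    $value = $build( $part );
    set_transient( $key, $value, HOUR_IN_SECONDS );
    return $value;
}
```

The callback is only invoked on a cache miss, so the remote request and the DOM parsing are skipped until the transient expires.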


2 Comments

This almost works! It keeps the body tag. I could just do a str_replace and get rid of it that way. It seems like there should be a nicer way, but I can work with that!
In any case, this works. If I remove the body tag, I can use this pattern to get the contents of style, script, and body. Thank you!
0

The issue is with the DOMNode(s). Check out the question "DOMDocument remove script tags from HTML source", which should give you an idea of how to modify your code.

1 Comment

The problem I've run into when using getElementsByTagName instead is that it keeps the plain text that's inside the various HTML elements. So in the section where I want CSS, I end up outputting CSS rules followed by plain text that used to be inside a p tag.
