Separate HTML, CSS, and JavaScript from file with DomDocument

Question

I'm loading a remote file with PHP and then trying to parse it with DOMDocument. The file contains HTML, CSS (inside a style tag), and JavaScript (inside a script tag). I load each part separately by passing html, css, or js into the function that parses it. The idea is that I can use core WordPress methods to display these in the proper locations.

This is the closest I've managed to get:

libxml_use_internal_errors( true );
$document = wp_remote_retrieve_body( $response ); // this is the remote HTML file
// create a new DomDocument object
$html = new DOMDocument( '1.0', 'UTF-8' );
// load the HTML into the DomDocument object (this would be your source HTML)
$html->loadHTML( $document, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
if ( 'html' === $part ) {
    $xpath  = new DOMXPath( $html );
    $remove = $xpath->query( "//*[style or script]" );
    foreach ( $remove as $node ) {
        $node->parentNode->removeChild($node);
    }
} elseif ( 'css' === $part ) {
    $xpath  = new DOMXPath( $html );
    $remove = $xpath->query( "//*[not(self::style)]" );
    foreach ( $remove as $node ) {
        $node->parentNode->removeChild($node);
    }
} elseif ( 'js' === $part ) {
    $xpath  = new DOMXPath( $html );
    $remove = $xpath->query( "//*[not(self::script)]" );
    foreach ( $remove as $node ) {
        $node->parentNode->removeChild($node);
    }
}

ob_start();
echo $html->saveHTML();
$output = ob_get_contents();
ob_end_clean();

This results in a few problems:

On the CSS and JavaScript output, it keeps the style or script tag, and I'm trying to figure out how to remove it.
On the HTML output, it keeps the <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><html><head></head><body> and I'd like to remove that as well.

I'm not sure if I need to take this in another direction, or if I just need a small thing to remove these wrapping elements. But I had a lot of trouble getting xpath to relate to the elements I want to keep, rather than the ones I want to remove, and that's how I've ended up where I am.

Jonathan Stegall · Accepted Answer · 2021-03-03 04:44:43Z

2

For your html case, instead of saving the whole DOMDocument, you can save just the <body> element.

libxml_use_internal_errors( true );
$document = wp_remote_retrieve_body( $response ); // this is the remote HTML file
// create a new DomDocument object
$html = new DOMDocument( '1.0', 'UTF-8' );
// load the HTML into the DomDocument object (this would be your source HTML)
$html->loadHTML( $document, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
if ( 'html' === $part ) {
    // get all <body> elements
    $body_elements = $html->getElementsByTagName( 'body' );
    // assume there is only one <body> element; item( 0 ) is null if there is none
    $body = $body_elements->item( 0 );
    // serialize the <body> element; the result includes the <body> tag itself
    $output = $body ? $html->saveHTML( $body ) : '';
} else {
    // ...
}
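
Note that saveHTML( $body ) returns the element together with its own <body> tag. If only the markup inside it is wanted, one option is to serialize each child node and concatenate the results. This is a minimal sketch, not a definitive implementation: the helper name wpse_66361476_inner_html and the demonstration $document string are assumptions made for illustration, and it strips <style> and <script> first so only the HTML part remains.

```php
<?php
/**
 * Hypothetical helper (not part of PHP or WordPress): return the markup
 * inside an element, without the element's own opening and closing tags.
 */
function wpse_66361476_inner_html( DOMNode $element ): string {
    $inner = '';
    foreach ( $element->childNodes as $child ) {
        // saveHTML( $node ) serializes only that node and its descendants.
        $inner .= $element->ownerDocument->saveHTML( $child );
    }
    return $inner;
}

// Demonstration content standing in for the remote file.
$document = '<html><head><style>p { color: red; }</style></head>'
    . '<body><p>Hello</p><script>alert(1);</script></body></html>';

libxml_use_internal_errors( true );
$html = new DOMDocument( '1.0', 'UTF-8' );
$html->loadHTML( $document, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );

// Remove the <style> and <script> elements themselves.
// The node list returned by DOMXPath::query() is not live, so removing
// nodes while looping over it is safe.
$xpath = new DOMXPath( $html );
foreach ( $xpath->query( '//style | //script' ) as $node ) {
    $node->parentNode->removeChild( $node );
}

$body   = $html->getElementsByTagName( 'body' )->item( 0 );
$output = $body ? wpse_66361476_inner_html( $body ) : '';
```

With this, $output should hold the paragraph markup only, with no <body> wrapper, which avoids a str_replace on the serialized string.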

For the CSS and JS elements, I'm not sure why you'd need to get just their inner contents without the containing tag, but a similar approach to what we just did with $body would work: 1. select the elements, 2. loop over the resulting DOMNodeList with foreach, 3. get each element's inner contents (I believe but am not certain this will be a DOMText object) and concatenate those strings to create your eventual $output variable.
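
The three steps above could be sketched roughly as follows. This assumes the textContent property is enough to read the raw text inside each <style> or <script> element; the helper name wpse_66361476_inner_text and the demonstration $document string are hypothetical and only here for illustration.

```php
<?php
/**
 * Hypothetical helper (not part of PHP or WordPress): concatenate the text
 * inside every element with the given tag name, leaving out the tags.
 */
function wpse_66361476_inner_text( DOMDocument $html, string $tag ): string {
    $pieces = array();
    // Steps 1 and 2: select the elements and loop over the DOMNodeList.
    foreach ( $html->getElementsByTagName( $tag ) as $element ) {
        // Step 3: textContent is the text inside the element, without its tag.
        $pieces[] = trim( $element->textContent );
    }
    return implode( "\n", $pieces );
}

// Demonstration content standing in for the remote file.
$document = '<html><head><style>p { color: red; }</style></head>'
    . "<body><p>Hello</p><script>alert('hello');</script></body></html>";

libxml_use_internal_errors( true );
$html = new DOMDocument( '1.0', 'UTF-8' );
$html->loadHTML( $document, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );

$css = wpse_66361476_inner_text( $html, 'style' );  // CSS rules only
$js  = wpse_66361476_inner_text( $html, 'script' ); // JavaScript source only
```

Because only the matching elements are read, text from other elements such as <p> does not end up in the CSS or JS output.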

An alternate approach for CSS and JS: take the <script> or <style> elements that your existing approach collects, insert them into a blank DOMDocument's <head>, save that <head> as an HTML string, and then output that string via an anonymous function on WordPress' wp_enqueue_scripts hook:

/**
 * https://stackoverflow.com/questions/66361476/separate-html-css-and-javascript-from-file-with-domdocument?newreg=231eb52469c14d8c9c45ee9969df031a
 */
function wpse_66361476_alert() {
    $output = "<script>alert('hello');</script>"; // demonstration content
    add_action(
        'wp_enqueue_scripts',
        function() use ($output) {
            echo $output;
        }
    );
}
add_action('init', 'wpse_66361476_alert');

That approach is dangerous if you don't control the CSS and JS (and HTML) that you're outputting. It may be better to iframe in whatever you're loading here.

To improve page load speed if your host is not already using a frontend cache, you may want to look into caching the parsed elements using WordPress' caching functions. Here's a short overview; talk to your hosting provider to see if there's specific advice they have.
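
As a rough illustration of that caching idea, here is a sketch that assumes the Transients API (get_transient and set_transient) is the caching layer in use. The function name wpse_66361476_cached_part, the $parse callback, and the one-hour lifetime are all assumptions for this example; it only runs inside WordPress.

```php
<?php
/**
 * Hypothetical helper: return the parsed part from a transient if one is
 * stored, otherwise fetch the file, parse it, and cache the result.
 *
 * @param string   $url   URL of the remote file.
 * @param string   $part  'html', 'css', or 'js'.
 * @param callable $parse Callback taking ( $document, $part ) and returning a string.
 * @return string
 */
function wpse_66361476_cached_part( $url, $part, $parse ) {
    // One cache entry per URL and part.
    $key    = 'wpse_66361476_' . md5( $url . '|' . $part );
    $cached = get_transient( $key );
    if ( false !== $cached ) {
        return $cached;
    }

    $response = wp_remote_get( $url );
    if ( is_wp_error( $response ) ) {
        return ''; // do not cache failed requests
    }

    $output = call_user_func( $parse, wp_remote_retrieve_body( $response ), $part );
    set_transient( $key, $output, HOUR_IN_SECONDS ); // assumed lifetime: one hour
    return $output;
}
```

The $parse callback would be whatever function already does the DOMDocument work, so the remote request and the parsing are skipped while the transient is still valid.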

edited Mar 3, 2021 at 4:44

Jonathan Stegall

answered Mar 2, 2021 at 1:40

Ben Keith

2 Comments

Jonathan Stegall Over a year ago

This almost works! It keeps the body tag. I could just do a str_replace and get rid of it that way. It seems like there should be a nicer way, but I can work with that!

Jonathan Stegall Over a year ago

In any case, this works. If I remove the body tag, I can use this pattern to get the contents of style, script, and body. Thank you!

Herbert Peters · Accepted Answer · 2021-02-25 02:48:30Z

0

The issue is with the DOMNode(s). Check out "DOMDocument remove script tags from HTML source", which should give you an idea of how to modify your code.

answered Feb 25, 2021 at 2:48

Herbert Peters

1 Comment

Jonathan Stegall Over a year ago

The problem I've run into when using getElementsByTagName instead is that it keeps the plain text that's inside the various HTML elements. So in the section where I want CSS, I end up outputting CSS rules followed by plain text that used to be inside a p tag.
