0

I am looking to remove anything that is not an HTML tag from an HTML document. In other words, I want to strip all the text from the document and keep only the tags.

I have the below regex to remove all HTML from a string, but need help with the opposite scenario.

$string =~ s/<[^>]+>//g;

Thanks.

  • NooooooooOOOOooOOOOoooOOoooo!!!! stackoverflow.com/questions/1732348/… Commented Oct 22, 2013 at 23:00
  • Please don't do this. This is the way to madness. Commented Oct 22, 2013 at 23:02
  • What is not an HTML tag in an HTML document? If it's well-formed, everything except comments goes inside a tag of some sort. Are you looking for text inside the body not inside another tag? Commented Oct 22, 2013 at 23:12
  • @Ethan Brown: Yes, looking to eliminate the text that is not within an HTML tag. Commented Oct 22, 2013 at 23:19
  • You didn't really answer my question. For example, if this is your document: <html><body>Here's some <b>bold</b> text!</body></html>, are you looking for the strings "Here's some " and " text!"? Because neither of those strings is outside of an HTML tag (they're both inside the <body> tag). Commented Oct 22, 2013 at 23:23

4 Answers 4

1

If this s/// substitution removes all HTML from the document:

$string =~ s/<[^>]+>//g;

then you can use the same regex with the m// match operator to keep only the HTML:

$string = join '', $string =~ m/<[^>]+>/g;

If the above regex satisfies your requirements, then you're done :) But you may want to consider this older, slightly longer regex pattern: http://perlmonks.org/?node_id=161281. Mind the caveats that Ethan Brown mentions in the comments :)
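As a rough illustration of the match-and-join idea, here is a minimal sketch that runs it on the sample document from the question's comments. The helper name keep_tags is made up for this example; the pattern is the question's regex, unchanged, so it carries the same limitations.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the tags, using the question's pattern as-is.
sub keep_tags {
    my ($html) = @_;
    # In list context, m//g returns every match; join glues them back together.
    return join '', $html =~ m/<[^>]+>/g;
}

my $sample = q{<html><body>Here's some <b>bold</b> text!</body></html>};
my $tags_only = keep_tags($sample);
print "$tags_only\n";    # <html><body><b></b></body></html>
```

Wrapping the expression in a sub also leaves the original string untouched, which may or may not be what you want.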


2 Comments

This idea (extracting all tags) is better than deleting anything between tags. However, your regex fails for <!-- > --> (→ <!-- >) or <script> 3 < 4 </script> (→ <script>< 4 </script>). Still +1 for linking to a better regex.
:) you already said that amon, it's the OP's regex unchanged :)
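The two failure cases raised in the comment above can be reproduced with a short sketch. The helper name keep_tags is hypothetical; the pattern is the same m/<[^>]+>/g used in the answer.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the answer's match-and-join expression.
sub keep_tags {
    my ($html) = @_;
    return join '', $html =~ m/<[^>]+>/g;
}

# A comment containing '>' is cut off at the first '>'.
my $comment_result = keep_tags('<!-- > -->');

# A '<' inside script content is mistaken for the start of a tag.
my $script_result = keep_tags('<script> 3 < 4 </script>');

print "$comment_result\n";    # <!-- >
print "$script_result\n";     # <script>< 4 </script>
```

Both outputs are malformed because the pattern has no notion of comments or script content, which is the usual argument for reaching for a real parser.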
1

Ethan Brown namechecks HTML::DOM as if it were the only CPAN solution.

HTML::Parser is more widely used, and a quick search turns up further alternatives.

http://metacpan.org/pod/HTML::Parser

A solution using HTML::Parser is (tested once):

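use strict;
use warnings;
use HTML::Parser ();

my $p = HTML::Parser->new(api_version => 3);
# Discard text events (the empty argspec passes nothing to the callback).
$p->handler( text => sub { }, "" );
# Print the original source of every other event (tags, comments, declarations).
$p->handler( default => sub { print shift }, "text" );
$p->parse_file('content.html') || die $!;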

Comments

0

Are you looking for this?

$string =~ s/>[^<]*</></mg;

Or this?

$string =~ s/(?<=>)[^<]*(?=<)//mg;
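To compare the two substitutions, here is a small sketch that applies both to the sample document from the question's comments, plus one input that ends in text rather than a tag. The variable names are made up for the example, and the /m flag is dropped because neither pattern uses ^ or $.

```perl
use strict;
use warnings;

my $sample = q{<html><body>Here's some <b>bold</b> text!</body></html>};

# Variant 1: consume the surrounding '>' and '<', then put them back.
(my $consuming = $sample) =~ s/>[^<]*</></g;

# Variant 2: lookbehind/lookahead assert the delimiters without consuming them.
(my $lookaround = $sample) =~ s/(?<=>)[^<]*(?=<)//g;

# Text that runs to the end of the input has no following '<', so it survives.
(my $trailing = '<h1>Headline</h1><p>Text until EOF') =~ s/>[^<]*</></g;

print "$consuming\n";     # <html><body><b></b></body></html>
print "$lookaround\n";    # <html><body><b></b></body></html>
print "$trailing\n";      # <h1></h1><p>Text until EOF
```

On this input the two variants give the same result; both only remove text that sits between a '>' and a following '<'.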

1 Comment

Your solution fails on comments like <!-- > --><p> (→ --><p>) and on script tags like <script> 2 < 4 </script> (→ <script>< 4 </script>). Also, text at the end of a document without explicit head or body isn't removed: <h1>Headline</h1><p>Text until EOF (→ <h1></h1><p>Text until EOF).
0

XML::LibXML makes it easy to select everything that isn't a tag, comment, or processing instruction and remove it:

#!/usr/bin/perl --
use strict;
use warnings;
use XML::LibXML 1.70; ## for load_html/load_xml/location
use XML::LibXML::PrettyPrint;

Main( @ARGV );
exit( 0 );
sub Main {
    binmode STDOUT;
    my $loc = shift or die "
Usage:
    $0  ko00010.html
    $0  http://example.com/ko00010.html\n\n";

    my $dom = XML::LibXML->new(
        qw/
          recover 2
          no_blanks 1
          /
    )->load_html( location => $loc, );

## http://www.w3.org/TR/xpath/#node-tests
## http://www.w3.org/TR/xpath/#NT-NodeType
## http://www.w3.org/TR/xpath/#section-Text-Nodes
    for my $text ( $dom->findnodes(q{ //text() }) ){
        node_detach( $text );
    }


    local $XML::LibXML::skipXMLDeclaration = 1; ## <?xml ?>
    local $XML::LibXML::setTagCompression = 0;  ## <p />

#~     print "$dom";

    my $pp  = XML::LibXML::PrettyPrint->new_for_html;
    $pp->{indent_string}=' ';
    print $pp->pretty_print( $dom );
}
sub node_detach {
    my( $self ) = @_;
    $self->parentNode->removeChild( $self );
}

1 Comment

It's worth noting that any compliant DOM-based solution will wrap an HTML fragment inside a minimal <html><body>... skeleton. This parser also sticks to HTML4 semantics (in contrast to HTML5), and will introduce closing tags where there weren't any in the input.
