Perl Regex for Not HTML

Question

I want to remove everything that is not an HTML tag from an HTML document. In other words, I am trying to strip all the text and keep only the markup.

I have the below regex to remove all HTML from a string, but need help with the opposite scenario.

$string =~ s/<[^>]+>//g;

Thanks.

NooooooooOOOOooOOOOoooOOoooo!!!! stackoverflow.com/questions/1732348/…
– meda, Commented Oct 22, 2013 at 23:00
What is not an HTML tag in an HTML document? If it's well-formed, everything except comments goes inside a tag of some sort. Are you looking for text inside the body not inside another tag?
– Ethan Brown, Commented Oct 22, 2013 at 23:12
@Ethan Brown: Yes, looking to eliminate the text that is not within an HTML tag.
– user333746, Commented Oct 22, 2013 at 23:19
You didn't really answer my question. For example, if this is your document: <html><body>Here's some <b>bold</b> text!</body></html>, are you looking for the strings "Here's some " and " text!"? Because neither of those strings are outside of an HTML tag (they're both inside the <body> tag).
– Ethan Brown, Commented Oct 22, 2013 at 23:23

optional · Accepted Answer · 2013-10-24 07:52:29Z

1

If this s/// substitution removes all HTML from the document:

$string =~ s/<[^>]+>//g;

Then you can use the same regex with the m// match operator to keep only the HTML from the document:

$string = join '', $string =~ m/<[^>]+>/g;

If the above regex satisfies your requirements, then you're done :) But you may want to consider this older, slightly longer regex pattern: http://perlmonks.org/?node_id=161281 Mind the caveats that Ethan Brown mentions :)
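As a quick check of the match-and-join idea, here is a minimal sketch run against the sample document from the question's comments. The variable name $tags_only is only illustrative, and the caveats about comments and script content still apply.

```perl
use strict;
use warnings;

# Sample document taken from the comments under the question.
my $string = q{<html><body>Here's some <b>bold</b> text!</body></html>};

# In list context, m//g returns every match; join glues the tags back together.
my $tags_only = join '', $string =~ m/<[^>]+>/g;

print "$tags_only\n";   # prints <html><body><b></b></body></html>
```

Because the original $string is left untouched here, the result can be compared against the input before overwriting anything.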

edited Oct 24, 2013 at 7:52

answered Oct 24, 2013 at 7:41

optional


2 Comments

amon Over a year ago

This idea (extracting all tags) is better than deleting anything between tags. However, your regex fails for  → <!-- > or <script> 3 < 4 </script> → <script>< 4 </script>. Still +1 for linking to a better regex.

optional Over a year ago

:) you already said that, amon; it's the OP's regex, unchanged :)

szabgab · Accepted Answer · 2014-04-05 13:46:00Z

1

Ethan Brown namechecks HTML::DOM as if it were the only CPAN solution.

HTML::Parser is more ubiquitous, but it's not hard to Google for more.

http://metacpan.org/pod/HTML::Parser

A solution using HTML::Parser is (tested once):

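use strict;
use warnings;
use HTML::Parser ();

my $p = HTML::Parser->new(api_version => 3);
# Discard text events; the empty argspec passes nothing to the handler.
$p->handler( text => sub { }, "");
# Print the original source text of every other event (tags, comments, ...).
$p->handler( default => sub { print shift }, "text");
$p->parse_file('content.html') || die $!;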

edited Apr 5, 2014 at 13:46

szabgab


answered Oct 23, 2013 at 19:13

ashley


traybold · Accepted Answer · 2013-10-22 23:45:52Z

0

Are you looking for this?

$string =~ s/>[^<]*</></mg;

Or this?

$string =~ s/(?<=>)[^<]*(?=<)//mg;

answered Oct 22, 2013 at 23:45

traybold


1 Comment

amon Over a year ago

Your solution fails on comments like <p> → --><p> and on script tags like <script> 2 < 4 </script> → <script>< 4 </script>. Also, text at the end of a document without explicit head or body isn't removed: <h1>Headline</h1><p>Text until EOF → <h1></h1><p>Text until EOF
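The script and end-of-document cases from this comment can be reproduced with a small sketch. The helper names strip_consuming and strip_lookaround are hypothetical, and the /m flag from the answer is dropped because neither pattern uses ^ or $.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two substitutions from the answer.
sub strip_consuming  { my ($s) = @_; $s =~ s/>[^<]*</></g;        return $s }
sub strip_lookaround { my ($s) = @_; $s =~ s/(?<=>)[^<]*(?=<)//g; return $s }

my @samples = (
    q{<p>Some <b>bold</b> text</p>},        # simple case: both variants work
    q{<script> 2 < 4 </script>},            # a "<" inside script content
    q{<h1>Headline</h1><p>Text until EOF},  # trailing text with no closing tag
);

for my $in (@samples) {
    printf "%s\n  consuming:  %s\n  lookaround: %s\n",
        $in, strip_consuming($in), strip_lookaround($in);
}
# The simple case becomes <p><b></b></p>; the script sample becomes
# <script>< 4 </script>; the last sample keeps "Text until EOF".
```

Both variants behave the same on these inputs, because each needs a ">" before the text and a "<" after it.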

optional · Accepted Answer · 2013-10-24 08:28:18Z

0

XML::LibXML makes it easy to select everything that isn't a tag, comment, or processing instruction (that is, the text nodes) and remove it:

#!/usr/bin/perl --
use strict;
use warnings;
use XML::LibXML 1.70; ## for load_html/load_xml/location
use XML::LibXML::PrettyPrint;

Main( @ARGV );
exit( 0 );
sub Main {
    binmode STDOUT;
    my $loc = shift or die "
Usage:
    $0  ko00010.html
    $0  http://example.com/ko00010.html\n\n";

    my $dom = XML::LibXML->new(
        qw/
          recover 2
          no_blanks 1
          /
    )->load_html( location => $loc, );

## http://www.w3.org/TR/xpath/#node-tests
## http://www.w3.org/TR/xpath/#NT-NodeType
## http://www.w3.org/TR/xpath/#section-Text-Nodes
    for my $text ( $dom->findnodes(q{ //text() }) ){
        node_detach( $text );
    }


    local $XML::LibXML::skipXMLDeclaration = 1; ## <?xml ?>
    local $XML::LibXML::setTagCompression = 0;  ## <p />

#~     print "$dom";

    my $pp  = XML::LibXML::PrettyPrint->new_for_html;
    $pp->{indent_string}=' ';
    print $pp->pretty_print( $dom );
}
sub node_detach {
    my( $self ) = @_;
    $self->parentNode->removeChild( $self );
}

answered Oct 24, 2013 at 8:28

optional


1 Comment

amon Over a year ago

It's worth noting that any compliant DOM-based solution will wrap the HTML fragment inside a minimal <html><body>... wrapper. This parser also sticks to HTML4 semantics (in contrast to HTML5), and will introduce closing tags where there weren't any in the input.
