Perl regex on HTML markup

Question

I just want to delete the block between

<!DOCTYPE html>

and

 <body>

including those ends, using a perl regex.

Example text:

<!DOCTYPE html>


<meta charset="utf-8">
<meta name="generator" content="pandoc">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<title></title>
<style>code{white-space: pre;}</style>



<![endif]-->;

<body>
.
.
.
anything here

This is only a sample; my real file contains a long embedded JavaScript block.

I usually test my regexes on the regex101 website, and I came up with this one:

<\!DOCTYPE html>(\n.*)*<body>

and this one, which also tolerates spaces or tabs inside the <body> tag:

s/<\!DOCTYPE html>(\n.*)*<[ \t]*body[ \t]*>//gi;

It seems to work well on that website, but it doesn't work when I run it inside a Perl script.

PERL SCRIPT (using @Jan's answer):

#!/usr/bin/perl
use strict;
use warnings;

my $dirtfile = $ARGV[0];
my $cleanfile = "clean.html";

open(IN, "<", $dirtfile) or die "Can't open $dirtfile: $!";
open(OUT, ">", $cleanfile) or die "Can't open $cleanfile: $!";

while (<IN>) {
  s/(?s)<!DOCTYPE html>.+?<body>(?-s)//gi;
  print(OUT);
}

OUTPUT:

the same as input

"but it doesn't work" <= we'll need way more information about that. – Thomas Ayoub, commented Mar 3, 2016 at 12:58

mut3 · Accepted Answer · 2019-03-20 19:02:33Z

2

I'm pretty sure you're reading the file line by line, which renders your regex useless: no single line ever contains both <!DOCTYPE html> and <body>, so the pattern can never match. You'll either need to read the entire file into a string and apply the regex to that, or change your loop logic to skip every line until you have passed the <body> tag.
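A minimal sketch of the first option, reading the whole file into one string before substituting. It reuses the file names from the question's script; the helper name strip_head is hypothetical, and the pattern assumes a plain <body> tag with no attributes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove everything from <!DOCTYPE html> through the
# first <body> tag. It expects the whole document in a single string.
sub strip_head {
    my ($html) = @_;
    # /s lets . match newlines, /i ignores case, .*? stops at the first <body>.
    $html =~ s/<!DOCTYPE html>.*?<[ \t]*body[ \t]*>//si;
    return $html;
}

# Usage: perl strip_head.pl dirty.html   (writes clean.html, as in the question)
if (@ARGV) {
    my $dirtfile  = $ARGV[0];
    my $cleanfile = "clean.html";

    open(my $in, "<", $dirtfile) or die "Can't open $dirtfile: $!";
    my $html = do { local $/; <$in> };   # undefined $/ slurps the whole file
    close($in);

    open(my $out, ">", $cleanfile) or die "Can't open $cleanfile: $!";
    print {$out} strip_head($html);
    close($out) or die "Can't close $cleanfile: $!";
}
```

Localizing $/ inside the do block keeps the slurp behaviour from leaking into any other reads in the script.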

In general, you should avoid working on HTML with regexes. Use an HTML/DOM parser instead.

edited Mar 20, 2019 at 19:02

answered Mar 3, 2016 at 13:29

mut3


Sinan Ünür · Accepted Answer · 2016-03-03 14:20:19Z

1

Since you are not really parsing HTML, but instead chopping off a leading part of the file, you may get away with using regular expressions. This may get much more complicated if the target strings also appear in comments, etc. If that is not the case, simply using the flip-flop operator .. should do it:

$ perl -ne 'print unless /<!DOCTYPE html>/i .. /<body>/i' file.html
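For use inside a script rather than on the command line, the same flip-flop idea might be sketched as below. The helper name lines_after_head is hypothetical, and the sketch assumes the input contains one complete <!DOCTYPE html> .. <body> range.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return only the lines that fall outside the
# <!DOCTYPE html> .. <body> range.
sub lines_after_head {
    my @lines = @_;
    my @kept;
    for my $line (@lines) {
        # In scalar context, .. becomes true on the line matching the left
        # pattern and stays true through the line matching the right one.
        next if $line =~ /<!DOCTYPE html>/i .. $line =~ /<body>/i;
        push @kept, $line;
    }
    return @kept;
}

# Usage: perl after_head.pl file.html > clean.html
if (@ARGV) {
    open(my $in, "<", $ARGV[0]) or die "Can't open $ARGV[0]: $!";
    print lines_after_head(<$in>);
    close($in);
}
```

Like the one-liner, this drops whole lines, so anything that follows <body> on the same line is discarded as well.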

answered Mar 3, 2016 at 14:20

Sinan Ünür


Community · Accepted Answer · 2017-05-23 12:23:44Z

0

It is usually considered bad practice to work on HTML with regular expressions. Nevertheless, you could come up with:

(?s)<!DOCTYPE html>.+?<body>(?-s)
# switches on single line mode (aka dot matches all)
# takes <!DOCTYPE>
# everything afterwards lazily (.+?)
# including the body tag
# switches single line mode off again

See a demo on regex101.com. It won't work as expected when there's a body tag somewhere in between (including comments, that is).
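To make the caveat concrete, here is a small sketch that applies the pattern to strings holding a whole document. The helper name strip_with_inline_flag and the sample strings are made up for illustration, and the trailing (?-s) is omitted because nothing follows it in the pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from above, compiled once. (?s) makes . match newlines.
my $head_re = qr/(?s)<!DOCTYPE html>.+?<body>/i;

# Hypothetical helper: apply the pattern to a string holding the whole document.
sub strip_with_inline_flag {
    my ($html) = @_;
    $html =~ s/$head_re//;
    return $html;
}

# Normal case: everything up to and including <body> is removed.
my $plain = "<!DOCTYPE html>\n<title></title>\n<body>\nanything here\n";

# Caveat: the lazy .+? stops at the first <body>, even inside a comment,
# so the tail of the comment and the real <body> tag are left behind.
my $tricky = "<!DOCTYPE html>\n<!-- <body> -->\n<body>\nanything here\n";

if (@ARGV && $ARGV[0] eq "--demo") {
    print strip_with_inline_flag($plain);
    print "----\n";
    print strip_with_inline_flag($tricky);
}
```

Running the script with the (made-up) --demo argument prints both results, so the leftover from the second input can be inspected.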

edited May 23, 2017 at 12:23

CommunityBot


answered Mar 3, 2016 at 12:59

Jan


1 Comment

LaboDJ Over a year ago

It seems it's not working... I updated my question with the Perl script and its output.
