I just want to delete the block between

<!DOCTYPE html>

and

 <body>

including both ends, using a Perl regex.

Example text:

<!DOCTYPE html>


<meta charset="utf-8">
<meta name="generator" content="pandoc">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">
<title></title>
<style>code{white-space: pre;}</style>



<![endif]-->;

<body>
.
.
.
anything here

This is only a sample; my real file contains a long embedded JavaScript block.

I usually test my regexes on the regex101 website, and I came up with this one:

<\!DOCTYPE html>(\n.*)*<body>

and this one, which also allows spaces or tabs inside the body tag:

s/<\!DOCTYPE html>(\n.*)*<[ \t]*body[ \t]*>//gi;

It seems to work well on that website, but it doesn't work when I run it inside a Perl script.

Perl script (with @Jan's answer):

#!/usr/bin/perl
use strict;
use warnings;

my $dirtfile = $ARGV[0];
my $cleanfile = "clean.html";

open(IN, "<", $dirtfile) or die "Can't open $dirtfile: $!";
open(OUT, ">", $cleanfile) or die "Can't open $cleanfile: $!";

while (<IN>) {
  s/(?s)<!DOCTYPE html>.+?<body>(?-s)//gi;
  print(OUT);
}

OUTPUT:

the same as input
  • "but it doesn't work" <= we'll need way more information about that (Mar 3, 2016 at 12:58)
  • Ok... I'm going to add more information (Mar 3, 2016 at 12:58)
  • Use an HTML parser and extract everything between the body tags. (Mar 3, 2016 at 13:01)

3 Answers

I'm pretty sure the problem is that you're reading the file line by line: each pass through the loop applies the substitution to a single line, so a pattern that has to span several lines can never match. You'll either need to read the entire file into a string and apply the regex to that, or change your loop logic to skip every line from the opening tag through the closing one.

In general, you should avoid working on HTML with regexes. Use a DOM parser instead.
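
A minimal sketch of the first option, assuming the file fits comfortably in memory. The helper names strip_head and slurp are made up for illustration; the output name clean.html is taken from the question's script. Undefining $/ locally makes a single read return the whole file, and the /s modifier lets . match newlines:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove everything from <!DOCTYPE html> through the first <body> tag.
# /s lets . match newlines, so the pattern can span several lines.
sub strip_head {
    my ($html) = @_;
    $html =~ s/<!DOCTYPE html>.*?<\s*body\s*>//si;
    return $html;
}

# Read a whole file into one string (slurp mode) by locally undefining $/.
sub slurp {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Can't open $path: $!";
    local $/;
    my $content = <$in>;
    close($in);
    return $content;
}

# Only touch files when a file name is given on the command line.
if (@ARGV) {
    my $cleanfile = "clean.html";
    open(my $out, '>', $cleanfile) or die "Can't open $cleanfile: $!";
    print {$out} strip_head(slurp($ARGV[0]));
    close($out) or die "Can't close $cleanfile: $!";
}
```

Run it as perl script.pl dirty.html. Without /g, only the first DOCTYPE-to-body block is removed, which seems to be what the question wants.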

Since you are not really parsing HTML but just chopping off the leading part of the file, you may get away with using regular expressions. This can get much more complicated if the target strings also appear inside comments and the like, but if that is not the case, the flip-flop (range) operator .. should do it:

$ perl -ne 'print unless /<!DOCTYPE html>/i .. /<body>/i' file.html
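
The same idea can be written as a script instead of a one-liner. This is only a sketch: drop_head_lines is a hypothetical helper name, and it collects lines in memory for simplicity rather than streaming them as the one-liner does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return only the lines that fall outside the <!DOCTYPE html> .. <body> range.
sub drop_head_lines {
    my @lines = @_;
    my @kept;
    for (@lines) {
        # In scalar context .. is the flip-flop operator: it is true from the
        # line matching the first pattern through the line matching the
        # second, inclusive.
        next if /<!DOCTYPE html>/i .. /<body>/i;
        push @kept, $_;
    }
    return @kept;
}

# Only read input when a file name is given on the command line.
print drop_head_lines(<>) if @ARGV;
```

Note that this works on whole lines, so anything that follows <body> on the same line is dropped too.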

It is usually considered bad practice to work on HTML with regular expressions; nevertheless, you could come up with:

(?s)<!DOCTYPE html>.+?<body>(?-s)
# (?s) switches on single-line mode (aka dot matches all)
# <!DOCTYPE html> matches the doctype literally
# .+? takes everything afterwards lazily
# <body> matches up to and including the body tag
# (?-s) switches single-line mode off again

See a demo on regex101.com. It won't work as expected when there's a body tag somewhere in between (including comments, that is).
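
To illustrate that caveat, here is a small sketch that applies the pattern to a whole document held in one string. The input is made up for the demonstration, and strip_head is a hypothetical helper name:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Apply the pattern above to a whole document held in a single string.
sub strip_head {
    my ($html) = @_;
    $html =~ s/(?s)<!DOCTYPE html>.+?<body>(?-s)//i;
    return $html;
}

# Made-up input: "<body>" is mentioned inside a comment before the real tag.
my $tricky = "<!DOCTYPE html>\n<!-- <body> goes below -->\n<body>\nanything here\n";

# The lazy .+? stops at the first "<body>" it finds, so the tail of the
# comment and the real <body> tag are left behind in the result.
print strip_head($tricky);
```

The match ends inside the comment, which is exactly the situation where an HTML parser would be the safer choice.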

1 Comment

It seems it's not working... I updated my question with the Perl script and its output.
