substring with perl (Regex?)

Question

I need help extracting the "BODY" part from a string in the following two cases:

Case 1:

Var1 = 
Content-Type: text/plain; charset="UTF-8"

BODY 

--000000000000ddc1610580816add


Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY56 text/html

--000000000000ddc1610580816add-

Case 2:

Var1=
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY

--000000000000ddc1610580816add--



Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY56 text/html

--000000000000ddc1610580816add-

What I want to do:

If Var1 contains Content-Type: text/plain; charset="UTF-8", extract the text between Content-Type: text/plain; charset="UTF-8" and --000000000000ddc1610580816add.

Otherwise, if Var1 contains:

Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

then extract the text between:

Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

and --000000000000ddc1610580816add--.

My code, which does not work yet and which I need help fixing:

if (index($body, "Content-Type: text\/plain; charset=\"UTF-8\"\n
Content-Transfer-Encoding: quoted-printable") != -1) {
    $body =~ /Content-Type: text\/plain; charset="UTF-8"\n
Content-Transfer-Encoding: quoted-printable(.*?)--00.*/s;
    $body = $1;
}
elsif (index($body, "Content-Type: text\/plain; charset=\"UTF-8\"") != -1) {
    $body =~ /Content-Type: text\/plain; charset="UTF-8"(.*?)--00.*/s;
    $body = $1;
}

Have you tried anything yet yourself? It looks like you are trying to parse a MIME-encoded multipart document. There are modules for this on CPAN. Doing it yourself is a bit mad.
– simbabque, Commented Feb 1, 2019 at 10:50
I have tried several times, and I always do. I'm not strong in regex. For now I need to use a regex because I already use Mail::IMAPClient (function: bodypart_string), and with a regex I can get the expected results; this is the only remaining part that requires a regex.
– Red_Developper, Commented Feb 1, 2019 at 11:01
An IMAP client is something else. You want to parse email bodies, not download email.
– simbabque, Commented Feb 1, 2019 at 13:47
Thank you simbabque and Grinnz. Email::MIME is a good idea: I managed to extract only the text/plain part of the body by using Email::MIME.
– Red_Developper, Commented Feb 4, 2019 at 15:25
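
For readers who want to try the Email::MIME route mentioned in the comments, here is a minimal sketch. It is not the asker's actual code. It assumes Email::MIME is installed from CPAN and that the full message, including the top-level multipart Content-Type header with its boundary, is available; the sample message below is made up from the question's data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Email::MIME;    # CPAN module, not part of core Perl

# Assumption: a complete message with its top-level headers.
# The boundary value is copied from the question.
my $message = <<'END_OF_MESSAGE';
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="000000000000ddc1610580816add"

--000000000000ddc1610580816add
Content-Type: text/plain; charset="UTF-8"

BODY

--000000000000ddc1610580816add
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY56 text/html

--000000000000ddc1610580816add--
END_OF_MESSAGE

my @plain;
Email::MIME->new($message)->walk_parts(sub {
    my ($part) = @_;
    return if $part->subparts;    # skip multipart containers
    # body_str returns the decoded body as a character string
    push @plain, $part->body_str
        if $part->content_type =~ m{^text/plain\b}i;
});

print "plain text BODY: '$_'\n" for @plain;
```

The module handles boundaries, header case and transfer encodings itself, which is why the commenters prefer it to a hand-written regex.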

Stefan Becker · Accepted Answer · 2019-02-01 16:41:15Z

1

One solution: use the /m and /s modifiers (see perlre).

#!/usr/bin/perl
use strict;
use warnings;

my $regex = qr/\AContent-Type: [^\n]+\n(?:^Content-Transfer-Encoding: [^\n]+\n)?(.+)^--.+\Z/ms;
my $body;

my $input = <<'END_OF_STRING';
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

INPUT 1 BODY

--000000000000ddc1610580816add--
END_OF_STRING

($body) = ($input =~ $regex)
    or die "mismatch in INPUT 1!\n";
print "INPUT 1 '${body}'\n";

$input = <<'END_OF_STRING';
Content-Type: text/plain; charset="UTF-8"

INPUT 2 BODY

--000000000000ddc1610580816add--
END_OF_STRING

($body) = ($input =~ $regex)
    or die "mismatch in INPUT 2!\n";
print "INPUT 2 '${body}'\n";

exit 0;

Test run:

$ perl dummy.pl
INPUT 1 '
INPUT 1 BODY

'
INPUT 2 '
INPUT 2 BODY

'
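
The greedy (.+) above captures everything up to the last line starting with "--", so with the two-part strings from the question it would also swallow the text/html part. A possible variant, sketched under the assumption that only the first text/plain part is wanted, uses a non-greedy capture and trims surrounding whitespace. The helper name first_plain_body is hypothetical. Here /m lets ^ and $ match at line boundaries, /s lets . match newlines, and /x permits the inline comments.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer): returns the body of the
# first text/plain part, without the boundary line and without trailing
# whitespace, or nothing if the pattern does not match.
sub first_plain_body {
    my ($text) = @_;
    my ($body) = $text =~ m{
        ^Content-Type:\s*text/plain[^\n]*\n       # the text/plain header line
        (?:^Content-Transfer-Encoding:[^\n]*\n)?  # optional second header line
        \n                                        # blank line ending the headers
        (.*?)                                     # body, as short as possible
        ^--[^\n]+$                                # first boundary line after it
    }msxi;
    return unless defined $body;
    $body =~ s/\s+\z//;                           # drop trailing blank lines
    return $body;
}

# Case 1 from the question: two parts in one string
my $case1 = <<'END_OF_STRING';
Content-Type: text/plain; charset="UTF-8"

BODY

--000000000000ddc1610580816add

Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY56 text/html

--000000000000ddc1610580816add--
END_OF_STRING

my $body = first_plain_body($case1);
print "Case 1 '", (defined $body ? $body : 'no match'), "'\n";
```

Like the regex in the answer, this treats any line starting with "--" as a boundary, so a body line that itself begins with "--" would cut the capture short.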

UPDATE: with the new input string provided by OP:

#!/usr/bin/perl
use strict;
use warnings;

# multipart MIME content as single string
my $input = <<'END_OF_STRING';
--0000000000007bcdff05808169f5
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY text/plain

--0000000000007bcdff05808169f5
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY text/html

--0000000000007bcdff05808169f5
END_OF_STRING

# split into multiple parts at the separator
foreach my $part (split(/^--[^\n]+\n/ms, $input)) {
    # skip empty parts
    next if $part =~ /\A\s*\Z/m;

    # split header and body
    my($header, $body) = split("\n\n", $part, 2);

    # Only match parts with text/plain content
    # "Content-Type" must be matched case-insensitive
    if ($header =~ m{^(?i)Content-Type(?-i):\s+text/plain[;\s]}ms) {
        print "plain text BODY: '${body}'\n";
    }
}

exit 0;

Test output:

$ perl dummy.pl
plain text BODY: 'BODY text/plain

'
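
The same split-based idea can be wrapped in a reusable function that returns the bodies instead of printing them. This is only a sketch: the name plain_text_bodies is hypothetical, the trailing blank lines are trimmed (the answer keeps them), and quoted-printable content is returned undecoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the bodies of all text/plain parts.
sub plain_text_bodies {
    my ($input) = @_;
    my @bodies;

    # split at every line starting with "--", as in the answer
    for my $part (split /^--[^\n]*\n?/m, $input) {
        next unless $part =~ /\S/;    # skip empty parts
        $part =~ s/\A\s+//;           # tolerate blank lines before the headers

        # headers and body are separated by the first blank line
        my ($header, $body) = split /\n\n/, $part, 2;
        next unless defined $body;

        # header names are case-insensitive
        next unless $header =~ m{^Content-Type:\s*text/plain(?:[;\s]|\z)}mi;

        $body =~ s/\s+\z//;           # drop trailing blank lines
        push @bodies, $body;
    }
    return @bodies;
}

my $input = <<'END_OF_STRING';
--0000000000007bcdff05808169f5
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY text/plain

--0000000000007bcdff05808169f5
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY text/html

--0000000000007bcdff05808169f5
END_OF_STRING

print "plain text BODY: '$_'\n" for plain_text_bodies($input);
```

Returning a list keeps the function usable for messages with zero or several text/plain parts; a caller that expects exactly one can take the first element.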

edited Feb 1, 2019 at 16:41

answered Feb 1, 2019 at 11:17

Stefan Becker


3 Comments

Red_Developper Over a year ago

I want to extract only the body (without --000000...). I want the body extracted in both cases: the header of Var1 can be either Content-Type: text/plain; charset="UTF-8" or Content-Type: text/plain; charset="UTF-8" followed by Content-Transfer-Encoding: quoted-printable.

Stefan Becker Over a year ago

I'm not quite sure I understand your comment. My answer shows the desired output shown in your question. I have updated my answer to include both input texts, maybe it becomes more clear now?

Stefan Becker Over a year ago

@Red_Developper updated with your latest input string. That makes things much simpler, BTW.
