I need help extracting the "BODY" part from a string in the following two cases:

Case 1:

Var1 = 
Content-Type: text/plain; charset="UTF-8"

BODY 

--000000000000ddc1610580816add


Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY56 text/html

--000000000000ddc1610580816add-

Case 2:

Var1=
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY

--000000000000ddc1610580816add--



Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY56 text/html

--000000000000ddc1610580816add-

I want to do:

If Var1 contains only the header:

Content-Type: text/plain; charset="UTF-8"

then extract the text between that header and --000000000000ddc1610580816add.

Otherwise, if Var1 contains both headers:

Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

then extract the text between those two headers and --000000000000ddc1610580816add--.

Here is my code, which does not work yet. Can someone help me fix it?

if (index($body, "Content-Type: text\/plain; charset=\"UTF-8\"\n
Content-Transfer-Encoding: quoted-printable") != -1) {
    $body =~ /Content-Type: text\/plain; charset="UTF-8"\n
Content-Transfer-Encoding: quoted-printable(.*?)--00.*/s;
    $body = $1;
}
elsif (index($body, "Content-Type: text\/plain; charset=\"UTF-8\"") != -1) {
    $body =~ /Content-Type: text\/plain; charset="UTF-8"(.*?)--00.*/s;
    $body = $1;
}
  • Have you tried anything yet yourself? It looks like you are trying to parse a MIME encoded multi-part document. There are modules for this on CPAN. Doing it yourself is a bit mad. Commented Feb 1, 2019 at 10:50
  • I have tried several times, and I keep trying; I'm not strong in regex. For now I need to use a regex because I've already used Mail::IMAPClient (function: bodypart_string), and with a regex I can get the expected results. Only this part, which requires a regex, remains. Commented Feb 1, 2019 at 11:01
  • An IMAP Client is something else. You want to parse email bodies, not download email. Commented Feb 1, 2019 at 13:47
  • I recommend Email::MIME. Commented Feb 1, 2019 at 21:36
  • Thank you Simbabque and Grinnz, Email::MIME is a good idea. I managed to extract only the text/plain part from the body by using Email::MIME. Commented Feb 4, 2019 at 15:25
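
For readers who want to follow the Email::MIME suggestion from the comments, here is a minimal sketch. It assumes Email::MIME is installed from CPAN and that the input is a complete message whose top-level header declares the multipart boundary. The sample message and the helper name plain_text_bodies are made up for illustration and are not taken from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Email::MIME;    # CPAN module, not part of core Perl

# Hypothetical sample message; a real one would come from the mail source.
my $boundary = '000000000000ddc1610580816add';
my $message  = join "\n",
    'MIME-Version: 1.0',
    qq{Content-Type: multipart/alternative; boundary="$boundary"},
    '',
    "--$boundary",
    'Content-Type: text/plain; charset="UTF-8"',
    '',
    'BODY',
    '',
    "--$boundary",
    'Content-Type: text/html; charset="UTF-8"',
    'Content-Transfer-Encoding: quoted-printable',
    '',
    'BODY56 text/html',
    '',
    "--$boundary--",
    '';

# Hypothetical helper: collect the decoded bodies of all text/plain leaf parts.
sub plain_text_bodies {
    my ($raw) = @_;
    my @bodies;
    Email::MIME->new($raw)->walk_parts(sub {
        my ($part) = @_;
        return if $part->subparts;    # skip multipart containers
        push @bodies, $part->body_str
            if $part->content_type =~ m{^text/plain\b}i;
    });
    return @bodies;
}

my @plain = plain_text_bodies($message);
print "plain text BODY: '$_'\n" for @plain;
```

Because the module decodes the transfer encoding and charset itself, the presence or absence of a Content-Transfer-Encoding header needs no special handling.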

1 Answer

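One solution: use the /m and /s regex modifiers together (/ms), see perlre. With /m, ^ matches at the start of every line, and with /s, . also matches newlines.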

#!/usr/bin/perl
use strict;
use warnings;

my $regex = qr/\AContent-Type: [^\n]+\n(?:^Content-Transfer-Encoding: [^\n]+\n)?(.+)^--.+\Z/ms;
my $body;

my $input = <<'END_OF_STRING';
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

INPUT 1 BODY

--000000000000ddc1610580816add--
END_OF_STRING

($body) = ($input =~ $regex)
    or die "mismatch in INPUT 1!\n";
print "INPUT 1 '${body}'\n";

$input = <<'END_OF_STRING';
Content-Type: text/plain; charset="UTF-8"

INPUT 2 BODY

--000000000000ddc1610580816add--
END_OF_STRING

($body) = ($input =~ $regex)
    or die "mismatch in INPUT 2!\n";
print "INPUT 2 '${body}'\n";

exit 0;

Test run:

$ perl dummy.pl
INPUT 1 '
INPUT 1 BODY

'
INPUT 2 '
INPUT 2 BODY

'
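
The regex above depends on both modifiers: /m lets ^ match at the start of the boundary line, and /s lets .+ run across line breaks. The following sketch shows each effect in isolation; the sample string is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "first line\n--boundary\n";

# Without /m, ^ matches only at the very start of the string.
my $without_m = ($text =~ /^--/)  ? 1 : 0;
# With /m, ^ also matches right after each embedded newline.
my $with_m    = ($text =~ /^--/m) ? 1 : 0;

# Without /s, . does not match a newline, so .+ stops at the end of line one.
my ($without_s) = $text =~ /\A(.+)/;
# With /s, . matches newlines too, so .+ can span several lines.
my ($with_s)    = $text =~ /\A(.+)/s;

print "without /m: $without_m, with /m: $with_m\n";
print "without /s: '$without_s'\n";
print "with /s: '$with_s'\n";
```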

UPDATE: with the new input string provided by OP:

#!/usr/bin/perl
use strict;
use warnings;

# multipart MIME content as single string
my $input = <<'END_OF_STRING';
--0000000000007bcdff05808169f5
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY text/plain

--0000000000007bcdff05808169f5
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY text/html

--0000000000007bcdff05808169f5
END_OF_STRING

# split into multiple parts at the separator
foreach my $part (split(/^--[^\n]+\n/ms, $input)) {
    # skip empty parts
    next if $part =~ /\A\s*\Z/m;

    # split header and body
    my($header, $body) = split("\n\n", $part, 2);

    # Only match parts with text/plain content
    # "Content-Type" must be matched case-insensitive
    if ($header =~ m{^(?i)Content-Type(?-i):\s+text/plain[;\s]}ms) {
        print "plain text BODY: '${body}'\n";
    }
}

exit 0;

Test output:

$ perl dummy.pl
plain text BODY: 'BODY text/plain

'
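
If the blank lines around the extracted body are unwanted, the same split-based approach can be wrapped in a helper that trims them. This is only a sketch: the helper name plain_text_parts is hypothetical, it assumes that no body line starts with "--", and it does not decode quoted-printable content.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the trimmed bodies of all text/plain parts.
sub plain_text_parts {
    my ($input) = @_;
    my @bodies;

    # Split at boundary lines, i.e. lines that start with "--"
    foreach my $part (split /^--[^\n]*\n/m, $input) {
        next if $part !~ /\S/;    # skip empty parts

        # The first blank line separates the headers from the body
        my ($header, $body) = split /\n\n/, $part, 2;
        next unless defined $body;
        next unless $header =~ m{^Content-Type:\s*text/plain\b}mi;

        $body =~ s/\A\s+|\s+\z//g;    # trim leading and trailing whitespace
        push @bodies, $body;
    }
    return @bodies;
}

my $input = <<'END_OF_STRING';
--0000000000007bcdff05808169f5
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY text/plain

--0000000000007bcdff05808169f5
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

BODY text/html

--0000000000007bcdff05808169f5--
END_OF_STRING

my @plain = plain_text_parts($input);
print "plain text BODY: '$_'\n" for @plain;
```

Returning a list rather than printing inside the loop lets the caller decide what to do when a message has no text/plain part or more than one.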
3 Comments

I want to extract only the body (without --000000...), and the extraction has to work in both cases. The header of Var1 can be either Content-Type: text/plain; charset="UTF-8" on its own, or Content-Type: text/plain; charset="UTF-8" followed by Content-Transfer-Encoding: quoted-printable.
I'm not quite sure I understand your comment. My answer shows the desired output shown in your question. I have updated my answer to include both input texts, maybe it becomes more clear now?
@Red_Developper updated with your latest input string. That makes things much simpler, BTW.
