How to split the entire string into array in Perl

Question

I'm trying to process an entire string but the way my code is written, part of it is not being processed. Here's a representation of my code:

#!/usr/bin/perl
my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
              VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
              CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
              LRDVVVGRHPLHLLEDAVTKPELRPCPTP";

$string =~ s/\s+//g;     # remove white space from string
# split the string into fragments of 58 characters and store in array
my @array = $string =~ /[A-Z]{58}/g;   
my $len = scalar @array;
print $len . "\n";    # this prints 3
# print the fragments
print $array[0] . "\n";
print $array[1] . "\n";
print $array[2] . "\n";
print $array[3] . "\n";

The code outputs the following:

3
MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEANVVLTGTVEEILNVD
PVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLICDNQVSTGDTRIFF
VNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTHLRDVVVGRHPLHLL
<blank space>

Notice that the rest of the string EDAVTKPELRPCPTP is not stored in @array. When I'm creating my array, how do I store EDAVTKPELRPCPTP? Perhaps I could store it in $array[3]?

Please don't name your variables something like @array. The @ says that it's an array; the letters are supposed to convey something useful about the purpose of its contents.
– Borodin, Commented Oct 28, 2015 at 22:19

Schwern · Accepted Answer · 2015-10-28 21:03:58Z

5

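You've almost got it. The quantifier {58} only matches runs of exactly 58 characters, so the final 15-character remainder never matches. Change your regex to allow for 1 to 58 characters; the quantifier is greedy, so it still takes 58 characters whenever they are available.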

my @array = $string =~ /[A-Z]{1,58}/g;

In addition, you have an error in your script using @prot_seq instead of @array. You should always use strict to protect yourself against this sort of thing. Here's the script with strict, warnings, and 5.10 features (to get say).

#!/usr/bin/perl

use strict;
use warnings;
use v5.10;

my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
              VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
              CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
              LRDVVVGRHPLHLLEDAVTKPELRPCPTP";

# Strip whitespace.
$string =~ s/\s+//g;

# Split the string into fragments of 58 characters or fewer
my @fragments = $string =~ /[A-Z]{1,58}/g;

say "Num fragments: ".scalar @fragments;
say join "\n", @fragments;
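
As a minimal sketch of the same idea with the width as a parameter: the helper name split_fragments and the short test string are hypothetical, chosen only to make the result easy to check by eye. It works on a copy, so the caller's string is left alone.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): split $text into runs of
# at most $width uppercase ASCII letters, after removing whitespace.
sub split_fragments {
    my ($text, $width) = @_;
    (my $clean = $text) =~ s/\s+//g;      # strip whitespace from a copy
    return $clean =~ /[A-Z]{1,$width}/g;  # greedy: full-width runs first
}

my @fragments = split_fragments("ABCDE FGHIJ\nKL", 5);
print scalar(@fragments), " fragments\n";
print length($_), "\t$_\n" for @fragments;
```

With a width of 5 this should give ABCDE, FGHIJ and the shorter remainder KL. Note that the pattern silently skips anything that is not A-Z, which may or may not be what you want for real data.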

edited Oct 28, 2015 at 21:03

answered Oct 28, 2015 at 20:56

Axeman · Answer · 2015-10-28 21:07:08Z

2

What you're missing is the ability to capture fewer than 58 characters. And since you only want to do that at the end of the string, you can do this:

/[A-Z]{58}|[A-Z]{1,57}\z/

Which I would prefer to write like this:

/\p{Upper}{58}|\p{Upper}{1,57}\z/

However, since quantifiers are greedy by default, a single {1,58} quantifier will prefer to gather 58 characters and only settle for fewer when it runs out of matching input. So this simpler pattern does the same job:

/\p{Upper}{1,58}/

Or, for the reasons Schwern mentions (such as avoiding non-ASCII letters):

/[A-Z]{1,58}/
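
On input that is purely A-Z the anchored alternation and the plain {1,58} form should give the same fragments. They differ when something else is mixed in. A small sketch, using hypothetical helper names and a width of 2 so the difference is visible on a short string (the width is assumed to be at least 2):

```perl
use strict;
use warnings;

# Hypothetical helpers for comparison; $n is the chunk width (assumed >= 2).
sub chunks_anchored {
    my ($text, $n) = @_;
    my $m = $n - 1;
    # Full-width runs, or a shorter run only at the very end of the string.
    return $text =~ /[A-Z]{$n}|[A-Z]{1,$m}\z/g;
}

sub chunks_greedy {
    my ($text, $n) = @_;
    # Runs of 1..$n letters anywhere; greedy, so full width where possible.
    return $text =~ /[A-Z]{1,$n}/g;
}

for my $input ('ABCDE', 'ABC1DEF') {
    print "$input\n";
    print "  anchored: @{[ chunks_anchored($input, 2) ]}\n";
    print "  greedy:   @{[ chunks_greedy($input, 2) ]}\n";
}
```

For 'ABC1DEF' the anchored form drops the lone C before the digit, because a short run is only accepted at \z, while the greedy form keeps it as its own fragment.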

edited Oct 28, 2015 at 21:07

answered Oct 28, 2015 at 20:56

5 Comments

Schwern Over a year ago

This is a case where I would recommend against using POSIX character classes. They're great when you're parsing language and want to make sure you're correctly internationalized. However, the encoding is likely specifically the ASCII characters A to Z. You don't want to pick up things like Ñ or É.

Borodin Over a year ago

@Schwern: Those aren't POSIX character classes, they're Unicode properties

Schwern Over a year ago

@Borodin Technically correct! Don't use either of them here.

Borodin Over a year ago

@Schwern: Although I'm not certain of the difference between \p{Upper} and \p{Lu}. And it's worth noting that Perl 5.14 and better has the /a modifier, so /\p{Lu}/a is the same as /[A-Z]/; I'm not sure which I prefer.

Schwern Over a year ago

@Borodin I would very much prefer seeing the very straightforward [A-Z] vs having to go look up what \p{Lu} and /a do and requiring 5.14. /a is nice to know about though.
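
The classes discussed in these comments are easy to compare directly. The sketch below is an illustration, not part of the thread: the test string and the count_runs helper are made up for the purpose. perlre describes /a as restricting \d, \s, \w and the POSIX classes to the ASCII range, so it is worth checking how \p{Lu} behaves under /a on your own Perl rather than assuming.

```perl
use v5.14;      # needed for the /a modifier
use warnings;

# Test string: ASCII "AB", then U+0141 (LATIN CAPITAL LETTER L WITH STROKE),
# then ASCII "CD". Built with chr() so the source file stays pure ASCII.
my $text = 'AB' . chr(0x0141) . 'CD';

# Hypothetical helper: how many separate runs does the pattern find?
# 1 means the non-ASCII letter was matched too; 2 means it split the runs.
sub count_runs {
    my ($re, $s) = @_;
    my @runs = $s =~ /$re/g;
    return scalar @runs;
}

my @patterns = (
    [ '[A-Z]+'          => qr/[A-Z]+/ ],
    [ '\p{Upper}+'      => qr/\p{Upper}+/ ],
    [ '[[:upper:]]+ /a' => qr/[[:upper:]]+/a ],
    [ '\p{Lu}+ /a'      => qr/\p{Lu}+/a ],
);

printf "%-18s %d run(s)\n", $_->[0], count_runs($_->[1], $text) for @patterns;
```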

Borodin · Answer · 2015-10-28 22:45:08Z

2

You may prefer to use unpack, like this:

$string =~ s/\s+//g;    
my @fragments = unpack '(A58)*', $string;

Or, if you would rather leave $string unchanged and have Perl v5.14 or later, you can use the non-destructive /r modifier on the substitution:

my @fragments = unpack '(A58)*', $string =~ s/\s+//gr;
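
In the template, A58 reads up to 58 characters and the (...)* group repeats that until the data runs out, so the last field is simply shorter. A sketch with the width as a parameter follows; the wrapper name unpack_fragments is hypothetical, and it copies the string so it also works on Perls older than 5.14. Note that A strips trailing whitespace and nulls from each field, which is harmless here because whitespace is removed first.

```perl
use strict;
use warnings;

# Hypothetical wrapper: $width is a parameter instead of the literal 58.
sub unpack_fragments {
    my ($text, $width) = @_;
    (my $clean = $text) =~ s/\s+//g;      # copy, so the caller's string is untouched
    return unpack "(A$width)*", $clean;   # repeat the A<width> group to the end
}

my $seq = "ABCDEFG\nHIJ";
print join('|', unpack_fragments($seq, 4)), "\n";
```

With a width of 4 this should print ABCD|EFGH|IJ, and $seq still contains its newline afterwards.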

edited Oct 28, 2015 at 22:45

answered Oct 28, 2015 at 22:37

Matt Jacob · Answer · 2015-10-28 21:41:38Z

1

If you don't actually need regex character classes, this is how I'd do it:

use strict;
use warnings;
use Data::Dump;

my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
              VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
              CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
              LRDVVVGRHPLHLLEDAVTKPELRPCPTP";

$string =~ s/\s+//g;

my @chunks;

while (length($string)) {
    push(@chunks, substr($string, 0, 58, ''));
}

dd($string, \@chunks);

Output:

(
  "",
  [
    "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEANVVLTGTVEEILNVD",
    "PVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLICDNQVSTGDTRIFF",
    "VNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTHLRDVVVGRHPLHLL",
    "EDAVTKPELRPCPTP",
  ],
)

answered Oct 28, 2015 at 21:41

2 Comments

Schwern Over a year ago

This answer assumes the data is entirely made of A-Z. It's also destructive.

Matt Jacob Over a year ago

Both of which are valid assumptions, I think, given the sample input and the fact that the original question was already modifying $string.
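
If the destructive behavior is a concern, one possible variation is to walk the string by offset with the three-argument form of substr instead of consuming it. This is a sketch, not the answer's code: the helper name substr_chunks is hypothetical, and $width is assumed to be a positive integer.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the string by offset instead of consuming it.
# Assumes $width > 0; otherwise the loop would never advance.
sub substr_chunks {
    my ($text, $width) = @_;
    my @chunks;
    for (my $pos = 0; $pos < length($text); $pos += $width) {
        push @chunks, substr($text, $pos, $width);   # last chunk may be shorter
    }
    return @chunks;
}

my $seq = 'ABCDEFGHIJ';
my @chunks = substr_chunks($seq, 4);
print "$seq -> @chunks\n";
```

Like the original, this makes no check on which characters are present; it only cuts by position.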
