0

I'm trying to process an entire string but the way my code is written, part of it is not being processed. Here's a representation of my code:

#!/usr/bin/perl
my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
              VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
              CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
              LRDVVVGRHPLHLLEDAVTKPELRPCPTP";

$string =~ s/\s+//g;     # remove white space from string
# split the string into fragments of 58 characters and store in array
my @array = $string =~ /[A-Z]{58}/g;   
my $len = scalar @array;
print $len . "\n";    # this prints 3
# print the fragments
print $array[0] . "\n";
print $array[1] . "\n";
print $array[2] . "\n";
print $array[3] . "\n";

The code outputs the following:

3
MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEANVVLTGTVEEILNVD
PVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLICDNQVSTGDTRIFF
VNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTHLRDVVVGRHPLHLL
<blank space> 

Notice that the rest of the string EDAVTKPELRPCPTP is not stored in @array. When I'm creating my array, how do I store EDAVTKPELRPCPTP? Perhaps I could store it in $array[3]?

1
  • Please don't name your variables something like @array. The @ says that it's an array; the letters are supposed to convey something useful about the purpose of its contents. (Commented Oct 28, 2015 at 22:19)

4 Answers

5

You've almost got it. You need to change your regex to allow for 1 to 58 characters.

my @array = $string =~ /[A-Z]{1,58}/g;

In addition, you have an error in your script using @prot_seq instead of @array. You should always use strict to protect yourself against this sort of thing. Here's the script with strict, warnings, and 5.10 features (to get say).

#!/usr/bin/perl

use strict;
use warnings;
use v5.10;

my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
              VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
              CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
              LRDVVVGRHPLHLLEDAVTKPELRPCPTP";

# Strip whitespace.
$string =~ s/\s+//g;

# Split the string into fragments of at most 58 characters
my @fragments = $string =~ /[A-Z]{1,58}/g;

say "Num fragments: ".scalar @fragments;
say join "\n", @fragments;
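
To see why the original /[A-Z]{58}/ pattern lost the tail, it can help to check the arithmetic. After whitespace is stripped the sequence is 189 characters long, which is three full blocks of 58 plus 15 left over, and {58} can never match those last 15. The sketch below is only an illustration of that reasoning; the names $sequence, $full_blocks, $leftover, @exact and @ranged are mine and do not appear in the question.

```perl
use strict;
use warnings;

my $sequence = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
              VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
              CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
              LRDVVVGRHPLHLLEDAVTKPELRPCPTP";
$sequence =~ s/\s+//g;

my $size        = 58;
my $full_blocks = int(length($sequence) / $size);   # blocks that {58} can match
my $leftover    = length($sequence) % $size;        # characters that {58} skips

my @exact  = $sequence =~ /[A-Z]{58}/g;      # fixed width only
my @ranged = $sequence =~ /[A-Z]{1,58}/g;    # fixed width, shorter at the end

print "length: ", length($sequence), "\n";
print "full blocks: $full_blocks, leftover: $leftover\n";
print "exact: ", scalar(@exact), ", ranged: ", scalar(@ranged), "\n";
print "last fragment: $ranged[-1]\n";
```

With {1,58} the final fragment is the 15-character tail, so it ends up in the fourth array element.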

2

What you're missing is the ability to capture fewer than 58 characters. And since you only want to do that at the end of the string, you can do this:

/[A-Z]{58}|[A-Z]{1,57}\z/

Which I would prefer to write like this:

/\p{Upper}{58}|\p{Upper}{1,57}\z/

However, quantifiers are greedy by default, so a single {1,58} quantifier will prefer to gather 58 characters and only settle for fewer when it runs out of matching input. That means the alternation isn't needed:

/\p{Upper}{1,58}/

Or, for the reasons Schwern mentions (such as avoiding non-ASCII letters):

/[A-Z]{1,58}/
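
As a rough check of the greedy-matching argument, the sketch below runs the alternation and the plain {1,58} pattern over a synthetic string (two full blocks plus a 15-letter tail, my own test data rather than the sequence from the question). It also shows one case where the two patterns are not interchangeable: when a non-letter interrupts the run, the alternation only accepts a short piece at the very end of the string.

```perl
use strict;
use warnings;

# Synthetic input: two full blocks of 58 letters plus a 15-letter tail.
my $input = ('A' x 58) . ('B' x 58) . ('C' x 15);

my @alternation = $input =~ /[A-Z]{58}|[A-Z]{1,57}\z/g;
my @greedy      = $input =~ /[A-Z]{1,58}/g;

print join(',', map { length($_) } @alternation), "\n";
print join(',', map { length($_) } @greedy), "\n";

# A non-letter in the middle makes the two patterns disagree:
# the alternation keeps only the piece anchored at \z.
my $broken        = 'ABC-DEF';
my @alt_broken    = $broken =~ /[A-Z]{58}|[A-Z]{1,57}\z/g;
my @greedy_broken = $broken =~ /[A-Z]{1,58}/g;

print "alternation: @alt_broken\n";
print "greedy:      @greedy_broken\n";
```

For input that is known to be one unbroken run of A-Z, as in the question, either pattern should give the same fragments.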

5 Comments

This is a case where I would recommend against using POSIX character classes. They're great when you're parsing language and want to make sure you're correctly internationalized. However, the encoding is likely specifically the ASCII characters A to Z. You don't want to pick up things like Ñ or É.
@Schwern: Those aren't POSIX character classes, they're Unicode properties
@Borodin Technically correct! Don't use either of them here.
@Schwern: Although I'm not certain of the difference between \p{Upper} and \p{Lu}. And it's worth noting that Perl 5.14 and better has the /a modifier, so /\p{Lu}/a is the same as /[A-Z]/; I'm not sure which I prefer
@Borodin I would very much prefer seeing the very straightforward [A-Z] vs having to go look up what \p{Lu} and /a do and requiring 5.14. /a is nice to know about though.

2

You may prefer to use unpack, like this:

$string =~ s/\s+//g;    
my @fragments = unpack '(A58)*', $string;

Or, if you would rather leave $string unchanged and you have Perl v5.14 or later, you can write:

my @fragments = unpack '(A58)*', $string =~ s/\s+//gr;
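
A small sketch of the unpack approach on shorter data may make the template easier to follow. The width of 10 and the sample text are my own illustrative choices, not from the answer. It also shows one property of the A format worth knowing: when unpacking, A strips trailing whitespace and NULs from each field, whereas a returns the field verbatim. For data that is only letters the two behave the same.

```perl
use strict;
use warnings;
use v5.14;    # needed for the /r (non-destructive) substitution flag

my $string = "ABCDEFGHIJ
              KLMNOPQRST
              UVW";

# /r returns the cleaned copy and leaves $string untouched.
my $clean = $string =~ s/\s+//gr;

my $size   = 10;    # illustrative width; the answer uses 58
my @pieces = unpack "(A$size)*", $clean;

say scalar(@pieces), " pieces: @pieces";

# 'A' strips trailing whitespace from each field, 'a' keeps it.
my @stripped = unpack '(A4)*', 'AB  CD  ';
my @verbatim = unpack '(a4)*', 'AB  CD  ';

say join('|', @stripped);
say join('|', @verbatim);
```

The last group in the template simply takes whatever is left, which is why the short tail is not lost.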

1

If you don't actually need regex character classes, this is how I'd do it:

use strict;
use warnings;
use Data::Dump;

my $string = "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEAN
              VVLTGTVEEILNVDPVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLI
              CDNQVSTGDTRIFFVNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTH
              LRDVVVGRHPLHLLEDAVTKPELRPCPTP";

$string =~ s/\s+//g;

my @chunks;

while (length($string)) {
    push(@chunks, substr($string, 0, 58, ''));
}

dd($string, \@chunks);

Output:

(
  "",
  [
    "MAGRSHPGPLRPLLPLLVVAACVLPGAGGTCPERALERREEEANVVLTGTVEEILNVD",
    "PVQHTYSCKVRVWRYLKGKDLVARESLLDGGNKVVISGFGDPLICDNQVSTGDTRIFF",
    "VNPAPPYLWPAHKNELMLNSSLMRITLRNLEEVEFCVEDKPGTHLRDVVVGRHPLHLL",
    "EDAVTKPELRPCPTP",
  ],
)

2 Comments

This answer assumes the data is entirely made of A-Z. It's also destructive.
Both of which are valid assumptions, I think, given the sample input and the fact that the original question was already modifying $string.
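
If the destructive behaviour mentioned in the comments is a concern, one possible variation is to walk the string by offset instead of consuming it with four-argument substr. This is only a sketch: chunk_string is a hypothetical helper name, and the width of 10 and sample text are illustrative.

```perl
use strict;
use warnings;

# chunk_string is a hypothetical helper, not part of the answer above.
# It returns consecutive pieces of at most $size characters and does
# not modify the string it is given.
sub chunk_string {
    my ($text, $size) = @_;
    die "size must be a positive integer\n" unless $size && $size > 0;
    my @chunks;
    for (my $pos = 0; $pos < length($text); $pos += $size) {
        # substr returns the shorter remainder when fewer than $size
        # characters are left.
        push @chunks, substr($text, $pos, $size);
    }
    return @chunks;
}

my $sequence = 'ABCDEFGHIJKLMNOPQRSTUVW';
my @chunks   = chunk_string($sequence, 10);

print "$_\n" for @chunks;
print "original still intact: $sequence\n";
```

Like the substr loop in the answer, this makes no assumption about which characters the string contains; it splits purely by position.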
