5

try to extract for substrings that match pattern in string. for example i have text like the one below

[ Pierre/NNP Vinken/NNP ]
,/, 
[ 61/CD years/NNS ]
old/JJ ,/, will/MD join/VB 
[ the/DT board/NN ]
as/IN 
[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ]
./. 
[ Mr./NNP Vinken/NNP ]
is/VBZ 
[ chairman/NN ]
of/IN 

and i want to extract whatever before slash (/) and whatever after slash, but somehow my regex extracts the first substring and ignore the rest of substrings in the line.

my output is something like this below :

tag:Pierre/NNP Vinken - word:Pierre/NNP Vinken/NNP ->1
tag:, - word:,/, ->1
tag:61/CD years - word:61/CD years/NNS ->1
tag:old/JJ ,/, will/MD join - word:old/JJ ,/, will/MD join/VB ->1
tag:the/DT board - word:the/DT board/NN ->1
tag:as - word:as/IN ->1
tag:a/DT nonexecutive/JJ director/NN Nov./NNP 29 - word:a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ->1
tag:. - word:./. ->1
tag:Mr./NNP Vinken - word:Mr./NNP Vinken/NNP ->1
tag:is - word:is/VBZ ->1
tag:chairman - word:chairman/NN ->1
tag:of - word:of/IN ->1

but what i am actually want is something like this below

tag:NNP  - word:Pierre ->1
tag:NNP  - word:Vinken ->1
tag:,    - word:,      ->1
tag:CD   - word:61     ->1
.
.
etc.

code i used :

    while (my $line = <$fh>) {
        chomp $line;
        #remove square brackets
        $line=~s/[\[\]]//;

        while($line =~m/((\s*(.*))\/((.*)\s+))/gi)
        {
            $word=$1;
            $tag=$2;
            #remove whitespace from left and right of string
            $word=~ s/^\s+|\s+$//g;
            $tag=~ s/^\s+|\s+$//g;
            $tags{$tag}++;
            $tagHash{$tag}{$word}++;
        }

    }
foreach my $str (sort keys %tagHash)
{
    foreach my $s (keys %{$tagHash{$str}} )
    {
        print "tags:$str - word: $s-> $tagHash{$str}{$s}\n";
    }
}

any idea why my regex does not behave as should be

EDIT:

in text files that i am parsing has wild character and punctuation as well, which is mean that files will have something like this : ''/'' "/" ,/, ./. ?/? !/! . . . etc

so i want to capture all of these things not only alphabetic and numeric characters.

6
  • 1
    that's because of the greedy nature of .* why don't you do a split on / Commented Mar 11, 2017 at 22:28
  • the whole reason because when you hit something like this [./.] where whatever on the left side of the slash is word and whatever on the right side is tag for this word and in this case both will be the same and that could be confusing if i store them to array, but i think you are right, because in this case odd positions in array will represent tags and even position will represent word. Commented Mar 11, 2017 at 22:35
  • try \b([\w\.]+?)\/([\w\.]+)\b @kero Commented Mar 11, 2017 at 23:15
  • 1
    or: while( $line =~ m!([^/\s]+)/([^/\s]+)!g ) { Commented Mar 12, 2017 at 0:08
  • 1
    Try this one: \s*([\w\.,'"]+?)\/([\w\.,'"]+)\s* @kero Commented Mar 12, 2017 at 0:21

2 Answers 2

2

I think you have tag/words that tag and word may be everything, except some characters like ],[,\s,:

\s*([^\[\]\s]+?)\/([^\[\]\s]+)\s*
    ^^^^^^^^^1

This regex is similar to your original pattern. (See DEMO)

Description:

1- This Capturing Group match every character . that is not [,] or \s

Sign up to request clarification or add additional context in comments.

Comments

1

The outer-most set of parentheses, around your whole pattern, gets captured into $1, what is clearly not intended. Also, the greediness of .*\/ means that it takes everything up to the last /. Likewise, .*\s+ leaves only the very last space.

One way to do this is by using the negated character class

my ($word, $tag) = m{ ([^/\s]+) / ([^/\s]+) }x;

The pattern [^/\s]+ matches a string of one-or-more consecutive characters, each being any other than / or whitespace. So you get a "word" before and after /. If you take "whatever after slash" as the text says it is unclear what should be before the next slash.

Your approach can then go as

while (my $line = <$fh>) 
{
    while ( $line =~ m{ ([^/\s]+) / ([^/\s]+) }gx )
    {
        $tagHash{$2}{$1}++;
    }
}

The other count seems unrelated so I left it out to focus on the question.


However, there is a big bit missing here.

This approach cannot detect when a line differs from the expected format. For example

word1/tag1 word2/tag2/ tag3/word4/tag4

produces wrong results, quietly. Some violations get skipped, but there are many bad cases.

One way to catch this is to pre-process the line, checking that there are at least two words between all slashes and at least one before first and after last. This means that each line is processed twice, and it also gets messier. For example

while (my $line = <$fh>) 
{
    my @parts = split '/', $line;
    if (not shift @parts or not pop @parts or grep { 2 > split } @parts) {
        warn "Unexpected format: $line";
        next;
    }

    $tagHash{$2}{$1}++  while $line =~ m{ ([^/\s]+) / ([^/\s]+) }gx;
}

This check changes the @parts array, so if that array is needed later then better use

if (!$parts[0] or !$parts[-1] or grep { 2 > split } @parts[1..@parts-2])  { ...

where instead of grep one can also use the short-circuiting any from List::Util

Another way would be to change the approach, and parse the line carefully instead of blindly hopping over regex matches. Since the first and last may have only one word this may be hard to do with a regex. It is probably clearer and more practical to just split and work with the array.

It is hard to imagine a format always matching data so I'd suggest to consider some of this.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.