1

What single regex would let me capture all the text that follows "are genes" or "is gene" in lines like these?

The closest human genes of best are genes A B C
The closest human gene of best is gene A 

So I want $1 to contain:

A B C
A 

I tried this, but it fails on the second line:

$line =~ /The closest .* gene[s] (.*)$/;
1
  • Do you also need to avoid illegal strings like "... gene of best are A"? Commented Apr 14, 2010 at 11:38

5 Answers

4
$line =~ /The closest .* genes? (.*)$/;
1 Comment

+1 for matching requester's example as close as possible, but this could benefit from some information explaining that [s] is the same as s, [s ] would have been what he was trying to accomplish with that, and that s? is equivalent.
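To make the difference concrete, here is a small sketch that runs the original `gene[s]` pattern and the `genes?` pattern over the two sample lines from the question. The helper name `capture` is made up for illustration, and the `//` operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first capture group, or undef on no match.
sub capture {
    my ($re, $line) = @_;
    my ($captured) = $line =~ $re;
    return $captured;
}

my @lines = (
    'The closest human genes of best are genes A B C',
    'The closest human gene of best is gene A',
);

for my $line (@lines) {
    # [s] is a one-character class, so "s" is still mandatory here.
    my $with_class = capture(qr/The closest .* gene[s] (.*)$/, $line);
    # s? makes the "s" optional, so both "gene" and "genes" match.
    my $with_optional = capture(qr/The closest .* genes? (.*)$/, $line);

    print 'gene[s]: ', $with_class    // '(no match)', "\n";
    print 'genes?:  ', $with_optional // '(no match)', "\n";
}
```

Both patterns capture `A B C` from the first line, but only `genes?` matches the second line, because `gene[s]` still insists on a literal "s".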
3

I think the most explicit is:

$line =~ m/best \s (?:is \s gene|are \s genes) \s (\p{IsUpper} (?: \s \p{IsUpper})*)/x;

Of course if you know that all sentences are going to be grammatical, then you can do the (?:are|is) thing. And if you know that you're only going to have genes A-N or something, you can forget the \p{IsUpper} and use [A-N].
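As a rough check of what the explicit form buys you, the sketch below assumes single-letter uppercase gene names, as in the question, and also feeds in an ungrammatical line. The helper name `explicit_genes` and the third test line are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the gene list, or undef when the line
# does not have the exact "is gene" / "are genes" wording.
sub explicit_genes {
    my ($line) = @_;
    my ($genes) = $line =~ m/best \s (?:is \s gene|are \s genes) \s
                             ( \p{IsUpper} (?: \s \p{IsUpper} )* )/x;
    return $genes;
}

for my $line (
    'The closest human genes of best are genes A B C',
    'The closest human gene of best is gene A',
    'The closest human gene of best is genes A B',    # ungrammatical on purpose
) {
    my $genes = explicit_genes($line);
    print defined $genes ? "matched: $genes\n" : "no match\n";
}
```

The first two lines match and yield `A B C` and `A`. The third does not match, because "is" is only allowed before the singular "gene".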

2
$ perl -F/genes*/ -ane 'print $F[-1];' file
 A B C
 A
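If you want the same split-based idea inside a script rather than as a one-liner, something like the following should work. It is only a sketch: `last_field` is a made-up helper, and the trimming step is an addition to get rid of the leading space visible in the output above.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring: perl -F/genes*/ -ane 'print $F[-1];'
# with surrounding whitespace trimmed from the last field.
sub last_field {
    my ($line) = @_;
    my @fields = split /genes*/, $line;   # split on "gene" plus any number of "s"
    my $last = $fields[-1];
    $last =~ s/^\s+|\s+$//g;              # drop leading and trailing whitespace
    return $last;
}

print last_field($_), "\n" for (
    'The closest human genes of best are genes A B C',
    'The closest human gene of best is gene A ',
);
```

Note that this takes whatever follows the last "gene" or "genes" on the line, so it does not check the rest of the sentence at all.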

2

Use non-greedy at the beginning to reduce the opportunities for surprises. Use non-capturing parens to group alternatives that you don't care about. Append ? to a letter to make it optional. Hence, try this:

$line =~ /The closest .*? (?:is|are) genes? (.*)$/;

To see where you were going wrong BTW, just compare the above with what you were originally trying.

5 Comments

It captures some cases that are bad grammar too (“The closest ... is genes ..”) but that's hardly important, yes? :-)
if it's not important why bother with that non-capturing group at all?
@SilentGhost: Without it, you'll capture from the first instance of the word "gene" to the end, e.g., “of best are genes A B C”.
that's only because of using non-greedy quantifier
There aren't really enough input data samples in the question to be able to work out what is wanted. I personally prefer to match more in the fixed portion of the pattern to reduce the number of landmines^Wsurprises in the matched text.
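The point debated in the comments above can be checked directly. This sketch runs three variants of the pattern against the first sample line. The labels and the helper `capture` are invented for illustration, and `//` assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first capture group, or undef on no match.
sub capture {
    my ($re, $line) = @_;
    my ($captured) = $line =~ $re;
    return $captured;
}

my $line = 'The closest human genes of best are genes A B C';

my @patterns = (
    [ 'greedy, no group'      => qr/The closest .* genes? (.*)$/ ],
    [ 'non-greedy, no group'  => qr/The closest .*? genes? (.*)$/ ],
    [ 'non-greedy with group' => qr/The closest .*? (?:is|are) genes? (.*)$/ ],
);

for my $pair (@patterns) {
    my ($label, $re) = @$pair;
    print "$label: ", capture($re, $line) // '(no match)', "\n";
}
```

The greedy version and the non-greedy version with `(?:is|are)` both capture `A B C`. The non-greedy version without the group stops at the first "genes" and captures `of best are genes A B C`, which is the case the group is there to prevent.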
0

In addition to the other suggestions, I would recommend having a look at perlre, the Perl regular expressions documentation.