
I have a text corpus of 11 files, each with about 190,000 lines. I have 10 strings, one or more of which may appear in each line of the corpus.

When I encounter any of the 10 strings, I need to record which of them appear in that line. The brute-force way of looping through the regexes for every line and marking each match is taking a long time. Is there a more efficient way of doing this?

I found a post (Match a line with multiple regex using Python) which provides a True or False output. But how do I record which regexes matched the line?

any(regex.match(line) for regex in [regex1, regex2, regex3])

Edit: adding example

regex = ['quick','brown','fox']
line1 = "quick brown fox jumps on the lazy dog" # I need to be able to record all of quick, brown and fox
line2 = "quick dog and brown rabbit ran together" # I should record quick and brown
line3 = "fox was quick and rabbit was slow" # I should be able to record quick and fox

Looping through the regexes and recording each one that matches is one solution, but at this scale (11 * 190000 * 10) my script has been running for a while now. I need to repeat this many times in my work, so I am looking for a more efficient way.

  • What are the regex that you're trying to match? You can probably combine them into 1 regex pretty easily ... Commented Oct 23, 2012 at 12:33
  • I think you need to provide a more detailed explanation of what you're actually trying to do. I don't understand "record that string which appears in the line separately" - what exactly do you mean by "record"? Do you want to record the match, the regex that matched, the line where the regex matched, the position in the line where the regex matched? What if there are matches on more than one line? Does that matter? Etc. Commented Oct 23, 2012 at 12:50
  • @TimPietzcker hope the additional information in the edit helps explain my problem? Commented Oct 23, 2012 at 12:53
  • Not sure yet - so you want your result as a list like [["quick", "brown", "fox"], ["quick", "brown"], ["fox", "quick"]]? What if a line doesn't match at all? Do you want the match or the regex in this list (here they are identical but what about a regex like qu\w*ck)? Commented Oct 23, 2012 at 12:58
  • @TimPietzcker The output you have suggested is right. I need the regex in the list, not the match. If there is no match I will record '' (the empty string). Sorry for the confusion! Commented Oct 23, 2012 at 13:09

2 Answers


The approach below applies if you want the matches. If you need a list of the regular expressions that triggered a match, you are out of luck and will probably need to loop.

Based on the link you have provided:

import re
regexes = 'quick', 'brown', 'fox'
# Join the patterns into one alternation; (?:...) keeps each one isolated.
combinedRegex = re.compile('|'.join('(?:{0})'.format(x) for x in regexes))

lines = 'The quick brown fox jumps over the lazy dog', 'Lorem ipsum dolor sit amet', 'The lazy dog jumps over the fox'

for line in lines:
    print(combinedRegex.findall(line))

outputs:

['quick', 'brown', 'fox']
[]
['fox']

The point here is that you do not loop over the regexes but combine them into one. The difference from the looping approach is that re.findall will not find overlapping matches. For instance, if your regexes were regexes = 'bro', 'own', the output for the lines above would be:

['bro']
[]
[]

whereas the looping approach would result in:

['bro', 'own']
[]
[]
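
If the regex itself is needed rather than the matched text, one possible way to stay with a single combined pattern is to wrap each regex in its own capturing group and map the group index back to the source list. This is only a sketch: the names matching_patterns and combined are made up for illustration, and it assumes that none of the individual regexes contain capturing groups of their own. The same non-overlap caveat applies, and when two regexes could match at the same position the one listed first wins.

```python
import re

patterns = ['quick', 'brown', 'fox']  # assumption: no capturing groups inside these

# One capturing group per pattern, so group i belongs to patterns[i - 1].
combined = re.compile('|'.join('({0})'.format(p) for p in patterns))

def matching_patterns(line):
    """Return the patterns that matched in line, in order of first appearance."""
    found = []
    for m in combined.finditer(line):
        # Only one top-level group takes part in each match,
        # so lastindex identifies the pattern that fired.
        pattern = patterns[m.lastindex - 1]
        if pattern not in found:
            found.append(pattern)
    return found

for line in ('quick brown fox jumps on the lazy dog',
             'fox was quick an rabit was slow',
             'Lorem ipsum dolor sit amet'):
    print(matching_patterns(line))
```

With a regex such as qu\w*ck in the list, this records the regex string itself, not the text "quick" that it matched.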

1 Comment

I will try this and respond back here.

If you're just trying to match literal strings, it's probably easier to just do:

strings = 'foo','bar','baz','qux'
regex = re.compile('|'.join(re.escape(x) for x in strings))

and then you can test the whole thing at once:

match = regex.match(line)

Of course, you can get the string which matched from the resulting MatchObject:

if match:
    matching_string = match.group(0)

In action:

import re
strings = 'foo','bar','baz','qux'
regex = re.compile('|'.join(re.escape(x) for x in strings))

lines = 'foo is a word I know', 'baz is a  word I know', 'buz is unfamiliar to me'

for line in lines:
    match = regex.match(line)  # match() only tests the start of the line
    if match:
        print(match.group(0))

It appears that you're really looking to search the string for your regex. In that case, you'll need to use re.search (or some variant), not re.match, no matter what you do. As long as none of your regular expressions overlap, you can use the solution posted above with re.findall:

matches = regex.findall(line)
for word in matches:
    print ("found {word} in line".format(word=word))
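
Putting this together for the example in the question, a rough sketch might look like the following. The helper name record_matches is hypothetical, and two things are assumptions on my part: each string is recorded once per line in the order of the strings list, and a line with no match is recorded as [''] (based on the comment about recording an empty string).

```python
import re

strings = ['quick', 'brown', 'fox']
regex = re.compile('|'.join(re.escape(s) for s in strings))

def record_matches(lines):
    """For each line, list the strings that occur in it ([''] if none do)."""
    results = []
    for line in lines:
        found = set(regex.findall(line))  # one pass over the line
        # Keep the order of the strings list; fall back to [''] when empty.
        results.append([s for s in strings if s in found] or [''])
    return results

lines = ['quick brown fox jumps on the lazy dog',
         'quick dog and brown rabbit ran together',
         'fox was quick an rabit was slow',
         'nothing to see here']
for found in record_matches(lines):
    print(found)
```

Each line is scanned once regardless of how many strings there are, which is where any speedup over the per-regex loop would come from. It is worth timing on the real corpus before relying on it.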

6 Comments

reading comprehension... he needs a True/False result for each individual regex.
@l4mpi -- I don't think so -- the question states "How do I record the (singular, emphasis mine) matching regex from the line" ... You can figure out which regex matched from the corresponding MatchObject (e.g. match.group(0))
from OP: "I need to record that string which appears in the line separately."
For example, my regexes are ['quick', 'brown', 'fox'] and I have a line "the jumping brown dog scared the quick fox away". Here I need to be able to record the words quick, brown and fox, since all three are present in the line. A simple compile and match will not help me here, right?
@mgilson this doesn't handle multiple words in the same line, e.g. "foo bar baz" prints "foo" - not sure if OP needs this though (edit: he does^^).
