
I am parsing a series of text files for some patterns, because I want to extract the matches to another file.

A way to say it is that I would like to "remove" everything except the matches from the file.

For example, if I have pattern1, pattern2, pattern3 as matching patterns, I'd like the following input:

bla bla
pattern1
pattern2
bla bla bla
pattern1
pattern3
bla bla bla
pattern1

To give the following output:

pattern1
pattern2
pattern1
pattern3
pattern1

I can use re.findall and successfully get the list of matches for any pattern, but I cannot think of a way to KEEP THE ORDER considering the matches of each pattern are mixed inside the file.

Thanks for reading.

  • Wrote a copy-and-go solution based on @richard's solution. Commented Aug 1, 2012 at 14:54

2 Answers


Combine it all into a single pattern. With your example input, use the pattern:

^pattern[0-9]+

If it's actually more complex, then try

^(aaaaa|bbbbb|ccccc|ddddd)
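As a minimal sketch of how the combined pattern behaves, here it is applied to the sample input from the question. Two details are assumptions added here, not part of the answer above: the group is written as non-capturing, (?:...), so that re.findall returns the whole match, and re.MULTILINE is passed so that ^ matches at the start of every line rather than only at the start of the text.

```python
import re

# Sample input copied from the question
text = """bla bla
pattern1
pattern2
bla bla bla
pattern1
pattern3
bla bla bla
pattern1
"""

# One alternation covering all three patterns.
# (?:...) is non-capturing, so findall returns the full match.
combined = r"^(?:pattern1|pattern2|pattern3)"

# re.MULTILINE makes ^ match at the start of each line.
matches = re.findall(combined, text, re.MULTILINE)

# Matches come back in the order they appear in the text.
print(matches)
```

Because the regex engine scans the text once from left to right, the matches of the different alternatives stay interleaved in file order.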

8 Comments

I don't think this works for the OP. He has multiple patterns in his regex that he is looking for; "pattern1", "pattern2", etc. are just examples. See my answer.
I'll accept this answer; that's what I wanted to do, I just didn't know (or didn't remember) how. Combining the patterns with | (OR) is the key to keeping the order, because it says "give me any match of the following patterns", so the results come back already in order of appearance.
Oh, I see what you did there now. Yes, the second, "more complex" regex would work. But the OP should still use file.writelines(), iterating over the list that re.findall() returns.
Just for the record, my unholy pattern is: '<p class="docText">.+?</p>|<h3 class="docSection1Title">.+?</h3>' ;o)
@heltonbiker Have you not seen stackoverflow.com/questions/1732348/…? Do not try to parse HTML with regular expressions.

Here is an answer in "copy this and go" format.

import re

# placeholder paths (not in the original answer); point these at your own files
FILE_TO_READ = "input.txt"
FILE_TO_WRITE = "output.txt"

# lets you add more whenever you want
list_of_regex = [r"aaaa", r"bbbb", r"cccc"]

# combine the patterns into a single alternation, joined once with "|";
# (?:...) is non-capturing, so findall returns the whole match
pattern_string = r"^(?:" + "|".join(list_of_regex) + r")"

# read the whole input file; the with block closes it afterwards
with open(FILE_TO_READ) as fr:
    search_string = fr.read()

# re.MULTILINE makes ^ match at the start of every line,
# not only at the start of the file
matches = re.findall(pattern_string, search_string, re.MULTILINE)

# write the matches in order of appearance, one per line (as requested)
with open(FILE_TO_WRITE, "w") as fw:
    fw.writelines(match + "\n" for match in matches)

2 Comments

Based on @richard's answer.
I suspect re.compile could be of some help here, but I'd have to look more thoroughly. Thanks anyway!
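On the re.compile idea from the last comment: a rough sketch, assuming the same combined pattern is applied to the text of several files. Compiling once mainly gives the pattern a name that can be reused, and finditer yields the matches in order of appearance. The helper name extract_matches and the sample strings are hypothetical, chosen only for illustration.

```python
import re

# Compile the combined pattern once; re.MULTILINE lets ^ match at each line start.
COMBINED = re.compile(r"^(?:aaaa|bbbb|cccc)", re.MULTILINE)


def extract_matches(text):
    """Return every match of the combined pattern, in order of appearance."""
    return [m.group(0) for m in COMBINED.finditer(text)]


# Hypothetical contents of two files, used only to show the call
first = "aaaa\nxx\ncccc\nbbbb\n"
second = "xx aaaa\nbbbb\n"

for text in (first, second):
    print(extract_matches(text))
```

In the second sample, "aaaa" is not at the start of a line, so the ^ anchor excludes it and only "bbbb" is returned.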
