I am using Python 3.7 and trying to extract certain paragraphs from some text files using regex.

Here is a sample of the text file content:

AREA: OMBEYI MARKET, ST. RITA RAMULA

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 5.00 P.M.

Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.

AREA: NYAMACHE FACTORY

DATE: Thursday 25.03.2021, TIME: 830 A.M. - 3.00 P.M.

Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.

AREA: SUNEKA MARKET, RIANA MARKET

DATE: Thursday 25.03.2021, TIME: 8.00 A.M. - 3.00 P.M.

Suneka Mk, Riana Mk, Kiabusura, Gesonso, Chisaro, Sugunana, Nyamira Ndogo & adjacent
customers.

AREA: ITIATI, GITUNDUTI

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 2.00 P.M.

General China, Gachuiro, Gathuini Pri, Itiati Campus, Kianjugum, Gikore, Kihuri TBC, Gitunduti &
adjacent customers.

Currently I am able to extract the Area, Date and Time using regex:

import re

area_pattern = re.compile(r"^AREA:((.*))")
date_pattern = re.compile(r"^DATE:(.*),")
time_pattern = re.compile(r"TIME:(.*).")

I would like to be able to extract the paragraph after DATE/TIME and before AREA containing locations separated by commas. So I will be able to match the following:

1.
Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.

2.
Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.

3.
Suneka Mk, Riana Mk, Kiabusura, Gesonso, Chisaro, Sugunana, Nyamira Ndogo & adjacent
customers.

4.
General China, Gachuiro, Gathuini Pri, Itiati Campus, Kianjugum, Gikore, Kihuri TBC, Gitunduti &
adjacent customers.

If anyone could help with suggesting a regex that would help with this use case, as well as improvements to my current regex, I would really appreciate it. Thanks

2 Answers

You may use this regex with a capture group to be used in re.findall:

\nDATE:.*\n*((?:\n.*)+?)(?=\nAREA:|\Z)

RegEx Details:

  • \nDATE:: Match text DATE: after matching a line break
  • .*\n*: Match rest of the line followed by 0 or more line breaks
  • ((?:\n.*)+?): Capture group 1, which lazily captures 1 or more lines of text until the next condition is satisfied
  • (?=\nAREA:|\Z): Assert that we have a line break followed by AREA: or end of input right ahead of the current position
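
As a rough illustration, here is a minimal sketch of how this pattern could be used with re.findall. The sample string is an excerpt of the text from the question; the variable names and the whitespace clean-up step are assumptions about the desired output, not part of the regex itself.

```python
import re

# Excerpt of the sample text from the question (first two areas only).
sample = """AREA: OMBEYI MARKET, ST. RITA RAMULA

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 5.00 P.M.

Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.

AREA: NYAMACHE FACTORY

DATE: Thursday 25.03.2021, TIME: 830 A.M. - 3.00 P.M.

Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.
"""

# Group 1 captures the lines after the DATE line, up to the next AREA: or end of input.
pattern = re.compile(r"\nDATE:.*\n*((?:\n.*)+?)(?=\nAREA:|\Z)")

# Each captured block still contains line breaks, so collapse all
# runs of whitespace into single spaces (an assumed clean-up step).
paragraphs = [" ".join(block.split()) for block in pattern.findall(sample)]

for number, paragraph in enumerate(paragraphs, start=1):
    print(number, paragraph)
```

Because there is a single capture group, re.findall returns only the captured text for each match, not the DATE line itself.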

1 Comment

Thank you so much. This worked and your regex details have given me a better understanding on how to go about similar use cases.

As an alternative pattern:

^DATE:.*((?:\n(?!AREA:).*)+)
  • ^DATE:.* Match DATE: and the rest of the line
  • ( Capture group 1
    • (?:\n(?!AREA:).*)+ Repeat 1+ lines that do not start with AREA:
  • ) Close group 1
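
A minimal sketch of how this pattern might be applied in Python follows. Note that ^ only matches at the start of each line when the re.MULTILINE flag is set. The sample string is an excerpt of the question's text, and the variable names and whitespace normalisation are illustrative assumptions.

```python
import re

# Excerpt of the sample text from the question (last two areas only).
sample = """AREA: SUNEKA MARKET, RIANA MARKET

DATE: Thursday 25.03.2021, TIME: 8.00 A.M. - 3.00 P.M.

Suneka Mk, Riana Mk, Kiabusura, Gesonso, Chisaro, Sugunana, Nyamira Ndogo & adjacent
customers.

AREA: ITIATI, GITUNDUTI

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 2.00 P.M.

General China, Gachuiro, Gathuini Pri, Itiati Campus, Kianjugum, Gikore, Kihuri TBC, Gitunduti &
adjacent customers.
"""

# re.MULTILINE lets ^ anchor at the start of every line, not just the start of the string.
pattern = re.compile(r"^DATE:.*((?:\n(?!AREA:).*)+)", re.MULTILINE)

# Group 1 also picks up the blank lines around each paragraph, so
# collapse all whitespace runs into single spaces (an assumed clean-up step).
paragraphs = [" ".join(block.split()) for block in pattern.findall(sample)]

for number, paragraph in enumerate(paragraphs, start=1):
    print(number, paragraph)
```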

1 Comment

Thank you for this alternative pattern. It has worked!
