Python extract paragraph in text file using regex

Question

I am using Python 3.7 and trying to extract certain paragraphs from text files using regex.

Here is a sample of the txt file content.

AREA: OMBEYI MARKET, ST. RITA RAMULA

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 5.00 P.M.

Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.

AREA: NYAMACHE FACTORY

DATE: Thursday 25.03.2021, TIME: 830 A.M. - 3.00 P.M.

Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.

AREA: SUNEKA MARKET, RIANA MARKET

DATE: Thursday 25.03.2021, TIME: 8.00 A.M. - 3.00 P.M.

Suneka Mk, Riana Mk, Kiabusura, Gesonso, Chisaro, Sugunana, Nyamira Ndogo & adjacent
customers.

AREA: ITIATI, GITUNDUTI

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 2.00 P.M.

General China, Gachuiro, Gathuini Pri, Itiati Campus, Kianjugum, Gikore, Kihuri TBC, Gitunduti &
adjacent customers.

Currently I am able to extract the Area, Date and Time using regex:

import re

area_pattern = re.compile("^AREA:((.*))")
date_pattern = re.compile("^DATE:(.*),")
time_pattern = re.compile("TIME:(.*).")

I would like to extract the paragraph that follows each DATE/TIME line and precedes the next AREA line; it contains locations separated by commas. The matches should be:

1.
Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.

2.
Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.

3.
Suneka Mk, Riana Mk, Kiabusura, Gesonso, Chisaro, Sugunana, Nyamira Ndogo & adjacent
customers.

4.
General China, Gachuiro, Gathuini Pri, Itiati Campus, Kianjugum, Gikore, Kihuri TBC, Gitunduti &
adjacent customers.

If anyone could suggest a regex for this use case, as well as improvements to my current regexes, I would really appreciate it. Thanks.

anubhava · Accepted Answer · 2021-03-31 13:53:49Z

3

You may use this regex, which has a single capture group, with re.findall:

\nDATE:.*\n*((?:\n.*)+?)(?=\nAREA:|\Z)

RegEx Details:

\nDATE:: Match text DATE: after matching a line break
.*\n*: Match rest of the line followed by 0 or more line breaks
((?:\n.*)+?): Capture group 1 captures our text: 1 or more lines (each a line break followed by the rest of that line), matched lazily until the next condition is satisfied
(?=\nAREA:|\Z): Assert that we have a line break followed by AREA: or end of input right ahead of the current position
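
As a minimal sketch of how this pattern could be used with re.findall: the sample string below copies two entries from the question, while the helper name extract_paragraphs and the whitespace normalisation are illustrative assumptions rather than part of the answer. Group 1 comes back with the line breaks around and inside the paragraph, so the sketch collapses them into single spaces.

```python
import re

# Two entries copied from the question's sample, keeping the blank-line layout.
sample = """AREA: OMBEYI MARKET, ST. RITA RAMULA

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 5.00 P.M.

Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.

AREA: NYAMACHE FACTORY

DATE: Thursday 25.03.2021, TIME: 830 A.M. - 3.00 P.M.

Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.
"""

# The pattern from the answer, written as a raw string so the backslashes
# reach the regex engine unchanged.
paragraph_pattern = re.compile(r"\nDATE:.*\n*((?:\n.*)+?)(?=\nAREA:|\Z)")


def extract_paragraphs(text):
    """Return each captured paragraph as one whitespace-normalised line.

    Hypothetical helper: with a single capture group, findall returns the
    group-1 strings; split()/join() drops the surrounding line breaks and
    joins wrapped lines with a space.
    """
    return [" ".join(block.split()) for block in paragraph_pattern.findall(text)]


paragraphs = extract_paragraphs(sample)
for number, paragraph in enumerate(paragraphs, start=1):
    print(f"{number}. {paragraph}")
```

Note that the pattern starts with \n, so a DATE: line on the very first line of the input would not be matched; in the sample every DATE: line follows an AREA: line, so this does not come up.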

1 Comment

Kimkykie Over a year ago

Thank you so much. This worked, and your regex details have given me a better understanding of how to go about similar use cases.

The fourth bird · Accepted Answer · 2021-03-31 15:25:52Z

2

As an alternative pattern:

^DATE:.*((?:\n(?!AREA:).*)+)

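^DATE:.* Match DATE: at the start of a line and the rest of that line (in Python, ^ matches at the start of every line only when re.MULTILINE is set)
( Capture group 1
  (?:\n(?!AREA:).*)+ Repeat 1+ times: a line break followed by a line that does not start with AREA:
) Close group 1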
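
A short sketch of this alternative in Python, assuming the re.MULTILINE flag so that ^ can match at each DATE: line. The sample string copies the last two entries from the question; the helper name extract_location_lists and the comma split are illustrative assumptions, not part of the answer.

```python
import re

# The last two entries copied from the question's sample.
sample = """AREA: SUNEKA MARKET, RIANA MARKET

DATE: Thursday 25.03.2021, TIME: 8.00 A.M. - 3.00 P.M.

Suneka Mk, Riana Mk, Kiabusura, Gesonso, Chisaro, Sugunana, Nyamira Ndogo & adjacent
customers.

AREA: ITIATI, GITUNDUTI

DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 2.00 P.M.

General China, Gachuiro, Gathuini Pri, Itiati Campus, Kianjugum, Gikore, Kihuri TBC, Gitunduti &
adjacent customers.
"""

# re.MULTILINE makes ^ match at the start of every line,
# not only at the start of the whole string.
block_pattern = re.compile(r"^DATE:.*((?:\n(?!AREA:).*)+)", re.MULTILINE)


def extract_location_lists(text):
    """Return one list of comma-separated entries per DATE block.

    Hypothetical helper: group 1 includes the blank lines around the
    paragraph, so whitespace is collapsed before splitting on commas.
    """
    results = []
    for match in block_pattern.finditer(text):
        paragraph = " ".join(match.group(1).split())
        results.append([entry.strip() for entry in paragraph.split(",")])
    return results


location_lists = extract_location_lists(sample)
for entries in location_lists:
    print(entries)
```

The last entry of each list still carries the "& adjacent customers." suffix, because the split is on commas only.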

1 Comment

Kimkykie Over a year ago

Thank you for this alternative pattern. It has worked!
