python regex - select words after pattern

Question

text:

MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38.

Question:

The text contains sections 3, 5, 25, and 38 (each section starts with its index number). For each section, I want to extract all the text after '- Comments:' and before the index number of the next section.

def comments(x):
    result = []
    for elem in df['Violations']:
        matches = re.findall(r'\d+\. (.*?)(?: - |\r?\n|$)', elem)
        result.extend(matches)
    print(result)

The attached code does the opposite: it extracts only the words before '- Comments:'. How can I change it?

Many thanks

If you want the text after Comments, then use Comments in the regex. Maybe 'Comments: (.+) \|' or 'Comments: ([^\|]*) \|'
– furas, Commented Oct 28, 2021 at 19:23
Sorry, I am very new to regex. Could you write the full code in re.findall()? Thanks a lot!
– Yumeng Xu, Commented Oct 28, 2021 at 19:28

furas · Accepted Answer · 2021-10-28 19:36:53Z

If you want the text between Comments: and |, then use those two delimiters in the regex.

'Comments: ([^\|]*) \|'

It uses () to capture only the characters between Comments: and |, and [^\|] to match any character other than |.

Because | has a special meaning in regex (alternation), I write \| to match it as a literal character.

Or

'Comments: (.*?) \|'

which uses ? to make .* non-greedy, so the capture stops at the first " |" after Comments: instead of running on to the last one.
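To see why the ? matters, here is a minimal sketch comparing a greedy pattern with the two patterns above. The sample string is a shortened, made-up stand-in for the real text, and the variable names are only for illustration.

```python
import re

# Shortened, made-up sample in the same "title - Comments: text | next" layout.
sample = "3. A - Comments: FIRST. | 5. B - Comments: SECOND. | 38."

# Greedy: .* runs to the LAST " |" in the string, so the sections get merged.
greedy = re.findall(r'Comments: (.*) \|', sample)

# Lazy: .*? stops at the FIRST " |" after each "Comments: ".
lazy = re.findall(r'Comments: (.*?) \|', sample)

# Negated class: [^|]* can never cross a "|" character.
negated = re.findall(r'Comments: ([^|]*) \|', sample)

print(greedy)   # one merged string
print(lazy)     # one string per section
print(negated)  # same as lazy for this sample
```

Inside a character class | is already literal, so [^|] and [^\|] behave the same.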

import re

elem = '''MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38.'''

# Raw strings (r'...') pass the backslash in \| to the regex engine unchanged.
#matches = re.findall(r'Comments: ([^\|]*) \|', elem)
matches = re.findall(r'Comments: (.*?) \|', elem)

#print(matches)

for item in matches:
    print(item)
    print('---')

Result:

NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED.
---
NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED.
---
MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED.
---
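To plug this back into the loop from the question, something like the sketch below might work. The name extract_comments and the sample rows are made up for illustration; the function takes any iterable of strings, so it does not depend on pandas, and it returns the list instead of printing it.

```python
import re

# Same lazy pattern as above, compiled once and reused for every row.
COMMENT_RE = re.compile(r'Comments: (.*?) \|')

def extract_comments(rows):
    """Return every comment found in an iterable of violation strings."""
    result = []
    for elem in rows:
        # Skip missing values (for example None or NaN), which are not strings.
        if not isinstance(elem, str):
            continue
        result.extend(COMMENT_RE.findall(elem))
    return result

# Hypothetical rows standing in for the real column values.
rows = [
    "3. A - Comments: FIRST. | 5. B - Comments: SECOND. | 38.",
    None,
    "7. C - Comments: THIRD. | 9.",
]
print(extract_comments(rows))
```

With the DataFrame from the question, the call would presumably be extract_comments(df['Violations']). Note that this pattern needs a " |" after each comment, so a last section with no trailing " |" would be skipped.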

The fourth bird · Accepted Answer · 2021-10-28 21:47:52Z

Your pattern captures as little text as possible in a group before either " - ", a newline or the end of the string, and it does not match any part containing Comments:

You could change it by matching - Comments: and adding a capture group for the text after it:

\d+\. .*?(?: - Comments:\s*)(.*?)(?: \||$)
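As a quick check, this sketch runs the pattern from the question and the changed pattern on the same input. The sample string is a shortened, made-up version of the real text.

```python
import re

# Shortened, made-up sample with two sections.
sample = "3. FOO BAR - Comments: NO POLICY. | 5. BAZ QUX - Comments: MUST PROVIDE."

# Pattern from the question: the group sits BEFORE " - ", so it yields titles.
titles = re.findall(r'\d+\. (.*?)(?: - |\r?\n|$)', sample)

# Changed pattern: the title is matched but not captured,
# and the group sits AFTER "- Comments:".
comments = re.findall(r'\d+\. .*?(?: - Comments:\s*)(.*?)(?: \||$)', sample)

print(titles)
print(comments)
```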


A slightly more precise approach is to match the start of each section, which is digits, a dot and a space, and then match up to the first occurrence of - Comments: without crossing the start of another section.

After - Comments:, you can use a capture group to capture everything up to the next " |", or up to the end of the string if it is the last section.

Because the pattern contains a single capture group, re.findall returns the value of group 1 for each match.

\b\d+\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)

The pattern matches:

\b A word boundary to prevent a partial word match
\d+\. Match 1+ digits and a dot, followed by a literal space
(?:(?!\d+\. |- Comments:).)* Match any character, as long as neither \d+\. plus a space nor - Comments: starts at that position
- Comments:\s* Match - Comments: followed by optional whitespace characters
(.*?) Capture group 1, match any character, as few as possible
(?: \||$) Match either a space and a literal |, or assert the end of the string
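The breakdown above can also live inside the code. Below is a sketch of the same pattern written with re.VERBOSE, where unescaped whitespace is ignored and # starts a comment, so each literal space is written as [ ]. The names SECTION_RE and sample are made up for illustration, and the sample is a shortened stand-in for the real text.

```python
import re

SECTION_RE = re.compile(r"""
    \b\d+\.[ ]                        # section number, a dot and a space
    (?:(?!\d+\.[ ]|-[ ]Comments:).)*  # title, without crossing a new section or the marker
    -[ ]Comments:\s*                  # the literal marker and optional whitespace
    (.*?)                             # group 1: the comment text, as short as possible
    (?:[ ]\||$)                       # a space and a literal pipe, or the end of the string
""", re.VERBOSE)

# Shortened, made-up sample; the last section has no trailing pipe.
sample = "3. A B - Comments: FIRST. | 5. C - Comments: SECOND."

print(SECTION_RE.findall(sample))
```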


Example

import re

regex = r"\b\d+\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)"

s = "3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED.  | 38. "

print(re.findall(regex, s))

Output

[
'NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED.', 
'NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED.', 
'MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. '
]
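If the section number is needed as well, one possible variation is to add a second capture group around the digits. With two groups, re.findall returns a list of tuples, which can be turned into a dict. This is only a sketch: PAIR_RE, by_section and the shortened sample are made up, and strip() removes stray whitespace such as the trailing space in the last item above.

```python
import re

# Same pattern as above, with an extra group around the section number.
PAIR_RE = re.compile(
    r"\b(\d+)\. (?:(?!\d+\. |- Comments:).)*- Comments:\s*(.*?)(?: \||$)"
)

# Shortened, made-up sample; "38. " has no comment and is not matched.
sample = "3. A - Comments: FIRST. | 5. B - Comments: SECOND. | 38. "

# findall yields (number, comment) tuples because there are two groups.
by_section = {int(num): text.strip() for num, text in PAIR_RE.findall(sample)}

print(by_section)
```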
