text:
- MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL EMPLOYEE; KNOWLEDGE, RESPONSIBILITIES AND REPORTING - Comments: NO WRITTEN EMPLOYEE HEALTH POLICY ON SITE. MUST PROVIDE FOR ALL EMPLOYEES. PRIORITY FOUNDATION VIOLATION 7-38-010 CITATION ISSUED. | 5. PROCEDURES FOR RESPONDING TO VOMITING AND DIARRHEAL EVENTS - Comments: NO WRITTEN CLEANING PROCEDURE OR REQUIRED EQUIPMENT FOR A VOMITING/DIARRHEA EVENT. MUST PROVIDE. PRIORITY FOUNDATION VIOLATION 7-38-005 CITATION ISSUED. | 25. CONSUMER ADVISORY PROVIDED FOR RAW/UNDERCOOKED FOOD - Comments: MENU DOES NOT DISCLOSE AND INFORM CONSUMERS THE SPECIFIC MENU ITEMS THAT ARE RAW OR UNDER COOKED AND A POTENTIAL HAZARD OF CONSUMING SUCH FOOD. MUST PROVIDE A CONSUMER ADVISORY THAT DISCLOSES AND REMINDS CUSTOMERS OF SUCH ITEMS. PRIORITY FOUNDATION VIOLATION. NO CITATION ISSUED. | 38.
Question:
The sections of the text include sections 3, 5, 25, and 38 (followed by starting index). I want to extract all texts from one section after '- Comments:' and before the starting index of the next section.
def comments(x):
result = []
for elem in df['Violations']:
matches = re.findall(r'\d+\. (.*?)(?: - |\r?\n|$)', elem)
result.extend(matches)
print(result)
The attached code is doing the totally opposite extraction which only extracts the words before '- Comments:', how can I change it?
Many thanks
Commentsthen useCommentsin regex. Maybe"Comments: (.+) \|or'Comments: ([^\|]*) \|'