I have a text file containing colon separated lines such as the following:
OK-10:Jason:Jones:ID No:00000000:male:my notes
OK-10:Mike:James:ID No:00000001:male:my notes OZ-09:John:Rick:ID No:00000002:male:my notes
OK-08:Michael:Knight:ID No:00000004:male:my notes2 OK-09:Helen:Rick:ID No:00000005:female:my notes3 OZ-10:Jane:James:ID No:00000034:female:my notes23 OK-09:Mary:Jane:ID No:00000023:female:my notes46
Note carefully that not all lines have the same number of terms. I want each line to appear like the first one, namely with seven terms only. For lines that run over, a new line should be formed. New line delimiter is O&- where & can be Z or K only. So the expected output from the above is:
OK-10:Jason:Jones:ID No:00000000:male:my notes
OK-10:Mike:James:ID No:00000001:male:my notes
OZ-09:John:Rick:ID No:00000002:male:my notes
OK-08:Michael:Knight:ID No:00000004:male:my notes2
OK-09:Helen:Rick:ID No:00000005:female:my notes3
OZ-10:Jane:James:ID No:00000034:female:my notes23
OK-09:Mary:Jane:ID No:00000023:female:my notes46
Can someone suggest a way of doing this using a text editing tool, regex, or maybe an application language such as (preferably) Batch script, Java or Python?
UPDATE
I tried using python and the regex code provided in the answer:
import csv import re
with open('form.csv') as csv_file:
csv_reader = csv.reader(csv_file, delimiter=',')
for row in csv_reader:
matches = re.findall(r'O[KZ]-\d+:(?:[^:]+:){5}.*?(?= O[KZ]|$)', row[29])
print(matches)
But if a cell contains multiple entries like :
OK-10:Mike:James:ID No:00000001:male:my notes OZ-09:John:Rick:ID No:00000002:male:my notes
It returns only the first one of them.