Split string with variable number of occurrences using an application language (Batch script preferably)

Question

I have a text file containing colon-separated lines such as the following:

OK-10:Jason:Jones:ID No:00000000:male:my notes
OK-10:Mike:James:ID No:00000001:male:my notes OZ-09:John:Rick:ID No:00000002:male:my notes
OK-08:Michael:Knight:ID No:00000004:male:my notes2 OK-09:Helen:Rick:ID No:00000005:female:my notes3 OZ-10:Jane:James:ID No:00000034:female:my notes23 OK-09:Mary:Jane:ID No:00000023:female:my notes46

Note that not all lines have the same number of terms. I want every line to look like the first one, with exactly seven terms. When a line contains more than one record, each additional record should start on a new line. A new record begins with O&-, where & is either Z or K. So the expected output from the above is:

OK-10:Jason:Jones:ID No:00000000:male:my notes
OK-10:Mike:James:ID No:00000001:male:my notes
OZ-09:John:Rick:ID No:00000002:male:my notes
OK-08:Michael:Knight:ID No:00000004:male:my notes2
OK-09:Helen:Rick:ID No:00000005:female:my notes3
OZ-10:Jane:James:ID No:00000034:female:my notes23
OK-09:Mary:Jane:ID No:00000023:female:my notes46

Can someone suggest a way of doing this using a text editing tool, regex, or maybe an application language such as (preferably) Batch script, Java or Python?

UPDATE

I tried using Python with the regex provided in the answer:

import csv
import re

with open('form.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        matches = re.findall(r'O[KZ]-\d+:(?:[^:]+:){5}.*?(?= O[KZ]|$)', row[29])
        print(matches)

But if a cell contains multiple entries, like:

OK-10:Mike:James:ID No:00000001:male:my notes OZ-09:John:Rick:ID No:00000002:male:my notes

It returns only the first one of them.

I edited your question, which was still worded wrongly. I hope you will find assistance here +1.
– Tim Biegeleisen, Commented Sep 3, 2019 at 11:03

Tim Biegeleisen · Accepted Answer · 2019-09-03 13:04:55Z

Here is a regex-based solution in Python which seems to work well:

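import re

with open('form.csv', 'r') as file:
    # Read the whole file into one string. Join lines with a space so the
    # last record on one line stays separated from the first on the next.
    inp = file.read().replace('\n', ' ').strip()

# Each match is one seven-term record.
matches = re.findall(r'O[KZ]-\d+:(?:[^:]+:){5}.*?(?= O[KZ]|$)', inp)
print(matches)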

This prints:

['OK-10:Mike:James:ID No:00000001:male:my notes',
 'OK-08:Michael:Knight:ID No:00000004:male:my notes2',
 'OK-09:Helen:Rick:ID No:00000005:female:my notes3',
 'OZ-10:Jane:James:ID No:00000034:female:my notes23',
 'OK-09:Mary:Jane:ID No:00000023:female:my notes46']

Here is a brief summary of how the regex pattern works:

O[KZ]-\d+:      match the first OK/OZ-number term
(?:[^:]+:){5}   then match the next five : terms
.*?(?= O[KZ]|$) finally match the remaining sixth term
                until seeing either OK/OZ or the end of the input

The output my script generates is a list, which you may then write back out to a text file, to later import into MySQL. Note that we read the entire file into a single string variable at the beginning. This is necessary to use this regex approach.
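As a minimal sketch of the read, match, and write-back steps described above: the helper names `split_records` and `convert` and the file paths passed to `convert` are hypothetical, and joining lines with a single space is an assumption about how records are laid out in the file.

```python
import re

# Same pattern as in the answer above.
RECORD = re.compile(r'O[KZ]-\d+:(?:[^:]+:){5}.*?(?= O[KZ]|$)')


def split_records(text):
    """Return every seven-term record found in text, in order."""
    # Join physical lines with one space so a record ending one line
    # stays separated from the record starting the next line.
    joined = ' '.join(line.strip() for line in text.splitlines())
    return RECORD.findall(joined)


def convert(src_path, dst_path):
    """Read src_path, then write one record per line to dst_path."""
    # src_path and dst_path are hypothetical file names.
    with open(src_path) as src:
        records = split_records(src.read())
    with open(dst_path, 'w') as dst:
        dst.write('\n'.join(records) + '\n')
```

Calling `convert('input.txt', 'output.txt')` would then produce a file with one record per line. Keeping the matching in `split_records` makes it easy to try on a single string first.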

edited Sep 3, 2019 at 13:04

answered Sep 3, 2019 at 11:10

4 Comments

netdev Over a year ago

Thank you, I have no idea in python so I have to search how to open the csv update it using your solution and save it in order to test it.

Tim Biegeleisen Over a year ago

All you would have to do is read your text file into Python, use my script, and then write the list back out, one entry per line, that is all.

netdev Over a year ago

Yeap I am looking right now on some tutorials, time to get in touch with python :) I will come back when I test it. Thank you again for your help

Tim Biegeleisen Over a year ago

@netdev If you want to use my answer, you will have to read the entire file into a single string variable. Iterating line by line won't work at all, q.v. my updated answer.

Aacini · Accepted Answer · 2019-09-03 13:34:47Z

As simple as:

@echo off
setlocal EnableDelayedExpansion

for /F %%a in ('copy /Z "%~F0" NUL') do (set CRLF=%%a^
%Do not remove this line%
)

(for %%n in ("!CRLF!") do for /F "delims=" %%a in (input.txt) do (
   set "line=%%a"
   for %%d in (Z K) do set "line=!line: O%%d-=%%~nO%%d-!"
   echo(!line!
)) > output.txt
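For readers who do not use Batch, the same idea (insert a line break in front of every " OZ-" or " OK-") can be sketched in Python. This is only an analogue of the script above, not a translation of it: the function name `split_line_on_prefix` is hypothetical, and unlike Batch string substitution, `str.replace` is case-sensitive.

```python
def split_line_on_prefix(line):
    """Insert a newline before every ' OZ-' or ' OK-' in line."""
    for letter in ('Z', 'K'):
        marker = 'O' + letter + '-'
        # Replace the space in front of the marker with a line break.
        line = line.replace(' ' + marker, '\n' + marker)
    return line
```

Like the Batch version, this splits on the marker wherever it occurs, so a notes field that itself contains " OK-" or " OZ-" would be broken in two.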

edited Sep 3, 2019 at 13:34

answered Sep 3, 2019 at 13:27

dbenham · Accepted Answer · 2019-09-03 14:38:20Z

If you think you might have additional file manipulation tasks in the future that would benefit from a general purpose regex text processing utility, then you might consider JREPL.BAT. It is pure script (JScript/batch) that runs on any Windows machine from XP onward - no 3rd party exe file required.

jrepl "((?:[^:]*:){6}.*?) (?=O[KZ]-)" "$1\r\n" /xseq /f "yourFile.txt" /o -

Assuming O[KZ]- does not appear anywhere other than the beginning of each logical line, then you should be able to get away with this simpler regex:

jrepl "\s+(?=O[KZ]-)" "\r\n" /xseq /f "yourFile.txt" /o -
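The simpler pattern is not specific to JREPL. As a rough Python sketch under the same assumption (O[KZ]- appears only at the start of a record), with the hypothetical helper name `break_before_prefix`:

```python
import re


def break_before_prefix(text):
    """Replace the whitespace run before each OK-/OZ- prefix with a newline."""
    # The lookahead leaves the prefix itself in place; only the
    # whitespace in front of it is replaced.
    return re.sub(r'\s+(?=O[KZ]-)', '\n', text)
```

This uses "\n" rather than "\r\n"; when the result is written to a file opened in text mode on Windows, Python translates "\n" to the platform line ending by default.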

Full documentation is built into JREPL, available via jrepl /? or jrepl /?? for paged help. A summary of all options is available via jrepl /?options, and a summary of all types of help is available via jrepl /?help.
