
I have a text file containing colon-separated lines such as the following:

OK-10:Jason:Jones:ID No:00000000:male:my notes                                                                                                                                                       
OK-10:Mike:James:ID No:00000001:male:my notes OZ-09:John:Rick:ID No:00000002:male:my notes
OK-08:Michael:Knight:ID No:00000004:male:my notes2 OK-09:Helen:Rick:ID No:00000005:female:my notes3 OZ-10:Jane:James:ID No:00000034:female:my notes23 OK-09:Mary:Jane:ID No:00000023:female:my notes46

Note that not all lines contain the same number of terms. I want every line to look like the first one, i.e. with exactly seven terms. Any line that runs over should be split into new lines. A new record starts at O&-, where & can only be Z or K. So the expected output from the above is:

OK-10:Jason:Jones:ID No:00000000:male:my notes                                                                                                                                                       
OK-10:Mike:James:ID No:00000001:male:my notes
OZ-09:John:Rick:ID No:00000002:male:my notes
OK-08:Michael:Knight:ID No:00000004:male:my notes2
OK-09:Helen:Rick:ID No:00000005:female:my notes3
OZ-10:Jane:James:ID No:00000034:female:my notes23
OK-09:Mary:Jane:ID No:00000023:female:my notes46

Can someone suggest a way of doing this with a text editing tool, a regex, or a scripting or programming language such as (preferably) Batch script, Java or Python?

UPDATE

I tried using Python with the regex provided in the answer:

import csv
import re

with open('form.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        matches = re.findall(r'O[KZ]-\d+:(?:[^:]+:){5}.*?(?= O[KZ]|$)', row[29])
        print(matches)

But if a cell contains multiple entries like:

OK-10:Mike:James:ID No:00000001:male:my notes OZ-09:John:Rick:ID No:00000002:male:my notes

It returns only the first one of them.

    I edited your question, which was still worded wrongly. I hope you will find assistance here +1. Commented Sep 3, 2019 at 11:03

3 Answers


Here is a regex based solution in Python which seems to work well:

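import re

with open('form.csv', 'r') as file:
    # Join the lines with a space (not an empty string) so that a record
    # starting on a new line is still preceded by the " O[KZ]" boundary.
    inp = file.read().replace('\n', ' ').strip()

matches = re.findall(r'O[KZ]-\d+:(?:[^:]+:){5}.*?(?= O[KZ]|$)', inp)
print(matches)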

This prints:

['OK-10:Mike:James:ID No:00000001:male:my notes',
 'OK-08:Michael:Knight:ID No:00000004:male:my notes2',
 'OK-09:Helen:Rick:ID No:00000005:female:my notes3',
 'OZ-10:Jane:James:ID No:00000034:female:my notes23',
 'OK-09:Mary:Jane:ID No:00000023:female:my notes46']

Here is a brief summary of how the regex pattern works:

O[KZ]-\d+:      match the first OK/OZ-number term
(?:[^:]+:){5}   then match the next five : terms
.*?(?= O[KZ]|$) finally match the remaining sixth term
                until seeing either OK/OZ or the end of the input
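To make the breakdown above concrete, here is a minimal sketch that assembles the same pattern from its three parts and runs it on the two-entry cell quoted in the question's update. The constant names and the find_records helper are made up for illustration; only the pattern itself comes from this answer.

```python
import re

# The three parts of the pattern, named only for this illustration.
FIRST_TERM = r'O[KZ]-\d+:'       # the leading OK/OZ-number term, e.g. "OK-10:"
MIDDLE_TERMS = r'(?:[^:]+:){5}'  # the next five colon-terminated terms
LAST_TERM = r'.*?(?= O[KZ]|$)'   # the final term, lazily, up to " OK"/" OZ" or end of input

RECORD = re.compile(FIRST_TERM + MIDDLE_TERMS + LAST_TERM)


def find_records(text):
    """Return every seven-term record found in a single string."""
    return RECORD.findall(text)


cell = ('OK-10:Mike:James:ID No:00000001:male:my notes '
        'OZ-09:John:Rick:ID No:00000002:male:my notes')
for record in find_records(cell):
    print(record)
```

Since the pattern contains no capturing groups, findall returns the full text of each match rather than a tuple of groups.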

The output my script generates is a list, which you may then write back out to a text file, to later import into MySQL. Note that we read the entire file into a single string variable at the beginning. This is necessary to use this regex approach.
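As a rough sketch of that round trip, assuming the input is a plain text file and each record should end up on its own line: the function names and file names below are placeholders, and stripping each line before joining is an extra step I am assuming is wanted, so that trailing whitespace does not leak into the records.

```python
import re

RECORD = re.compile(r'O[KZ]-\d+:(?:[^:]+:){5}.*?(?= O[KZ]|$)')


def extract_records(text):
    """Join all non-blank lines with single spaces, then pull out each record."""
    joined = ' '.join(line.strip() for line in text.splitlines() if line.strip())
    return RECORD.findall(joined)


def rewrite_file(src_path, dst_path):
    """Read src_path as one string and write one record per line to dst_path."""
    with open(src_path, 'r') as src:
        records = extract_records(src.read())
    with open(dst_path, 'w') as dst:
        dst.writelines(record + '\n' for record in records)
    return len(records)


# Hypothetical usage; 'input.txt' and 'output.txt' are placeholder names:
# rewrite_file('input.txt', 'output.txt')
```

Keeping the extraction in its own function makes it easy to try the pattern on a short string before pointing it at the real file.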


4 Comments

Thank you. I don't know Python, so I will have to look up how to open the CSV, update it using your solution, and save it in order to test it.
All you would have to do is read your text file into Python, use my script, and then write the list back out, one entry per line, that is all.
Yep, I am looking at some tutorials right now; time to get acquainted with Python :) I will come back once I have tested it. Thank you again for your help.
@netdev If you want to use my answer, you will have to read the entire file into a single string variable. Iterating line by line won't work at all, q.v. my updated answer.

As simple as:

@echo off
setlocal EnableDelayedExpansion

for /F %%a in ('copy /Z "%~F0" NUL') do (set CRLF=%%a^
%Do not remove this line%
)

(for %%n in ("!CRLF!") do for /F "delims=" %%a in (input.txt) do (
   set "line=%%a"
   for %%d in (Z K) do set "line=!line: O%%d-=%%~nO%%d-!"
   echo(!line!
)) > output.txt


If you think you might have additional file manipulation tasks in the future that would benefit from a general purpose regex text processing utility, then you might consider JREPL.BAT. It is pure script (JScript/batch) that runs on any Windows machine from XP onward - no 3rd party exe file required.

jrepl "((?:[^:]*:){6}.*?) (?=O[KZ]-)" "$1\r\n" /xseq /f "yourFile.txt" /o -

Assuming O[KZ]- does not appear anywhere other than the beginning of each logical line, then you should be able to get away with this simpler regex:

jrepl "\s+(?=O[KZ]-)" "\r\n" /xseq /f "yourFile.txt" /o -
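For readers without JREPL, roughly the same substitution can be sketched with Python's re.sub. This is an approximation of the command above, not a port of JREPL: split_on_prefix is a made-up name, the replacement is a plain "\n" rather than "\r\n", and it relies on the same assumption that O[KZ]- only appears at the start of a record.

```python
import re


def split_on_prefix(text):
    """Replace any whitespace run that precedes OK-/OZ- with a single newline."""
    return re.sub(r'\s+(?=O[KZ]-)', '\n', text)


line = ('OK-10:Mike:James:ID No:00000001:male:my notes '
        'OZ-09:John:Rick:ID No:00000002:male:my notes')
print(split_on_prefix(line))
```

Spaces inside a field, such as the one in "ID No", are left alone because the lookahead only succeeds when the whitespace is immediately followed by OK- or OZ-.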

Full documentation is built into JREPL, available via jrepl /? or jrepl /?? for paged help. A summary of all options is available via jrepl /?options, and a summary of all types of help is available via jrepl /?help.
