Unix: parse a text file and split it into multiple files based on a pattern

Question

I have a file like the one below and I want to split it into multiple files based on a pattern. Each block holds the information for one job (Job Number =), and the first line carries its parent information, like this: %sj HOSTNAME#PARENT_UNIQUE_ID_xxxxxx.JOB_NAME

I want to extract the lines between consecutive %sj HOSTNAME#PARENT_UNIQUE_ID_xxxxxx.JOB_NAME lines, including the %sj HOSTNAME#PARENT_UNIQUE_ID_xxxxxx.JOB_NAME line itself.

Here is what I'm doing. It splits the file as needed, like below:

HOSTNAME#PARENT_UNIQUE_ID_000001.JOB_NAME_jobProperties.txt
HOSTNAME#PARENT_UNIQUE_ID_000002.JOB_NAME_jobProperties.txt

code

while IFS= read -r line; do
        if [[ $line =~ "%sj" ]]; then
                # Header line: everything after the first space is the object name.
                job_prop_objct_name=$(echo "$line" | grep -o -P '(?<= ).*')
                echo "$line" > "${job_prop_objct_name}_jobProperties.txt"
        else
                echo "$line" >> "${job_prop_objct_name}_jobProperties.txt"
        fi
done < "$1"

But the problem is that the text file sometimes has multiple jobs (Job Number =) under one parent line, for example the last two blocks in the text sample below, and my code combines these into one file.

What I would like is to split these blocks into different files as well, maybe by adding the job number to the file name.

Text File

%sj HOSTNAME#PARENT_UNIQUE_ID_000001.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12345
Time Information
Maximum Duration =
Extra Information
-
%sj HOSTNAME#PARENT_UNIQUE_ID_000002.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12346
Time Information
Maximum Duration =
Extra Information
-
%sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12347
Time Information
Maximum Duration =
Extra Information
-
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12348
Time Information
Maximum Duration =
Extra Information
-

The resulting files currently look like this:

HOSTNAME#PARENT_UNIQUE_ID_000001.JOB_NAME.txt

%sj HOSTNAME#PARENT_UNIQUE_ID_000001.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12345
Time Information
Maximum Duration =
Extra Information
-

HOSTNAME#PARENT_UNIQUE_ID_000002.JOB_NAME.txt

%sj HOSTNAME#PARENT_UNIQUE_ID_000002.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12346
Time Information
Maximum Duration =
Extra Information
-

HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME.txt

%sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12347
Time Information
Maximum Duration =
Extra Information
-
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12348
Time Information
Maximum Duration =
Extra Information
-

I want the file HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME.txt to be split into multiple files, one per job number it contains. In this example:

HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME_12347.txt

%sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12347
Time Information
Maximum Duration =
Extra Information
-

HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME_12348.txt

%sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12348
Time Information
Maximum Duration =
Extra Information
-

UPDATE: A workaround, although not a complete solution.
This is the closest I could get, with a caveat, and I'm sure it is the ugly way.

split_JobPropsFile () {
        counter=1
        while IFS= read -r line; do
                if [[ $line =~ "%sj" ]]; then
                        job_prop_objct_name=$(echo "$line" | grep -o -P '(?<= ).*')
                        echo "$line" > "${job_prop_objct_name}_${counter}_jobProperties.txt"
                else
                        echo "$line" >> "${job_prop_objct_name}_${counter}_jobProperties.txt"
                        # A "-" line ends the block: bump the counter and start the next file with the header.
                        if [[ $line =~ "-" ]]; then
                                ((counter++))
                                echo "%sj $job_prop_objct_name" >> "${job_prop_objct_name}_${counter}_jobProperties.txt"
                        fi
                fi
        done < "$1"
}

The above code does what I expect, except that it creates one extra file at the end of the loop containing just the "%sj" line.

Of course, it is probably not an intelligent way to achieve this. It is also time consuming when my input file is large, and there may be other issues I'm not aware of, such as open files.

Can this be done using awk, in a way that also avoids the extra file this workaround creates?
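
For reference, the extra file in the workaround appears because the next file's "%sj" header is written as soon as a "-" line is seen, before it is known whether another block follows. A minimal shell sketch that avoids this is to buffer each block and write it only when its closing "-" line arrives. The function name split_job_props is hypothetical, and the sketch assumes that every block ends with a line that is exactly "-", that every block contains a "Job Number = N" line, and that the output names follow the desired names shown above.

```shell
# Sketch only: split_job_props is a hypothetical name, not from the original post.
# Assumes each block ends with a line that is exactly "-" and contains
# a "Job Number = N" line. Runs in bash or any POSIX shell.
split_job_props() {
    header= name= block= jobnum=
    while IFS= read -r line; do
        case $line in
            "%sj "*)
                # Parent line: remember it so every job block can start with it.
                header=$line
                name=${line#"%sj "}
                block=
                continue ;;
            "Job Number = "*)
                jobnum=${line#"Job Number = "} ;;
        esac
        # Buffer the current line (with its newline) instead of writing it at once.
        block="${block}${line}
"
        if [ "$line" = "-" ]; then
            # Block is complete: write the header plus the buffered block in one go.
            printf '%s\n%s' "$header" "$block" > "${name}_${jobnum}.txt"
            block=
        fi
    done < "$1"
}
```

Called as split_job_props input.txt, this writes one file per job number and never starts a file for a block that does not exist. It is still a line-by-line shell loop, so it will likely remain slow on large inputs.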

@RomanPerekhrest, I have posted the current and desired results. Thank you!
– Kevin, Commented Jan 6, 2018 at 18:55
A bit too early Sunday to really grok this wall of text, but I don't immediately see anything csplit couldn't handle with a suitable (though perhaps not entirely trivial) regular expression.
– tripleee, Commented Jan 7, 2018 at 7:52

Mischa · Accepted Answer · 2018-01-08 18:55:13Z

1

I think you're looking for:

awk '/^%sj/   { prefix  = $2; content = "" } 
              { content = content "\n" $0        }
     /^Job N/ { close(fname); fname = prefix "_" $4 ".txt"   }
     /^-/     { print substr(content,2) > fname }
    ' MyTextFile
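
With the question's sample input, the command above keeps accumulating content after a "-" line, so a second job under the same "%sj" parent would also carry the previous job's block. A possible variant, offered as a sketch rather than as the answer's own method, resets the buffer to just the parent line after each block is written. The wrapper name split_by_job is hypothetical, and the sketch assumes the input starts with a "%sj" line and that job numbers are unique under each parent (a repeated number would overwrite the earlier file).

```shell
# Sketch only: split_by_job is a hypothetical wrapper name.
# Assumes the input starts with a "%sj" line and that each block
# ends with a line beginning with "-".
split_by_job() {
    awk '
    # Parent line: remember it and start a fresh buffer with it.
    /^%sj/   { header = $0; prefix = $2; content = header; next }
    # Every other line is appended to the buffer.
             { content = content "\n" $0 }
    # The job number (4th field) completes the output file name.
    /^Job N/ { fname = prefix "_" $4 ".txt" }
    # End of block: write it, close the file, keep only the parent line.
    /^-/     { print content > fname; close(fname); content = header }
    ' "$1"
}
```

Called as split_by_job MyTextFile, this should produce one file per job number, each beginning with its parent "%sj" line. Calling close(fname) after each block keeps the number of open files at one, in line with the advice about non-gawk awks.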

edited Jan 8, 2018 at 18:55

answered Jan 6, 2018 at 20:48

8 Comments

Kevin Over a year ago

Two issues with this: 1) Each file should have its parent's reference as the first line, for example %sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME. I have posted the sample desired results. 2) One of the files, 12348_*.txt, has two blocks.

Mischa Over a year ago

My bad; reading/typing on my phone. 1) Drop the ";next". 2) Change ">" to ">>", and ensure all such prefix files are initially rm -f'd.

Kevin Over a year ago

Everything looks good, except that 12348_HOSTNAME_*.txt still has the additional block of job 12347.

Mischa Over a year ago

So: you expect to see a file like "HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME_12348.txt" which contains exactly one %sj section, for just that one job number, and the input file only contains one occurrence of Job Number = 12348?

Ed Morton Over a year ago

Add close(fname) before fname = to avoid too many open files problem in non-gawk awks.
