I have a file like the one below and I want to split it into multiple files based on a pattern. Each block holds the information for one job (Job Number =), and the first line carries its parent information, like this: %sj HOSTNAME#PARENT_UNIQUE_ID_xxxxxx.JOB_NAME

I want to extract the lines from one %sj HOSTNAME#PARENT_UNIQUE_ID_xxxxxx.JOB_NAME line up to the next one, including the %sj line itself.

Here is what I'm doing. It splits the file as needed, producing files like these:

HOSTNAME#PARENT_UNIQUE_ID_000001.JOB_NAME_jobProperties.txt
HOSTNAME#PARENT_UNIQUE_ID_000002.JOB_NAME_jobProperties.txt

Code:

while IFS= read -r line; do
        if [[ $line =~ ^%sj ]]; then
                # Everything after the first space is the parent object name.
                job_prop_objct_name=${line#* }
                printf '%s\n' "$line" > "${job_prop_objct_name}_jobProperties.txt"
        else
                printf '%s\n' "$line" >> "${job_prop_objct_name}_jobProperties.txt"
        fi
done < "$1"

The problem is that the text file sometimes has multiple jobs (Job Number =) under one parent line, for example the last two blocks in the sample below, and my code combines these into one file.

What I would like is to split these blocks into different files as well, maybe by adding the job number to the file name.

Text File

%sj HOSTNAME#PARENT_UNIQUE_ID_000001.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12345
Time Information
Maximum Duration =
Extra Information
-
%sj HOSTNAME#PARENT_UNIQUE_ID_000002.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12346
Time Information
Maximum Duration =
Extra Information
-
%sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12347
Time Information
Maximum Duration =
Extra Information
-
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12348
Time Information
Maximum Duration =
Extra Information
-

The resulting files currently look like this:

HOSTNAME#PARENT_UNIQUE_ID_000001.JOB_NAME.txt

%sj HOSTNAME#PARENT_UNIQUE_ID_000001.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12345
Time Information
Maximum Duration =
Extra Information
-

HOSTNAME#PARENT_UNIQUE_ID_000002.JOB_NAME.txt

%sj HOSTNAME#PARENT_UNIQUE_ID_000002.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12346
Time Information
Maximum Duration =
Extra Information
-

HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME.txt

%sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12347
Time Information
Maximum Duration =
Extra Information
-
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12348
Time Information
Maximum Duration =
Extra Information
-

I want the file HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME.txt to be split into multiple files depending on the job numbers it contains. For this example:

HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME_12347.txt

%sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12347
Time Information
Maximum Duration =
Extra Information
-

HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME_12348.txt

%sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME
General Information
Job = JOB_NAME
Workstation = HOSTNAME
Scheduled Time = 01/06/2018 06:00 TZ CST
Runtime Information
Status = Successful
Job Number = 12348
Time Information
Maximum Duration =
Extra Information
-

UPDATE: A workaround, although not a complete solution.
This is the closest I could get as a workaround, with a caveat, and I'm sure it is the ugly way.

split_JobPropsFile () {
        counter=1
        while IFS= read -r line; do
                if [[ $line =~ ^%sj ]]; then
                        # Everything after the first space is the parent object name.
                        job_prop_objct_name=${line#* }
                        printf '%s\n' "$line" > "${job_prop_objct_name}_${counter}_jobProperties.txt"
                else
                        printf '%s\n' "$line" >> "${job_prop_objct_name}_${counter}_jobProperties.txt"
                        if [[ $line =~ "-" ]]; then
                                # End of block: start the next file with the parent line.
                                ((counter++))
                                printf '%s\n' "%sj $job_prop_objct_name" >> "${job_prop_objct_name}_${counter}_jobProperties.txt"
                        fi
                fi
        done < "$1"
}

The above code does what I expect, except that it creates one extra file at the end of the loop containing just the "%sj" line.

Of course, this is probably not an intelligent way to do it. It is also slow when the input file is large, and there may be other issues I'm not aware of, such as too many open files.
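
For reference, a minimal sketch of a shell-only variant that should never create the extra file: it buffers each block and writes it only once the closing "-" line has been seen, naming the file after the job number. It assumes every block ends with a line that is exactly "-" and that the "Job Number = " line has no trailing spaces. The function name split_job_props_file is made up for illustration.

```shell
# Hypothetical function name; $1 is the input file, output goes to the current directory.
# The body is a subshell ( ... ) so the helper variables do not leak to the caller.
split_job_props_file() (
    nl='
'
    header='' parent='' block='' job=''
    while IFS= read -r line; do
        case $line in
            '%sj '*)                        # parent line: remember it, write nothing yet
                header=$line
                parent=${line#* }           # text after the first space
                block='' job=''
                continue ;;
            'Job Number = '*)
                job=${line##* } ;;          # last space-separated word
        esac
        block=$block$line$nl                # buffer the current block
        if [ "$line" = '-' ]; then          # assumption: a block ends with a line that is exactly "-"
            if [ -n "$job" ]; then
                printf '%s\n%s' "$header" "$block" > "${parent}_${job}_jobProperties.txt"
            fi
            block='' job=''
        fi
    done < "$1"
)
```

Called as split_job_props_file MyTextFile, this opens one output file per block and closes it right away, so open file handles do not pile up. It still reads line by line in the shell, so it is unlikely to be fast on large inputs.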

Can this be done with awk, in a way that also avoids the extra file this workaround creates?

  • post the final content of the resulting files Commented Jan 6, 2018 at 18:42
  • @RomanPerekhrest , posted the current and desired results. Thank You ! Commented Jan 6, 2018 at 18:55
  • A bit too early Sunday to really grok this wall of text, but I don't immediately see anything csplit couldn't handle with a suitable (though perhaps not entirely trivial) regular expression. Commented Jan 7, 2018 at 7:52

1 Answer

I think you're looking for:

awk '/^%sj/   { prefix  = $2; content = "" } 
              { content = content "\n" $0        }
     /^Job N/ { close(fname); fname = prefix "_" $4 ".txt"   }
     /^-/     { print substr(content,2) > fname }
    ' MyTextFile
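
If each output file should hold only its own block plus the parent %sj line, one possible variant is sketched below. It resets the buffer after every "-" line instead of only at the next %sj line. It assumes every block ends with a line that is exactly "-" and that each parent/job-number pair occurs only once in the input (a repeat would overwrite the earlier file). The wrapper name split_by_job_number is made up for illustration.

```shell
# Hypothetical wrapper name; $1 is the input file, output goes to the current directory.
split_by_job_number() {
    awk '
        # Parent line: remember it, start an empty buffer, write nothing yet.
        /^%sj / { header = $0; prefix = $2; block = ""; fname = ""; next }

        # The job number (4th field) decides the output file name.
        /^Job Number = / { fname = prefix "_" $4 ".txt" }

        # Buffer every non-header line of the current block.
        { block = block $0 "\n" }

        # End of block (assumption: a line that is exactly "-").
        /^-$/ {
            if (fname != "") {
                printf "%s\n%s", header, block > fname
                close(fname)    # keeps the number of open files at one
            }
            block = ""; fname = ""
        }
    ' "$1"
}
```

Run it as split_by_job_number MyTextFile. For the sample in the question this should produce one file per job number, such as HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME_12347.txt and HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME_12348.txt, each starting with the shared %sj line.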
8 Comments

Two issues with this: 1) The files should have their parent's reference as the first line, for example: %sj HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME. I have posted the sample desired results. 2) One of the files, 12348_*.txt, has two blocks.
My bad; reading/typing on my phone. 1) drop the ";next" 2) change ">" to ">>", and ensure all such prefix files are initially rm-f'd .
everything looks good except the 12348_HOSTNAME_*.txt still has the additional block of job 12347
So: you expect to see a file like "HOSTNAME#PARENT_UNIQUE_ID_000003.JOB_NAME_12348.txt" which contains exactly one %sj section, for just that one job number, and the input file contains only one occurrence of Job Number = 12348?
Add close(fname) before fname = to avoid too many open files problem in non-gawk awks.