
I have a file with multiple lines, structured as shown below:

MSH|^~\&|Xatidok|V10.0.2.000|OSestra|x-tention|201203060855||ADT^A03|2914|P|2.3^AA&BB
EVN|A03|201203060855|201203060855|01|Fidani
PID|||00019380|2012049008^120005548^302830|PATIDOK-person^InRid^|Rudi|19111111|F|||Rose |A|Pens.
NK1||IRergrun^RROSlf^||Rose ^^Wels^^4600^A|07242123123|||||||||||||||||||||||||||||||
PV1||I|1212^G442^G442-||0|||||||||||2012049008|General|||||||||||||||||||12|||||201202060927|||||||

So basically there are rows of data separated by pipes (|), and I want to parse them with a bash script.

Briefly, this is the structure:

  • Segment > a row
  • Field > a cell between pipes: | field |
  • Component > each field may (or may not) contain several components separated by ^
  • Sub component > separated by &

The idea is to run the script as: ./script.sh filename command

The command should look like MSH.2.3.4, or shorter.

Meaning: access the segment which starts with MSH, field number 2, component number 3, sub component 4.

My parsing logic is this: I want to create an array which stores every row (segment) of the file:

#!/bin/bash

file_to_be_parsed=$1
command=$2
counter=0

# Read the file line by line into the array segments[],
# which holds every line (segment) of the file, indexed from 0 to X.
# -r keeps backslashes (such as the one in ^~\&) intact.

while IFS= read -r segment; do
     segments[counter]=$segment
     counter=$((counter+1))
done < "$file_to_be_parsed"

SECOND: My second step is to separate each array member one step further based on the delimiter, which I can do with:

IFS="|" read -r field <<< (here i can't figure out)

but I can't actually create a 2D array in bash, even though I searched a lot. With one, I could then access the specific fields ...

So can someone help me separate these array members further into fields ...

  • Bash can't do nested data structures. A general-purpose programming language like Python would be better for this. Commented Oct 12, 2019 at 18:34
  • @wjandrea Yes, Python has dedicated parsing libraries, but I have to do it in a bash script. It's mandatory. Commented Oct 12, 2019 at 18:36
  • Please edit your question to show the required output for your sample input. Good luck. Commented Oct 12, 2019 at 18:51
  • @Albion Hmm, so I think your best bet will be to avoid nested data structures, and just make an array for each selected field, i.e. for MSH.2.3.4, find the line that starts with MSH, then split it and select the second element, then split that, etc. Commented Oct 12, 2019 at 18:53
  • Also look at the Awk Tutorial and check out the -F option (use it as -F\|), then split further with split(string,targArr,"^"), where the last argument is the character to split by. Good luck. Commented Oct 12, 2019 at 18:53

2 Answers

4

This is a classic awk (standard Linux gawk) problem.

Here is a simple script that verifies its input arguments and parses only the required field, component and subComponent, using awk's built-in split function.

Feel free to simplify the script's output layout.

As for the script's arguments, all are mandatory (some might be ignored), and the input.txt file must come last.

input.txt

MSH|^~\&|Xatidok|V10.0.2.000|OSestra|x-tention|201203060855||ADT^A03|2914|P|2.3^AA&BB
EVN|A03|201203060855|201203060855|01|Fidani
PID|||00019380|2012049008^120005548^302830|PATIDOK-person^InRid^|Rudi|19111111|F|||Rose |A|Pens.
NK1||IRergrun^RROSlf^||Rose ^^Wels^^4600^A|07242123123|||||||||||||||||||||||||||||||
PV1||I|1212^G442^G442-||0|||||||||||2012049008|General|||||||||||||||||||12|||||201202060927|||||||

script.awk

BEGIN {FS="|"; componentSeperator="^"; subComponentSeperator="&"}
function readArgs() {
     if (passedReadArgs == 1) return;
     if (length(field) == 0) {print "Missing field string argument, exiting."; exit;}
     if (length(fieldNumber) == 0) {print "Missing fieldNumber number argument, exiting."; exit;}
     if (length(componentNumber) == 0) {print "Missing componentNumber number argument, exiting."; exit;}
     if (length(subComponentNumber) == 0) {print "Missing subComponentNumber number argument, exiting."; exit;}
     passedReadArgs = 1;
}
{
     readArgs();
     if ($1 != field) next;   # match the segment name in the first field only

     print "Arguments: "field, fieldNumber, componentNumber, subComponentNumber;

     print "field["fieldNumber"] = "$fieldNumber;

     split($fieldNumber, componentsArr, componentSeperator);
     if (length(componentsArr[componentNumber]) > 0) {
          print "component["componentNumber"] = "componentsArr[componentNumber];
          split(componentsArr[componentNumber], subComponentsArr, subComponentSeperator);
          if (length(subComponentsArr[subComponentNumber]) > 0) print "subComponent["subComponentNumber"] = "subComponentsArr[subComponentNumber];
     }
}

Running script.awk:

awk -f script.awk field="MSH" fieldNumber=12 componentNumber=2 subComponentNumber=2 input.txt

Output (the second and third blocks come from separate runs with the arguments shown on their "Arguments:" lines):

Arguments: MSH 12 2 2
field[12] = 2.3^AA&BB
component[2] = AA&BB
subComponent[2] = BB

Arguments: NK1 5 3 2
field[5] = Rose ^^Wels^^4600^A
component[3] = Wels


Arguments: PID 7 3 2
field[7] = Rudi
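
To drive the awk script with the dotted command from the question (for example MSH.12.2.2), a small bash helper could translate the command into the four variable assignments. This is only a sketch: the function name build_awk_args is hypothetical, and defaulting missing trailing parts to 1 is an assumption made here because script.awk requires all four values.

```shell
#!/bin/bash
# Hypothetical helper: turn "MSH.12.2.2" into the variable assignments
# that script.awk expects. Missing trailing parts default to 1 (an assumption).
build_awk_args() {
    local seg fieldNo compNo subNo
    IFS='.' read -r seg fieldNo compNo subNo <<< "$1"
    printf 'field=%s fieldNumber=%s componentNumber=%s subComponentNumber=%s\n' \
        "$seg" "${fieldNo:-1}" "${compNo:-1}" "${subNo:-1}"
}

# Example: print the assignments for one command.
build_awk_args "MSH.12.2.2"
```

Inside a wrapper script the result would be passed to awk unquoted, so that it splits into four separate arguments, roughly like: awk -f script.awk $(build_awk_args "$2") "$1"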

2

For a pure bash solution, you can use bash arrays to split the line into fields, components and sub components. Provided that you do not have to run the code on large data sets, this should be OK.

Consider switching to a more powerful engine (awk, Python, Perl) for large problems.

#!/bin/bash
file=$1
command=$2
# Split the command into keys, so that the items are k[0], k[1], ...
IFS="." read -r -a k <<<"$command"

# Look for the line whose first field matches k[0].
# -r keeps backslashes (such as the one in ^~\&) intact.
while IFS='|' read -r -a fa ; do
  # Skip to the next row if there is no match.
  [ "${fa[0]}" = "${k[0]}" ] || continue
  # Field (numbering starts at 1, so field 1 is the segment name itself)
  v=${fa[${k[1]}-1]}
  # Component
  if [ "${#k[@]}" -gt 2 ] ; then
      IFS="^" read -r -a fb <<<"$v"
      v=${fb[${k[2]}-1]}
  fi
  # Sub component
  if [ "${#k[@]}" -gt 3 ] ; then
      IFS="&" read -r -a fc <<<"$v"
      v=${fc[${k[3]}-1]}
  fi
  echo "V=$v"
  # Stop after the first matching segment.
  break
done <"$file"
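
If many lookups are needed against the same file, the 2D array the question asks for can be approximated with one flattened associative array, keyed by the same dotted path as the command. The following is a rough sketch under a few assumptions: it needs bash 4 or newer for declare -A, the names hl7 and load_segments are made up for illustration, and a segment name that occurs more than once simply overwrites the earlier entry.

```shell
#!/bin/bash
# Sketch: flatten every value into one associative array (needs bash 4+).
# Keys look like "MSH.12", "MSH.12.2", "MSH.12.2.2"; numbering starts at 1,
# so "MSH.1" is the segment name itself, as in the loop above.
declare -A hl7

# Hypothetical helper: reads segments from stdin and fills hl7.
load_segments() {
    local line seg i j k
    local -a fields comps subs
    while IFS= read -r line || [ -n "$line" ]; do
        IFS='|' read -r -a fields <<< "$line"
        seg=${fields[0]}
        for i in "${!fields[@]}"; do
            hl7["$seg.$((i+1))"]=${fields[i]}
            IFS='^' read -r -a comps <<< "${fields[i]}"
            for j in "${!comps[@]}"; do
                hl7["$seg.$((i+1)).$((j+1))"]=${comps[j]}
                IFS='&' read -r -a subs <<< "${comps[j]}"
                for k in "${!subs[@]}"; do
                    hl7["$seg.$((i+1)).$((j+1)).$((k+1))"]=${subs[k]}
                done
            done
        done
    done
}

# Two shortened sample segments from the question.
load_segments <<'EOF'
MSH|^~\&|Xatidok|V10.0.2.000|OSestra|x-tention|201203060855||ADT^A03|2914|P|2.3^AA&BB
NK1||IRergrun^RROSlf^||Rose ^^Wels^^4600^A|07242123123
EOF

echo "${hl7[MSH.12.2.2]}"   # prints BB
echo "${hl7[NK1.5.3]}"      # prints Wels
```

With a real file the call would be load_segments < "$file", after which the command string can be used directly as the key: echo "${hl7[$command]}". The trade-off is that the whole file is split up front, which is slower than the loop above when only a single value is wanted.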

4 Comments

Yes, this works except for the sub component: as far as I tested, it fails to split on the last rule. But in general it's amazing.
Can we use two conditions in IFS, for example splitting lines with IFS="|" and IFS= , but in the same while?
@AlbionShala, minor fix to sub component logic entered.
@AlbionShala, IFS can be multiple characters.
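
To illustrate the last comment: when IFS holds several characters, read splits on every one of them. A minimal sketch, using a shortened made-up line in the same format (the variable name parts is arbitrary):

```shell
#!/bin/bash
# Split one line on '|', '^' and '&' at the same time.
line='MSH|ADT^A03|2.3^AA&BB'
IFS='|^&' read -r -a parts <<< "$line"

# Print one element per line: MSH, ADT, A03, 2.3, AA, BB
printf '%s\n' "${parts[@]}"
```

Note that this flattens all three levels into one list, so it no longer shows which delimiter separated two values. For addressing a value as field.component.sub component, the nested reads from the answer are still needed.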
