
I have a bunch of rows, and each field in a row has a header label identifying that field. The file is currently just a CSV. The first few fields line up when pasted into Excel, but the rest of each row becomes misaligned because some rows are missing fields or have them out of order. I am trying to make each field line up with the correct column header when the data is copied into Excel and split with the "Text to Columns" tool. I'm sure this will mean padding places in the rows with the corresponding number of commas, so that enough blank cells are present to align each data field with its correct column.

Input:
id1,id2,id3,id4,id5,id6,id7,id8
id1 field1,id2 field2,id3 field3,id8 field8,id5 field5,id6 field6,id7 field7,id4 field4
id1 field1,id6 field6,id3 field3,id4 field4,id5 field5,id2 field2,id8 field8
id1 field1,id4 field4,id7 field7,id6 field6,id5 field5,id8 field8
id1 field1,id2 field2,id3 field3,id4 field4,id5 field5,id6 field6,id7 field7,id8 field8
id1 field1,id4 field4,id2 field2,id5 field5,id6 field6,id8 field8
id1 field1,id2 field2,id8 field8,id4 field4,id5 field5,id6 field6,id7 field7,id3 field3

Output:
id1,id2,id3,id4,id5,id6,id7,id8
id1 field1,id2 field2,id3 field3,id4 field4,id5 field5,id6 field6,id7 field7,id8 field8
id1 field1,id2 field2,id3 field3,id4 field4,id5 field5,id6 field6,,id8 field8
id1 field1,,,id4 field4,id5 field5,id6 field6,id7 field7,id8 field8
id1 field1,id2 field2,id3 field3,id4 field4,id5 field5,id6 field6,id7 field7,id8 field8
id1 field1,id2 field2,,id4 field4,id5 field5,id6 field6,,id8 field8
id1 field1,id2 field2,id3 field3,id4 field4,id5 field5,,id7 field7,id8 field8

Basically I'm trying to reorder the fields in each row based on the header row, then pad with extra commas wherever a field that should exist is missing from that particular row. Each field has a label preceding the actual data, which corresponds to the header that field should be under.

I can't find anything on Google, and I'm not sure how to do this. Sorry I can't be any more specific.


New data set run with awk:

Input: 
id1,id2,id3,id4
id1.100 "field1",id2.100 "field2",id3.100 "field3",id4.100 "field4"
id1.101 "field1",id2.101 "field2",id3.101 "field3",id4.101 "field4"
id1.102 "field1",id2.102 "field2",id3.102 "field3",id4.102 "field4"
id1.103 "field1",id2.103 "field2",id3.103 "field3",id4.103 "field4"

output:
id1,id2,id3,id4
,,,
,,,
,,,
,,,

I'm not sure why it's doing this. The new data set does have "/", ":" and "(" characters inside the quotes in each field. The number after the "." in the id part changes between each data set that I push through this script.

I did just try this:

Input: 
id1.100,id2.100,id3.100,id4.100
id1.100 "field1",id2.100 "field2",id3.100 "field3",id4.100 "field4"
id1.101 "field1",id2.101 "field2",id3.101 "field3",id4.101 "field4"
id1.102 "field1",id2.102 "field2",id3.102 "field3",id4.102 "field4"
id1.103 "field1",id2.103 "field2",id3.103 "field3",id4.103 "field4"

output:
id1,id2,id3,id4
id1.100 "field1",id2.100 "field2",id3.100 "field3",id4.100 "field4"
,,,
,,,
,,,

So is there a way to identify the id field by its beginning only? For example, if the id field were Name.105, to identify it by just the "Name" string?
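(For illustration only, with made-up sample values: one way to isolate the part before the first dot is bash's `%%` prefix-stripping expansion, or awk's `sub()` on the first field.)

```shell
# Strip everything from the first "." onward (sample values are hypothetical)
id='Name.105'
echo "${id%%.*}"      # bash: remove the longest ".*" suffix -> Name

echo 'id1.100 "field1"' | awk '{ sub(/\..*/, "", $1); print $1 }'   # -> id1
```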


Repeating fields in data set:

Input: 
id1.100,id2.100,id3.100,id4.100
id1.100 "field1",id2.100 "field2",id3.100 "field3",id3.100 "field3",id2.100 "field2"
id1.101 "field1",id2.101 "field2",id2.101 "field2",id3.101 "field3",id3.101 "field3"
id1.102 "field1",id2.102 "field2",id3.102 "field3",id4.103 "field4",id1.102 "field1"

Output:
id1.100,id2.100,id3.100,id4.100
id1.100 "field1",id2.100 "field2",id3.100 "field3",
id1.101 "field1",id2.101 "field2",id3.101 "field3",
id1.102 "field1",id2.102 "field2",id3.102 "field3",id4.103 "field4"
  • I think you'll probably end up using Awk (or perhaps Perl or Python). You seem to want the fields reordered between input and output on each line, and empty fields created when there is no input data for a given field ID in a line. And the list of valid field IDs is given on line 1 of the input file. What should happen if an invalid ID is found, or there are two or more occurrences of a single ID on a line? Commented Jan 27, 2020 at 22:29
  • All field IDs are correct, I do not input them manually. And the field IDs only ever appear once per line. I don't know too much about awk, I use sed more. I'll definitely read into awk more to see if I can think of something. Commented Jan 27, 2020 at 23:38

1 Answer


Assuming:

  • The ids are arbitrary strings, so general sorting (such as dictionary order) will not work.
  • The id and the field value do not contain whitespace and are separated by a single space.

Then how about:

declare -A id2val               # associative array to store id and fields
while IFS=, read -ra f; do
    if ((nr++ == 0)); then      # header line
        ids=("${f[@]}")         # assign ids in order
        (IFS=,; echo "${ids[*]}")
                                # print the header
    else
        id2val=()               # empty the associative array
        for ((i=0; i<${#f[@]}; i++)); do
                                # process each field of the input line
            id=${f[i]% *}       # extract the substring before space
            val=${f[i]#* }      # extract the substring after space
            id2val[$id]="$val"  # associate field value with the id
        done
        for ((i=0; i<${#ids[@]}; i++)); do
                                # process in the "id" order
            id=${ids[i]}        # retrieve the id
            if [[ -n ${id2val[$id]} ]]; then
                                # found the field associated to the id
                list[i]="$id ${id2val[$id]}"
                                # then format the csv field as output
            else
                list[i]=""
            fi
        done
        (IFS=,; echo "${list[*]}")
                                # print the record as csv
    fi
done < input.csv
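(As an aside: the splitting trick above relies on bash's suffix/prefix removal expansions. A minimal stand-alone illustration, with a made-up field value:)

```shell
# "%" removes the shortest matching suffix, "#" the shortest matching prefix
field='id3 field3'
echo "${field% *}"   # strip " ..." from the end   -> id3
echo "${field#* }"   # strip "... " from the start -> field3
```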

Output:

id1,id2,id3,id4,id5,id6,id7,id8
id1 field1,id2 field2,id3 field3,id4 field4,id5 field5,id6 field6,id7 field7,id8 field8
id1 field1,id2 field2,id3 field3,id4 field4,id5 field5,id6 field6,,id8 field8
id1 field1,,,id4 field4,id5 field5,id6 field6,id7 field7,id8 field8
id1 field1,id2 field2,id3 field3,id4 field4,id5 field5,id6 field6,id7 field7,id8 field8
id1 field1,id2 field2,,id4 field4,id5 field5,id6 field6,,id8 field8
id1 field1,id2 field2,id3 field3,id4 field4,id5 field5,id6 field6,id7 field7,id8 field8

When processing each record, the script first splits each column on whitespace into an id and a field value, then stores them in an associative array. Next it loops over the ids in the header-defined order; if a field value associated with the id is found, it fills that field in the output.

Without the associative array, we would need a double loop to match the ids, which would be inefficient.

If awk is your option, you can also say:

awk 'BEGIN {FS=OFS=","}
    NR==1 {
        for (i=1; i<=NF; i++) ids[i] = $i
        nf = NF
        print
        next
    }
    {
        delete id2val
        for (i=1; i<=NF; i++) {
            split($i, a, " ")
            id2val[a[1]] = a[2]
        }
        for (i=1; i<=nf; i++) {
            id = ids[i]
            $i = (id2val[id] != "") ? id " " id2val[id] : ""
        }
        print
    }
' input.csv

which will be more efficient than the bash solution.

[UPDATE]
Modified to handle the new data set provided by the OP.

The new data fails because the original awk script expects the IDs in the header line and the IDs to the left of the whitespace in each field to be in exactly the same format; the dot has no special meaning to it.
Would you please try the following instead:

awk 'BEGIN {FS=OFS=","}
    NR==1 {
        print
        for (i=1; i<=NF; i++) {
            sub("\\..*", "", $i) # remove the suffix, if any
            ids[i] = $i
        }
        nf = NF
        next
    }
    {
        delete id2val
        for (i=1; i<=NF; i++) {
            split($i, a, " ")
            split(a[1], b, ".") # splits the id on "." if any
            id2id[b[1]] = a[1]  # maps "id1" to "id1.100" e.g.
            id2val[b[1]] = a[2] # maps "id1" to "field1" e.g.
        }
        for (i=1; i<=nf; i++) {
            id = ids[i]
            $i = (id2val[id] != "") ? id2id[id] " " id2val[id] : ""
        }
        NF = nf                 # adjust the NF to "print" properly
        print
    }
' input.csv

I have modified the awk script to split the IDs on the dot and introduced an array id2id to retrieve the original ID string (including the dot and digits).
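(To see the two lookups in isolation on a single sample field, here is a hedged, stand-alone sketch of the same split logic:)

```shell
# Split one field into short id, original id, and value (sample data is made up)
echo 'id1.102 "field1"' | awk '{
    split($0, a, " ")      # a[1] = "id1.102", a[2] = "\"field1\""
    split(a[1], b, ".")    # b[1] = "id1" (the header-style short id)
    print b[1], "->", a[1], a[2]
}'
```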

The updated script remains compatible with the original data set, whose IDs do not include dots, and will work regardless of characters such as "/", ":" or "(" in the fields.


Comments

Quick question: does the awk script work if the id field has a small random number attached to it? For example: id1.22 "field1",id2.58 "field2",id3.92 "field3". The script runs fine for my original application; I'm just trying to reuse it for another document of very similar format. I'm not getting any errors with the new doc, it's just not putting things in order or padding commas.
@ehammer It should work, because the awk script above treats only a comma and a whitespace specially (as field separators); other characters, including a dot and digits, are part of a plain string. If you still face a problem or have a question, please feel free to ask. BR.
It's very strange. I copy-pasted the awk command directly from my original script (it still works perfectly), but when I run it on the new set of data, it prints the header, then the first line next to it (all on one line), then for every missing line it just pads with commas. Do you have any recommendations for troubleshooting?
Is it possible to update your question with your new data? The whole data set is not necessary; a minimal set of data that reproduces the problem is preferred. BTW, my next reply may be tomorrow as it's late in my TZ. BR.
I will try to give a good replication of my data below my original question above.
