I have a CSV file in which every field carries a label identifying which column it belongs to. The first few fields line up when the file is opened in Excel, but the rest of each row becomes misaligned because some rows are missing fields and others have them out of order. I want each field to land under the correct column header when the data is pasted into Excel and split with the "Text to Columns" tool. I assume this means padding each row with the right number of commas, so that enough blank cells exist to align each field with its column.
Input:
id1,id2,id3,id4,id5,id6,id7,id8
id1 field1,id2 field2,id3 field3,id8 field8,id5 field5,id6 field6,id7 field7,id4 field4
id1 field1,id6 field6,id3 field3,id4 field4,id5 field5,id2 field2,id8 field8
id1 field1,id4 field4,id7 field7,id6 field6,id5 field5,id8 field8
id1 field1,id2 field2,id3 field3,id4 field4,id5 field5,id6 field6,id7 field7,id8 field8
id1 field1,id4 field4,id2 field2,id5 field5,id6 field6,id8 field8
id1 field1,id2 field2,id8 field8,id4 field4,id5 field5,id6 field6,id7 field7,id3 field3
Output:
id1,id2,id3,id4,id5,id6,id7,id8
id1 field1,id2 field2,id3 field3,id4 field4,id5 field5,id6 field6,id7 field7,id8 field8
id1 field1,id2 field2,id3 field3,id4 field4,id5 field5,id6 field6,,id8 field8
id1 field1,,,id4 field4,id5 field5,id6 field6,id7 field7,id8 field8
id1 field1,id2 field2,id3 field3,id4 field4,id5 field5,id6 field6,id7 field7,id8 field8
id1 field1,id2 field2,,id4 field4,id5 field5,id6 field6,,id8 field8
id1 field1,id2 field2,id3 field3,id4 field4,id5 field5,,id7 field7,id8 field8
Basically, I'm trying to reorder the fields in each row to match the header row, then pad with extra commas wherever a field that should exist is missing from that particular row. Each field has a label preceding the actual data, and that label corresponds to the header the field should be under.
I can't find anything on Google, and I'm not sure how to do this. Sorry, I can't be any more specific.
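For reference, the reorder-and-pad step described above can be sketched in Python. This is only a rough illustration, not a tested solution: the function name align_rows is made up, and it assumes the label is everything before the first space and that no field contains a comma.

```python
def align_rows(lines):
    """Reorder each row to follow the header; leave blanks for missing fields."""
    header = lines[0].split(",")
    aligned = [lines[0]]
    for line in lines[1:]:
        by_label = {}
        for field in line.split(","):
            # Assumption: the label is the text before the first space.
            label = field.split(" ", 1)[0]
            by_label[label] = field
        # Emit fields in header order; a missing label becomes an empty cell.
        aligned.append(",".join(by_label.get(label, "") for label in header))
    return aligned


sample = [
    "id1,id2,id3,id4,id5,id6,id7,id8",
    "id1 field1,id6 field6,id3 field3,id4 field4,id5 field5,id2 field2,id8 field8",
    "id1 field1,id4 field4,id7 field7,id6 field6,id5 field5,id8 field8",
]
for row in align_rows(sample):
    print(row)
```

With the two sample rows taken from the input above, this should print the header followed by the rows padded as in the desired output.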
New Data set run with awk:
Input:
id1,id2,id3,id4
id1.100 "field1",id2.100 "field2",id3.100 "field3",id4.100 "field4"
id1.101 "field1",id2.101 "field2",id3.101 "field3",id4.101 "field4"
id1.102 "field1",id2.102 "field2",id3.102 "field3",id4.102 "field4"
id1.103 "field1",id2.103 "field2",id3.103 "field3",id4.103 "field4"
Output:
id1,id2,id3,id4
,,,
,,,
,,,
,,,
Not sure why it's doing this. The new data set does have "/", ":" and "(" characters inside the quotes in each field. The number after the "." in the id part changes between each data set that I push through this script.
I did just try this:
Input:
id1.100,id2.100,id3.100,id4.100
id1.100 "field1",id2.100 "field2",id3.100 "field3",id4.100 "field4"
id1.101 "field1",id2.101 "field2",id3.101 "field3",id4.101 "field4"
id1.102 "field1",id2.102 "field2",id3.102 "field3",id4.102 "field4"
id1.103 "field1",id2.103 "field2",id3.103 "field3",id4.103 "field4"
Output:
id1,id2,id3,id4
id1.100 "field1",id2.100 "field2",id3.100 "field3",id4.100 "field4"
,,,
,,,
,,,
So is there a way to identify the id field by only its beginning? For example, if the id field were Name.105, could it be matched on just the "Name" string?
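One way to picture the prefix matching I'm asking about, as a hedged Python sketch: compare only the part of the label before the first ".", so that id1.100 and id1.101 both map to the id1 column. The names base_label and align_by_prefix are hypothetical, and the sketch assumes fields contain no commas and that the label never contains a space.

```python
def base_label(text):
    """Return the label up to the first '.', e.g. 'Name.105 "x"' -> 'Name'."""
    label = text.strip().split(" ", 1)[0]  # drop the data after the label
    return label.split(".", 1)[0]          # drop the changing numeric suffix


def align_by_prefix(lines):
    """Align rows to the header, matching labels on their prefix only."""
    header_keys = [base_label(h) for h in lines[0].split(",")]
    aligned = [lines[0]]
    for line in lines[1:]:
        by_key = {base_label(field): field for field in line.split(",")}
        aligned.append(",".join(by_key.get(key, "") for key in header_keys))
    return aligned


rows = [
    "id1,id2,id3,id4",
    'id1.101 "field1",id3.101 "field3",id2.101 "field2"',
]
for row in align_by_prefix(rows):
    print(row)
```

Because the header is reduced to the same prefix, it should not matter whether the header is written as id1 or id1.100.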
Repeating fields in data set:
Input:
id1.100,id2.100,id3.100,id4.100
id1.100 "field1",id2.100 "field2",id3.100 "field3",id3.100 "field3",id2.100 "field2"
id1.101 "field1",id2.101 "field2",id2.101 "field2",id3.101 "field3",id3.101 "field3"
id1.102 "field1",id2.102 "field2",id3.102 "field3",id4.103 "field4",id1.102 "field1"
Output:
id1.100,id2.100,id3.100,id4.100
id1.100 "field1",id2.100 "field2",id3.100 "field3",
id1.101 "field1",id2.101 "field2",id3.101 "field3",
id1.102 "field1",id2.102 "field2",id3.102 "field3",id4.103 "field4"
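The behaviour shown in this output, where only the first occurrence of a repeated field is kept, could be sketched in Python as below. Again this is just an assumption-laden illustration: label_key and align_first_wins are invented names, labels are matched on the part before the ".", and fields are assumed not to contain commas.

```python
def label_key(text):
    """Reduce a field or header to its label prefix, e.g. 'id3.100 "x"' -> 'id3'."""
    return text.strip().split(" ", 1)[0].split(".", 1)[0]


def align_first_wins(lines):
    """Align rows to the header, keeping the first of any repeated field."""
    header_keys = [label_key(h) for h in lines[0].split(",")]
    aligned = [lines[0]]
    for line in lines[1:]:
        first_seen = {}
        for field in line.split(","):
            # setdefault stores a field only if its key has not been seen yet.
            first_seen.setdefault(label_key(field), field)
        aligned.append(",".join(first_seen.get(k, "") for k in header_keys))
    return aligned


data = [
    "id1.100,id2.100,id3.100,id4.100",
    'id1.100 "field1",id2.100 "field2",id3.100 "field3",id3.100 "field3",id2.100 "field2"',
    'id1.102 "field1",id2.102 "field2",id3.102 "field3",id4.103 "field4",id1.102 "field1"',
]
for row in align_first_wins(data):
    print(row)
```

Keeping the last occurrence instead would only require replacing the setdefault call with a plain assignment.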