1

I want to be able to sort an input csv file that is comma separated by a values created in an extra column. Below is a sample of the input csv file

Timestamp,Email,Name,Year,Make,Model,Car_ID,Judge_ID,Judge_Name,Racer_Turbo,Racer_Supercharged,Racer_Performance,Racer_Horsepower,Car_Overall,Engine_Modifications,Engine_Performance,Engine_Chrome,Engine_Detailing,Engine_Cleanliness,Body_Frame_Undercarriage,Body_Frame_Suspension,Body_Frame_Chrome,Body_Frame_Detailing,Body_Frame_Cleanliness,Mods_Paint,Mods_Body,Mods_Wrap,Mods_Rims,Mods_Interior,Mods_Other,Mods_ICE,Mods_Aftermarket,Mods_WIP,Mods_Overall
8/5/2018 14:10,[email protected],Hernando,2015,Acura,TLX,48,J04,Bob,0,0,2,2,4,4,0,2,4,4,2,4,2,2,2,2,2,0,4,4,4,6,2,0,4
8/5/2018 15:11,[email protected],Noel,2015,Jeep,Wrangler,124,J02,Carl,0,6,4,2,4,6,6,4,4,4,6,6,6,6,6,4,6,6,6,6,6,4,6,4,6
8/5/2018 17:10,[email protected],Edan,2015,Lexus,Is250,222,J05,Adrian,0,0,0,0,0,0,0,0,6,6,6,0,0,6,6,6,0,0,0,0,0,0,0,0,4
8/5/2018 17:34,[email protected],Hieronymus,1993,Honda,Civic eG,207,J06,Aaron,0,0,2,2,2,2,2,2,0,4,2,2,2,2,2,2,4,2,2,0,0,0,2,2,0
8/5/2018 14:30,[email protected],Nickolas,2016,Ford,Mystang,167,J02,Carl,0,0,2,2,0,2,2,0,0,0,0,2,0,2,2,2,0,0,2,0,0,0,0,0,2
8/5/2018 16:12,[email protected],Martin,2013,Hyundai,Gen coupe,159,J04,Bob,0,0,2,0,0,0,2,0,0,0,0,2,0,2,2,0,2,0,2,0,0,0,0,0,0
8/5/2018 17:00,[email protected],Aldridge,2009,Infiniti,G37,20,J06,Aaron,2,0,2,2,0,0,2,0,0,2,2,2,2,2,2,2,2,2,4,2,2,0,2,0

What my code currently does is sift through the csv file, and pick out the car_id column, year, make, and model columns. Then it runs through every column from racer_turbo to the last, and for each row it adds up the values in those columns into a total value and prints that along side the other values (id, make, model, etc.). There is also a ranking column that precedes the other 5 when printed. Here is my code below.

BEGIN {
    FS = ",";
    OFS = "\t";

    print "Ranking", "Car_ID", "Year", "Make", "Model", "Total";
}
{
    rank;
    total = 0;

    if(NR > 1) {
        for(i = 8; i < NF; i++) {
            total += $i;
        }   
        print ++rank,$7, $4, $5, $6, total;
    }

    rows[$5][total][$0]
}
END {
    print "\n";
    print "Ranking", "Car_ID", "Year", "Make", "Model", "Total";
    ranking;

    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (m in rows) {
        n = asorti(rows[m], t, "@ind_num_desc");
        n = (n>3) ? 3 : n
        for(i = 1; i <= n; i++) for(s in rows[m][t[i]]) {
            $0 = s;
            $1 = ++r;
            print ++ranking, $7, $4, $5, $6, total;
        }
    }   
}

What I would like to do in the END block is print the output again, however, rank the cars by top three from each make using the total column which was created in the preceding block of the code. However, what I run my code now the output looks as follows

Ranking Car_ID  Year    Make    Model   Total
1   48  2015    Acura   TLX 58
2   124 2015    Jeep    Wrangler    118
3   222 2015    Lexus   Is250   36
4   207 1993    Honda   Civic eG    40
5   167 2016    Ford    Mystang 18
6   159 2013    Hyundai Gen coupe   14
7   20  2009    Infiniti    G37 36
...

Ranking Car_ID  Year    Make    Model   Total
1   113 2012    Acura   Tsx sportwagon  10
2   112 2008    Acura   TL  10
3   50  2015    Acura   TLX 10
4   15  2014    Audi    S4  10
5   18  2015    Audi    S3  10
6   116 2008    Audi    A4  10
7   2   2016    Bmw M2  10
8   172 2014    Bmw 4   10
9   28  1995    Bmw 318xi   10
...

See how in the total column on the second printed section it shows total is 10 for each printed car, instead of being the same values as they were in the first printed section for each respective car, and the highest 3 totals for each make being displayed.

Below is the expected output

Ranking Car_ID  Year    Make    Model   Total
1       48      2015    Acura   TLX     58                                                                                                                                                                                                                                                      
2       124     2015    Jeep    Wrangler        118                                                                                                                         
3       222     2015    Lexus   Is250   36                                                                                                                                  
4       207     1993    Honda   Civic eG        40                                                                                                                          
5       167     2016    Ford    Mystang 18                                                                                                                                  
6       159     2013    Hyundai Gen coupe       14                                                                                                                          
7       20      2009    Infiniti        G37     36                                                                                                                          
8       178     2009    Honda   Oddesy  66 
...

Ranking Car_ID  Year    Make    Model   Total
1       112     2008    Acura   TL      110
2       50      2015    Acura   TLX     102
3       127     2013    Acura   Tsx     86
4       15      2014    Audi    S4      120
5       18      2015    Audi    S3      38
6       116     2008    Audi    A4      28
7       2       2016    Bmw     M2      24 
8       172     2014    Bmw     4       22
9       111     2007    Bmw     328i    10
10      218     2010    Chevy   Camaro  64
11      170     2014    Chevy   Cruze   50
12      0       2015    Chevy   Camaro  0
...

Is this salvagable with my current code? Or would a better approach be to create a separate awk file that will sort through the generated output and produce another file that is sorted by the top 3?

I'm running GNU AWK v4.0.2.

4
  • Your input csv file is too short to produce your current output as well as your expected output. (We may extrapolate some data for testing purpose, though.) Commented Apr 5, 2022 at 5:38
  • @tshiono my input file is about 150 lines long, I just pasted a sample of it here because I didn't want to have a wall of it pasted here thats why I posted a snippet. I can add the full file if it will be more helpful. Commented Apr 5, 2022 at 5:40
  • 1
    Thank you for the prompt feedback. I understand your original csv is too long to be copy&pasted. I just wanted to say it will be much better if you can make the input data and the produced output coherent. No need for the full length of the file anyway. BR. Commented Apr 5, 2022 at 5:48
  • @Stephen don't post the full file, instead create a minimal reproducible example (5 or less lines of 5 or less fields should be enough for the input) that demonstrates your problem and the includes a few lines of sample input plus the exact expected output given that input. Commented Apr 5, 2022 at 13:59

1 Answer 1

1

Assuming the Car_ID (hereinafter referred to as id) is unique across the rows, would you please try:

BEGIN {
    FS = ","
    OFS = "\t"

    print "Ranking", "Car_ID", "Year", "Make", "Model", "Total"
}
{
    rank
    total = 0

    if (NR > 1) {
        for (i = 8; i < NF; i++) {
            total += $i
        }   
        print ++rank, $7, $4, $5, $6, total
        ttl[$5][$7] = total
        row[$7] = $0
    }
}
END {
    print "\n"
    print "Ranking", "Car_ID", "Year", "Make", "Model", "Total"
    ranking
    id

    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (m in ttl) {
        n = asorti(ttl[m], t, "@val_num_desc")
        n = (n>3) ? 3 : n
        for (i = 1; i <= n; i++) {
            id = t[i]
            total = ttl[m][id]
            $0 = row[id]
            print ++ranking, $7, $4, $5, $6, total
        }
    }
}

I have slightly modified the data structure, assigning the id as the main key. Then created a 2-D array ttl, which holds the value total keyed by make and id. In the END loop, we can retrieve the input data using the id.
As a side note, your original data structure uses total as an index. If multiple rows with the same make happen to have the same value of total, either of the indexes will be overwritten.

Sign up to request clarification or add additional context in comments.

2 Comments

Thank you, this did exactly what I needed!
Thank you for the feedback. Good to know it works. BR.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.