
I'm another bash scripting newbie (having just discovered it, it blew my mind! It's so exciting). What I want is a script that combines a LOT of .csv files into just one bigfile.csv, removing the headers and inserting my own header. I discovered the following solution:

awk 'FNR > 1' *.csv > bigfile.csv
sed -i 1i"Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..." bigfile.csv

Great! But when I try and use this file for analysis I'm getting errors because of bad lines. I had a look at it and indeed, there are a few crazy entries in there.

Luckily, every row that I want from the original .csv files has the entry "MD" in the first column. So does anyone know how I can tell awk to only take the lines from the .csv files that have "MD" in their first cell?

EDIT: Thanks for your help guys, it worked a charm! Unfortunately there's still some weird data in there

CParserError: Error tokenizing data. C error: Expected 51 fields in line 6589, saw 54

With a simple adjustment, is there a way to again only take lines with 51 fields?
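For reference, a quick way to see the distribution of field counts per line (a sketch, assuming comma-separated fields) is:

awk -F, '{print NF}' bigfile.csv | sort -n | uniq -c

Each output line shows how many rows have a given field count, which makes stray 54-field rows easy to spot.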

  • Provide an example of what the .csv file looks like. Commented Jul 14, 2016 at 11:51
  • I added a condition in my answer which addresses your 51-field requirement. Check it out below. :-) Commented Jul 14, 2016 at 12:34
  • You have got a few solutions down there, but if you're working with large, complicated csv files, I suggest you employ perl for the job. And do have a look at this library. Commented Jul 14, 2016 at 12:35
  • Check my updated answer! Commented Jul 14, 2016 at 12:57
  • You never need sed when you're using awk. edit your question to include concise, testable sample input and expected output and we can help you. Commented Jul 15, 2016 at 1:33

3 Answers


I'm going to go out on a limb here and assume that the line you're adding with sed is actually the headers that you're stripping off.

If that's the case, I'd suggest you skip the sed line, and just tell awk to strip the first line on files that are not the first one.

Next, if you only want lines containing the text MD in the first field, you can test that with a simple regex.

awk -F, '
    FNR==1 && NR > 1 { next }  # skip the header on all but the first file
    NF != 51 { next }          # skip this line if field count is wrong
    $1 ~ /MD/                  # print the line if the first field matches
' *.csv > /path/to/outputfile.csv
  • The -F, option tells awk to split fields using a comma as field separator.
  • NR is the total number of records processed, while FNR is the current record number in the current file.
  • A condition with no commands assumes print as the command (printing the current line).
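For instance, here is the bare-condition behaviour in isolation (a minimal sketch, using seq to generate input):

$ seq 3 | awk 'NR > 1'
2
3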

You can of course put this entire awk script on one line if you like. I split it out for easier reading.

If your outputfile.csv is in the same directory where you are getting your "glob" of input csv files, then be aware that the new file will be created by the shell, not by awk, and might also be processed as an input file. This could be a concern if you were planning to append your redirect to an existing file with >>.
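One simple way to sidestep that (a sketch; the output filename is illustrative) is to give the combined file a name the glob cannot match:

awk -F, '
    FNR==1 && NR > 1 { next }
    NF != 51 { next }
    $1 ~ /MD/
' *.csv > combined.out

Since combined.out never matches *.csv, repeated runs, including appends with >>, will not pick it up as input.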

UPDATE

As you've mentioned that the headers you're adding are different from the ones you strip off, you can still avoid using a separate command like sed by changing the awk script to something like this:

awk -F, '
    BEGIN {
      print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."
    }
    FNR==1 { next }            # skip the header on all files
    NF != 51 { next }          # skip this line if field count is wrong
    $1 ~ /MD/                  # print the line if the first field matches
' *.csv > /path/to/outputfile.csv

Commands within awk's BEGIN block are executed before any input lines are processed, so if you print new headers there, they will appear at the beginning of your (redirected) output. (Note that there is a similar END block if you want to generate a footer/summary/etc after all input has been processed.)
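For example, an END block could report how many rows were kept (a minimal sketch; /dev/stderr is supported by gawk and most modern awks):

awk -F, '
    FNR==1 { next }
    NF != 51 { next }
    $1 ~ /MD/ { kept++; print }
    END { print "rows kept: " kept > "/dev/stderr" }
' *.csv > /path/to/outputfile.csv

Printing the summary to /dev/stderr keeps it out of the redirected CSV output.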


6 Comments

Thanks this is good to note, but no, the headers were not contained in each file :p
@ChristopherTurnbull - Sounds good, I've added an update that still keeps things contained within awk.
Did you mean NF != 51?
An ^ for great explanation and great formatting too
Excellent! Works like a treat. Thank you!
awk 'BEGIN{print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."}
     FNR > 1{print}' *.csv > bigfile.csv

FNR resets for each file that awk processes, but NR doesn't, so NR==FNR holds only while the first file is being read.


A small illustration (with my own test data, of course):

$ cat f1
Name,Roll
A,10
B,5
$ cat f2
Name,Roll
C,56
D,44
$ awk 'BEGIN{print "Naam,RollNo"}FNR > 1{print}' f*>final
$ cat final 
Naam,RollNo
A,10
B,5
C,56
D,44

Note

As you can see, the new header for the final file goes into awk's BEGIN section, which is executed only once, before any input is read.


Coming to your objective

Every row that I want from the original .csv files has the entry "MD" in the first column

awk 'BEGIN{FS=",";print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."}
     if(FNR > 1 && $1 == "MD" && NF == 51){print}' *.csv > bigfile.csv

Notes

This one has a few differences from the first, general case.

  • It introduces , as the field separator
  • FNR > 1 && $1 == "MD" && NF == 51 means: skip the header line of each file (FNR == 1), and print a line only when its first field is MD ($1 == "MD") and it contains exactly 51 fields (NF == 51)

The Idiomatic way

As @ghoti mentioned in his comment:

awk's "default" command is already {print}

So the above script may be re-written as:

awk 'BEGIN{FS=",";print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."}
         (FNR > 1 && NF == 51 && $1 == "MD")' *.csv > bigfile.csv

5 Comments

Maybe I am missing something here, but how does this print only rows with MD in column 1, as the OP needs?
@Inian : I will update shortly pal.. My primary objective was to show that he need not use awk and sed together.
@sjsam - note that awk's "default" command is already {print}, so you can leave that out of your script, and it will behave the same way.
@ghoti : I have updated the answer with a more idiomatic solution.
@EdMorton : No specific reason for that. Any disadvantage here?

A fancy one-liner would look like:

awk -F',' 'FNR > 1 && $1 ~ /^MD/ && NF == 51 { print }' *.csv > /someotherpath/bigfile.csv

A proper way, with a complete bash script instead of a fancy one-liner, would be something like:

#!/bin/bash

# Assuming the '.csv' files are comma-separated

for i in *.csv; do
    [ -e "$i" ] || continue    # handle the case where no input *.csv files are present
    awk -F',' 'FNR > 1 && $1 ~ /^MD/ && NF == 51 { print }' "$i"
done > /someotherpath/bigfile.csv    # redirect once, outside the loop, so earlier files are not overwritten

The crux of the solution is awk's FNR and NF variables: FNR tracks the current record number within each input file, and NF holds the number of fields in the current record. So FNR > 1 skips each file's header line, $1 ~ /^MD/ keeps only the lines whose first column starts with MD, and NF == 51 keeps only the lines containing exactly 51 fields.
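Note that $1 ~ /^MD/ is a prefix match rather than an equality test, so it would also accept a first field like MD2. A quick illustration (the input lines are made up):

$ printf 'MD,1\nMD2,2\n' | awk -F, '$1 ~ /^MD/'
MD,1
MD2,2
$ printf 'MD,1\nMD2,2\n' | awk -F, '$1 == "MD"'
MD,1

If the field is always exactly MD, the string comparison is the stricter choice.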

1 Comment

You could probably eliminate your [ -e "$i" ] test by setting shopt -s nullglob in bash.
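For reference, a minimal sketch of that approach:

#!/bin/bash
shopt -s nullglob            # unmatched globs expand to nothing instead of themselves
for i in *.csv; do
    echo "processing $i"     # never runs when no .csv files exist
done

Without nullglob, an unmatched *.csv would be passed through literally as the string "*.csv", which is why the [ -e "$i" ] guard is otherwise needed.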
