
I'm another bash scripting newbie (having just discovered it, it blew my mind! It's so exciting). What I want is a script that combines a LOT of .csv files into just one bigfile.csv, removing the headers and inserting my own header. I discovered the following solution:

awk 'FNR > 1' *.csv > bigfile.csv
sed -i 1i"Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..." bigfile.csv

Great! But when I try and use this file for analysis I'm getting errors because of bad lines. I had a look at it and indeed, there are a few crazy entries in there.

Luckily, every row that I want from the original .csv files has the entry "MD" in the first column. So does anyone know how I can tell awk to only take the lines from the .csv files that have "MD" in their first cell?

EDIT: Thanks for your help guys, it worked a charm! Unfortunately there's still some weird data in there

CParserError: Error tokenizing data. C error: Expected 51 fields in line 6589, saw 54

With a simple adjustment, is there a way to again only take lines with 51 fields?
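For reference, a quick way to see the distribution of field counts per line (a sketch, assuming comma-separated fields) is:

awk -F, '{print NF}' bigfile.csv | sort -n | uniq -c

Each output line shows how many rows have a given field count, which makes stray 54-field rows easy to spot.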

  • Provide an example of what the .csv file looks like. Commented Jul 14, 2016 at 11:51
  • I added a condition in my answer which addresses your 51-field requirement. Check it out below. :-) Commented Jul 14, 2016 at 12:34
  • You have got a few solutions down there, but if you're working with large, complicated csv files, I suggest you employ perl for the job. And do have a look at this library. Commented Jul 14, 2016 at 12:35
  • Check my updated answer! Commented Jul 14, 2016 at 12:57
  • You never need sed when you're using awk. edit your question to include concise, testable sample input and expected output and we can help you. Commented Jul 15, 2016 at 1:33

3 Answers


I'm going to go out on a limb here and assume that the line you're adding with sed is actually the headers that you're stripping off.

If that's the case, I'd suggest you skip the sed line, and just tell awk to strip the first line on files that are not the first one.

Next, if you only want lines containing the text MD in the first field, you can test that with a simple regex.

awk -F, '
    FNR==1 && NR > 1 { next }  # skip the header on all but the first file
    NF != 51 { next }          # skip this line if field count is wrong
    $1 ~ /MD/                  # print the line if the first field matches
' *.csv > /path/to/outputfile.csv
  • The -F, option tells awk to split fields using a comma as field separator.
  • NR is the total number of records processed, while FNR is the current record number in the current file.
  • A condition with no commands assumes print as the command (printing the current line).
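For instance, here is the bare-condition behaviour in isolation (a minimal sketch, using seq to generate input):

$ seq 3 | awk 'NR > 1'
2
3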

You can of course put this entire awk script on one line if you like. I split it out for easier reading.

If your outputfile.csv is in the same directory where you are getting your "glob" of input csv files, then be aware that the new file will be created by the shell, not by awk, and might also be processed as an input file. This could be a concern if you were planning to append your redirect to an existing file with >>.
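One simple way to sidestep that (a sketch; the output filename is illustrative) is to give the combined file a name the glob cannot match:

awk -F, '
    FNR==1 && NR > 1 { next }
    NF != 51 { next }
    $1 ~ /MD/
' *.csv > combined.out

Since combined.out never matches *.csv, repeated runs, including appends with >>, will not pick it up as input.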

UPDATE

As you've mentioned that the headers you're adding are different from the ones you strip off, you can still avoid using a separate command like sed by changing the awk script to something like this:

awk -F, '
    BEGIN {
      print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."
    }
    FNR==1 { next }            # skip the header on all files
    NF != 51 { next }          # skip this line if field count is wrong
    $1 ~ /MD/                  # print the line if the first field matches
' *.csv > /path/to/outputfile.csv

Commands within awk's BEGIN block are executed before any input lines are processed, so if you print new headers there, they will appear at the beginning of your (redirected) output. (Note that there is a similar END block if you want to generate a footer/summary/etc after all input has been processed.)
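For example, an END block could report how many rows were kept (a minimal sketch; /dev/stderr is supported by gawk and most modern awks):

awk -F, '
    FNR==1 { next }
    NF != 51 { next }
    $1 ~ /MD/ { kept++; print }
    END { print "rows kept: " kept > "/dev/stderr" }
' *.csv > /path/to/outputfile.csv

Printing the summary to /dev/stderr keeps it out of the redirected CSV output.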


6 Comments

Thanks this is good to note, but no, the headers were not contained in each file :p
@ChristopherTurnbull - Sounds good, I've added an update that still keeps things contained within awk.
Did you mean NF != 51?
An ^ for great explanation and great formatting too
Excellent! Works like a treat. Thank you!
awk 'BEGIN{print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."}
     FNR > 1{print}' *.csv > bigfile.csv

FNR resets for each file that awk processes, but NR doesn't, so NR==FNR holds only while the first file is being read.


A small illustration (with my own test data, of course):

$ cat f1
Name,Roll
A,10
B,5
$ cat f2
Name,Roll
C,56
D,44
$ awk 'BEGIN{print "Naam,RollNo"}FNR > 1{print}' f*>final
$ cat final 
Naam,RollNo
A,10
B,5
C,56
D,44

Note

As you can see, the new header for the final file goes into awk's BEGIN section, which is executed only once, before any input is read.


Coming to your objective

Every row that I want from the original .csv files has the entry "MD" in the first column

awk 'BEGIN{FS=",";print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."}
     if(FNR > 1 && $1 == "MD" && NF == 51){print}' *.csv > bigfile.csv

Notes

This one has a few differences from the first, general case.

  • It introduces , as the field separator
  • FNR > 1 && $1 == "MD" && NF == 51 means: skip the header line of each file (FNR == 1), and print a line only when its first field is MD ($1 == "MD") and it contains exactly 51 fields (NF == 51)

The Idiomatic way

As @ghoti mentioned in his comment:

awk's "default" command is already {print}

So the above script may be re-written as:

awk 'BEGIN{FS=",";print "Ident - MD,Node ID,Date,Time,Sub Seq#,NO2..."}
         (FNR > 1 && NF == 51 && $1 == "MD")' *.csv > bigfile.csv

5 Comments

Maybe I am missing something here, but how does this print only rows with MD in column 1, as the OP needs?
@Inian : I will update shortly pal.. My primary objective was to show that he need not use awk and sed together.
@sjsam - note that awk's "default" command is already {print}, so you can leave that out of your script, and it will behave the same way.
@ghoti : I have updated the answer with a more idiomatic solution.
@EdMorton : No specific reason for that. Any disadvantage here?

A fancy one-liner would look like:

awk -F',' 'FNR > 1 && $1 ~ /^MD/ && NF == 51 { print }' *.csv > /someotherpath/bigfile.csv

A proper way, with a complete bash script instead of a fancy one-liner, would be something like:

#!/bin/bash

# Assuming the '.csv' files are comma-separated

for i in *.csv; do
    [ -e "$i" ] || continue    # handle the case where no input *.csv files are present
    awk -F',' 'FNR > 1 && $1 ~ /^MD/ && NF == 51 { print }' "$i"
done > /someotherpath/bigfile.csv    # redirect once, outside the loop, so earlier files are not overwritten

The crux of the solution is awk's FNR and NF variables: FNR tracks the current record number within each input file, and NF holds the number of fields in the current record. So FNR > 1 skips each file's header line, $1 ~ /^MD/ keeps only the lines whose first column starts with MD, and NF == 51 keeps only the lines containing exactly 51 fields.
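Note that $1 ~ /^MD/ is a prefix match rather than an equality test, so it would also accept a first field like MD2. A quick illustration (the input lines are made up):

$ printf 'MD,1\nMD2,2\n' | awk -F, '$1 ~ /^MD/'
MD,1
MD2,2
$ printf 'MD,1\nMD2,2\n' | awk -F, '$1 == "MD"'
MD,1

If the field is always exactly MD, the string comparison is the stricter choice.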

1 Comment

You could probably eliminate your [ -e "$i" ] test by setting shopt -s nullglob in bash.
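For reference, a minimal sketch of that approach:

#!/bin/bash
shopt -s nullglob            # unmatched globs expand to nothing instead of themselves
for i in *.csv; do
    echo "processing $i"     # never runs when no .csv files exist
done

Without nullglob, an unmatched *.csv would be passed through literally as the string "*.csv", which is why the [ -e "$i" ] guard is otherwise needed.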
