
Reposting this question as the previous answer didn't work, due to my lack of a minimal reproducible example (mea culpa). Sorry if this is basic, but I cannot get it to work and have spent many hours trying.

Please see previous question I posted earlier: Unix shell script select columns in csv file based on headers from another csv file

I created a CSV header file, where each row is the name of a column I want. In data_file.csv itself, the first row appears as follows, with each column header enclosed in quotes:

echo $(head -n 1 data_file.csv)
"eid","132421-0.0","132422-0.0","132423-0.0", ... 

The header file I created looks like this, with each column header on its own row, without quotes:

eid
24500-0.0
24503-0.0
24503-1.0
4526-0.0
4526-1.0

Notice there are no quotes. If I try to add quotes (manually) to the headers.csv file and then run cat again, I get three sets of quotes on each of the header rows (I don't know why):

"""eid"""
"""24500-0.0"""
"""24500-1.0"""
"""24503-0.0"""
"""24503-1.0"""
"""4526-0.0"""
"""4526-1.0"""

All I want to do is extract the 20 columns with the headers as listed in the headers.csv file from the enormous data_file.csv (which has 28,000 columns). Then I can load those into R and away I go.

The data itself is a mix of characters and numerics, with each field enclosed in quotes.
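As a sanity check before extracting anything, it may help to confirm that every name in headers.csv actually appears in the first row of data_file.csv. A sketch, assuming no header name contains a comma (actual_headers.txt is a temporary file introduced here for illustration):

```shell
# List the data file's headers one per line, stripping the quotes
head -n 1 data_file.csv | tr ',' '\n' | tr -d '"' > actual_headers.txt
# Print any requested header that is NOT present in the data file
grep -Fxv -f actual_headers.txt headers.csv
```

If this prints nothing, every requested header exists in the data file.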

@glenn_jackman suggested the following solution, but I hadn't pointed out the quotes:

awk '
    BEGIN {FS = OFS = ","}
    NR == FNR {wanted[$0] = 1; next}
    FNR == 1 {
        ncol = 0
        for (i = 1; i <= NF; i++)
            if ($i in wanted)
                columns[++ncol] = i
    }
    {
        for (i = 1; i <= ncol; i++)
            printf "%s%s", $columns[i], OFS
        print ""
    }
' headers.csv data_file.csv > selected_data_file.csv

Because the header file entries have no quotes while the data fields do, the lookup never matches, this fails, and I get a blank selected_data_file.csv.
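One way to make the lookup match (a sketch, assuming no header name itself contains a double quote; quoted_headers.csv is a name introduced here for illustration) is to wrap each line of headers.csv in quotes before feeding it to the awk script:

```shell
# Wrap every header name in double quotes so it matches the quoted data fields
sed 's/.*/"&"/' headers.csv > quoted_headers.csv
```

quoted_headers.csv then contains "eid", "24500-0.0", etc., matching the quoted fields in data_file.csv, and can be passed to awk in place of headers.csv.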

The output I am looking for is:

$ cat selected_data_file.csv
"eid", "24500-0.0", "24503-0.0", "24503-1.0", "4526-0.0", "4526-1.0"
"AB1","1","a","0","1.2",""

with the same number of rows as data_file.csv.

Don't know how to make it any clearer or more reproducible than that ... very many thanks for any help.

  • Can any of the header or data fields contain commas, or do commas only show up as field delimiters? Also, can any fields contain embedded linefeeds? Commented Jun 22, 2023 at 15:33
  • What is the output from running awk --version? Commented Jun 22, 2023 at 15:33
  • Your one-line input sample shows no spaces after the comma; the header line in your desired output shows a space after the comma, but no spaces (after the comma) in the data line. Please confirm a) whether spaces (after commas) exist in the input file and b) whether you really want spaces (after commas) in the output, but only for the header line. While this may sound like a nitpick, these inconsistencies add to the complexity of the code (assuming you're not using a tool geared specifically for processing CSV files). Commented Jun 22, 2023 at 15:36
  • FWIW, a minimal example would consist of, say, 5-6 columns, 3-4 lines (including the header), samples of fields that are indicative of your actual data file (e.g. fields including commas, fields not wrapped in quotes, spaces around the comma/delimiter), and the expected output would be a subset of those 5-6 columns. Commented Jun 22, 2023 at 15:49
  • If I understand the problem right, csvcut from the csvkit package should be pretty easy to make work. Commented Jun 22, 2023 at 16:11
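The csvcut suggestion from the last comment can be sketched as follows (assuming csvkit is installed and no header name contains a comma; csvcut -c accepts a comma-separated list of column names):

```shell
# Build the comma-separated column list csvcut expects from headers.csv
cols=$(paste -sd, - < headers.csv)
# Extract just those columns into a new file
csvcut -c "$cols" data_file.csv > selected_data_file.csv
```

csvcut handles the quoting and embedded commas itself, which is why the comment recommends a CSV-aware tool over plain text processing.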

2 Answers


You state in a comment that you are working with a UK Biobank dataset. Biobank provides a conversion utility, ukbconv, for Windows and Linux: https://biobank.ctsu.ox.ac.uk/crystal/download.cgi

According to this pdf and the official documentation, given a list of field numbers in a text file, the command to directly extract the relevant columns from the original file into a format suitable for R is:

ukbconv dataset.enc_ukb r -ifield_list.txt

The documentation distinguishes between "fields" and "columns":

For example, assume we just want to extract fields 31, 20204 and 40000 from our dataset and convert it to csv format. We create a text file called field_list.txt with the contents:

31
20204
40000

and place it into the same folder as ukbconv. We then run the command:

ukbconv ukb23456.enc_ukb csv -ifield_list.txt

The resulting output will only contain the eid column and columns for all instance and array combinations for Data-Fields 31, 20204 and 40000 (assuming those Data-Fields are present in the .enc_ukb file).

To assist with preparing the file giving the list of required fields, ukbconv outputs the file named field.ukb each time it is run which lists all the available fields associated with the dataset. This can be edited to identify the particular fields which are to be included in or excluded from the subset.

It also discusses "instance and array indices" and notes that in the case of csv:

column headers are in the format F-I.A where F is the Data-Field number, I is the instance index and A is the array index

That looks like the header format from your question, so it may be that by using ukbconv you will end up with more data than you asked for (i.e. extra columns). This may or may not be a problem for you.
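Since ukbconv takes bare Data-Field numbers rather than F-I.A column names, a sketch for deriving field_list.txt from a headers.csv like yours (assuming every name except eid has the F-I.A shape, so stripping everything from the first hyphen leaves the field number):

```shell
# Drop the eid row, strip the -I.A suffix, and keep unique field numbers
grep -v '^eid$' headers.csv | sed 's/-.*//' | sort -un > field_list.txt
```

The resulting field_list.txt can be passed to ukbconv via -ifield_list.txt as shown above.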



Assumptions/Understandings derived from the question and OP's comments:

  • all fields (including header fields) are enclosed in double quotes
  • fields are separated by commas and possibly spaces on either side of the comma
  • fields may contain commas
  • fields do not contain embedded linefeeds

Making up some sample data based on (above) assumptions:

$ cat headers.csv
eid
24503-1.0
4526-1.0

$ cat data_file.csv
"eid","24500-0.0", "24503-0.0", "24503-1.0","4526-0.0", "4526-1.0"
"AB1","1","a","0,111","1.2",""
"CD2","2","b","9","","-123,be"

One GNU awk solution (FPAT requires GNU awk):

awk '
BEGIN   { FPAT = "([^,]+)|(\"[^\"]+\")" }                   # define field patterns

        # remove following block if we do NOT have to worry about white space before/after the comma delimiter

        { for ( i=1;i<=NF;i++ )                             # for all fields ...
              gsub(/^[[:space:]]+|[[:space:]]+$/,"",$i)     # strip leading/trailing white space
        }

FNR==NR { hdr["\"" $1 "\""]                                 # 1st file: populate array of headers
          next
        }

FNR==1  { for ( i=1;i<=NF;i++ )                             # 2nd file: process header fields
              if ( $i in hdr )                              # if in our hdr[] array then 
                 cols[++colcnt] = i                         # populate array of columns making note of their order
        }
        { for ( i=1;i<=colcnt;i++ )                         # 2nd file: for each data line loop through list of desired columns and print to stdout
              printf "%s%s", $(cols[i]), (i<colcnt ? "," : ORS)
        }
' headers.csv data_file.csv

This generates:

"eid","24503-1.0","4526-1.0"
"AB1","0,111",""
"CD2","9","-123,be"

