Create a new file with Names and path to the files in linux

Question

I have several fastq.gz files, both R1 and R2 in a directory dir on a Linux system. It looks like:

dir
 |____sampleA_1.fastq.gz
 |____sampleA_2.fastq.gz
 |____sampleB_1.fastq.gz
 |____sampleB_2.fastq.gz
 |____sampleC_1.fastq.gz
 |____sampleC_2.fastq.gz

I want to create a text file with the sample name as the first column, the path to the R1 fastq as the second column, and the path to the R2 fastq as the third column.

Inside dir I tried the following:

find "$PWD" -name \*1.fastq.gz > list1.txt
find "$PWD" -name \*2.fastq.gz > list2.txt

Then I would still have to merge those two files, add column names, and create another column with the sample names. Is there a way to produce the file with a single command instead?

The text file should look like this:

sample            Second                    Third
sampleA    dir/sampleA_1.fastq.gz     dir/sampleA_2.fastq.gz
sampleB    dir/sampleB_1.fastq.gz     dir/sampleB_2.fastq.gz
sampleC    dir/sampleC_1.fastq.gz     dir/sampleC_2.fastq.gz

Is that really the format your files are in? Aren't they something like sampleA_L001_R1_001.fastq.gz and sampleA_L001_R2_001.fastq.gz? Do you really only have one underscore (_) per file name? Also, can you be sure you will never have more than a single pair of reads per sample? Depending on coverage and size of the assay's target regions, you can have several.
– terdon ♦, Commented Jul 18, 2022 at 16:36
Assuming there's a common key between items (sample name?), a clever join might be able to do it.
– D. Ben Knoble, Commented Jul 19, 2022 at 3:10
You might garner more interest in your question if you add keywords in your first paragraph (something like "Paired-End Sequencing"), and add a reference link. This helps orient potential answerers.
– jubilatious1, Commented Jul 19, 2022 at 15:56

U. Windl · Accepted Answer · 2022-07-22 07:46:17Z

If you can guarantee that every sample always has both files, this bash/ksh code generates the output from the set of _1 (first-read) files:

Example (set up demo environment):

mkdir -p /tmp/710303/dir
cd /tmp/710303
touch dir/sample{A,B,C}_{1,2}.fastq.gz       # Assumes a { }-aware shell

File generation (run in the demo environment):

printf "%s %s %s\n" 'sample' 'Second' 'Third'
for f1 in dir/sample*_1.fastq.gz             # Loop through all first-read files
do
    fn="${f1##*/}"; fn="${fn%_1.fastq.gz}"   # Label: file name minus directory and suffix
    f2="${f1%_1.fastq.gz}_2.fastq.gz"        # Second-read file: only the trailing _1 changes
    printf "%s %s %s\n" "$fn" "$f1" "$f2"    # Output the values
done

Output

sample Second Third
sampleA dir/sampleA_1.fastq.gz dir/sampleA_2.fastq.gz
sampleB dir/sampleB_1.fastq.gz dir/sampleB_2.fastq.gz
sampleC dir/sampleC_1.fastq.gz dir/sampleC_2.fastq.gz

These are space-separated columns. If you want tab-separated output, change the printf format strings to use \t (tab) instead of a space.
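
For the follow-up raised in the comments (absolute paths and tab-separated columns), a minimal sketch along the same lines could look like this. The function name make_sample_table and the output file name samples.txt are only illustrative, and the sketch still assumes that every _1 file has a matching _2 file.

```shell
# make_sample_table DIR: print a tab-separated table of sample name,
# R1 path and R2 path for every *_1.fastq.gz file found in DIR.
# The function name is a hypothetical helper, not part of the answer above.
make_sample_table() {
    local dir f1 f2 name
    dir=$(cd -- "$1" >/dev/null && pwd) || return 1   # absolute path of DIR
    printf '%s\t%s\t%s\n' 'sample' 'Second' 'Third'
    for f1 in "$dir"/*_1.fastq.gz; do
        [ -e "$f1" ] || continue                      # glob matched nothing
        name=${f1##*/}                                # strip the directory
        name=${name%_1.fastq.gz}                      # strip the R1 suffix
        f2=${f1%_1.fastq.gz}_2.fastq.gz               # swap only the trailing _1
        printf '%s\t%s\t%s\n' "$name" "$f1" "$f2"
    done
}
```

It would be called as make_sample_table dir > samples.txt. Stripping the suffix from the end of the name, rather than replacing the first 1 anywhere in the path, keeps the result correct when the directory or sample name itself contains a digit.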

edited Jul 22, 2022 at 7:46

U. Windl

answered Jul 18, 2022 at 15:58

Chris Davies

It should give a path in the second and third columns. There is no need for a path in the first column. And how do I name those columns?
– stack_learner, Commented Jul 18, 2022 at 16:06
@stack_learner titles and path added
– Chris Davies, Commented Jul 18, 2022 at 16:11
Thank you very much. In the above script I used "$PWD" instead of dir.
– stack_learner, Commented Jul 18, 2022 at 20:32
The original version of the script would have suited you better; you would then only have needed to add $PWD to the output values.
– Chris Davies, Commented Jul 18, 2022 at 20:34

glenn jackman · Accepted Answer · 2022-07-18 16:24:59Z

This looks needlessly complicated, but it handles the case where only one of a sample's files is present:

{
    printf '%s\n' sample Second Third

    find ./dir/ -type f -name '*.fastq.gz' -print \
    | cut -d _ -f 1 \
    | sort -u \
    | bash -c '
        while read -r root; do
            echo "${root##*/}"
            for i in 1 2; do
                f="${root}_${i}.fastq.gz"
                [[ -f "$f" ]] && echo "$f" || echo ""
            done
        done
      ' 
} \
| paste - - - \
| column -s $'\t' -t

Testing:

mkdir dir
touch dir/sample{A,B,C}_{1,2}.fastq.gz
touch dir/sample{D_1,E_2}.fastq.gz
touch dir/ignore.me

Then the above command outputs

sample   Second                    Third
sampleA  ./dir/sampleA_1.fastq.gz  ./dir/sampleA_2.fastq.gz
sampleB  ./dir/sampleB_1.fastq.gz  ./dir/sampleB_2.fastq.gz
sampleC  ./dir/sampleC_1.fastq.gz  ./dir/sampleC_2.fastq.gz
sampleD  ./dir/sampleD_1.fastq.gz  
sampleE                            ./dir/sampleE_2.fastq.gz

Maybe this GNU awk version is a bit tidier:

find ./dir -type f | gawk -F/ -v OFS='\t' '
    BEGIN { print "sample", "Second", "Third" }
    match($NF, /^(.*)_([12])\.fastq\.gz$/, m) {
        file[m[1]][m[2]] = $0
    }
    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"
        for (sample in file)
            print sample, file[sample][1], file[sample][2]
    }
' | column -s $'\t' -t

This produces the same output as above.
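
Both versions hard-code the .fastq.gz suffix. As a rough sketch only, the same idea in plain shell with the suffix as an optional argument might look like this. The function name list_pairs and the NA placeholder for a missing mate are assumptions made for illustration; the placeholder keeps a missing column visible in plain tab-separated output.

```shell
# list_pairs [DIR] [SUFFIX]: one tab-separated row per sample root,
# printing NA where the _1 or _2 file does not exist.
# Hypothetical helper; assumes file names contain no newlines.
list_pairs() {
    local dir suffix f root r1 r2
    dir=${1:-.}
    suffix=${2:-.fastq.gz}
    printf '%s\t%s\t%s\n' sample Second Third
    for f in "$dir"/*_[12]"$suffix"; do
        [ -e "$f" ] || continue                  # glob matched nothing
        printf '%s\n' "${f%_[12]"$suffix"}"      # path without _1/_2 and suffix
    done | sort -u | while IFS= read -r root; do
        r1=${root}_1$suffix
        r2=${root}_2$suffix
        [ -f "$r1" ] || r1=NA                    # first-read file missing
        [ -f "$r2" ] || r2=NA                    # second-read file missing
        printf '%s\t%s\t%s\n' "${root##*/}" "$r1" "$r2"
    done
}
```

For example, list_pairs ./dir .fq.gz would look for files ending in _1.fq.gz and _2.fq.gz, while list_pairs ./dir keeps the default suffix.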

Sorry, does your code work if the files are tq.gz instead of fastq.gz?
– stack_learner, Commented Jul 25, 2022 at 14:56
I hardcode "fastq.gz", so no. Your question does not require otherwise.
– glenn jackman, Commented Jul 25, 2022 at 18:10

Ed Morton · Accepted Answer · 2022-07-19 00:49:49Z

$ cat tst.awk
BEGIN {
    FS="[/_]"; OFS="\t"
    print "sample", "Second", "Third"
}
NR%2 { second = $0; next }
{ print $2, second, $0 }

$ printf '%s\n' dir/* | awk -f tst.awk
sample  Second  Third
sampleA dir/sampleA_1.fastq.gz  dir/sampleA_2.fastq.gz
sampleB dir/sampleB_1.fastq.gz  dir/sampleB_2.fastq.gz
sampleC dir/sampleC_1.fastq.gz  dir/sampleC_2.fastq.gz
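
Because FS splits on both / and _, $2 is the sample name only for relative paths such as dir/sampleA_1.fastq.gz. A possible variant, sketched here under the same assumption that the sorted input alternates _1 and _2 files for every sample, derives the name from the last path component instead, so absolute paths such as "$PWD"/dir/* should work too. The wrapper name pair_table is hypothetical.

```shell
# pair_table FILE...: pair consecutive arguments (R1, then R2) and print
# a tab-separated table. Assumes sorted input with both files per sample.
pair_table() {
    printf '%s\n' "$@" | awk '
        BEGIN { OFS = "\t"; print "sample", "Second", "Third" }
        NR % 2 { r1 = $0; next }                 # odd lines: R1 path
        {
            name = r1
            sub(/.*\//, "", name)                # drop leading directories
            sub(/_1\.fastq\.gz$/, "", name)      # drop the R1 suffix
            print name, r1, $0                   # even lines: R2 path
        }'
}
```

It could be invoked as pair_table "$PWD"/dir/*.fastq.gz. Unlike the FS-based version, the sample name no longer depends on how many directory components or underscores precede the file name.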

answered Jul 19, 2022 at 0:49

Ed Morton

@Morton instead of dir/* I tried giving "$PWD", but it didn't work. Is there a way to do that?
– stack_learner, Commented Jul 26, 2022 at 14:24
$PWD contains the current directory name (Present Working Directory); you need to pass file names to the command. Sounds like you might want ./* or "$PWD"/*.
– Ed Morton, Commented Jul 26, 2022 at 14:37
Yes, I knew this. I tried "$PWD"/*: the Second and Third columns are OK, but the sample column doesn't show the right names; instead it shows the first directory of the present working directory.
– stack_learner, Commented Jul 26, 2022 at 15:31
idk, the script I provided produces the output you asked for from the sample input you provided, so if there's some other input it doesn't work for, then you'd have to provide that in your question. Looks like you already accepted an answer to this question, though, so you should ask a new follow-up question.
– Ed Morton, Commented Jul 26, 2022 at 15:37
