1

Say I have 20 files. Ten of them end with .counts.tsv and the other ten end with .libsize.tsv. Each .counts.tsv file has a matching .libsize.tsv file. I would like to use a for loop to select each matching pair and run an R script on those two files. Here is what I tried:

#!/bin/bash
arti='/home/path/tofiles'
for counts in ${arti}/*__counts.tsv ; do
    for libsize in "$arti"/*__libsize.tsv ; do
        Rscript score.R  ${counts} ${libsize}
 done;
done;

The above shell script iterates over the files more than 200 times, whereas I have only 20 files. I need the Rscript to be executed 10 times, once for each matching pair of files. Any suggestions would be appreciated.

10
  • What do you want to do at the end of this script? Commented Jun 13, 2019 at 14:40
  • 1
    In the end, I need to execute the R script on each counts and libsize pair. Commented Jun 13, 2019 at 14:41
  • "10 times for both files" so 20 iterations total? Hopefully the files are named with similar first parts, i.e. do you have myFile.libsize.tsv and myFile.__counts.tsv? Then you only need one loop: strip the extension from the variable returned by the loop and add it back to two copies on your Rscript line, i.e. Rscript ${myF}.__counts.tsv ${myF}.__libsize.tsv. Good luck. Commented Jun 13, 2019 at 14:53
  • The Rscript should only run 10 times, hence 10 iterations. To be clearer: for every .count.tsv file there is a matching .libsize.tsv file, so there are 20 files in total. Therefore, the Rscript should only run 10 times. Commented Jun 13, 2019 at 14:55
  • 1
    Ah, let's take the R tag off this then. Commented Jun 13, 2019 at 16:16

5 Answers

3

I started typing up an answer before seeing your comment that you're only interested in a bash solution. I'm posting it anyway in case someone finds this question in the future and is open to an R-based solution.

If I were approaching this from scratch, I'd probably just use an R function defined in the file that takes the two file names instead of messing around with the system() calls, but this would provide the behavior you desire.

## Get a vector of files matching each extension (pattern is a regex, not a glob)
counts_names <- list.files(path = ".", pattern = "\\.counts\\.tsv$")
libsize_names <- list.files(path = ".", pattern = "\\.libsize\\.tsv$")

## Get the root names of the files before the extensions
counts_roots <- sub("\\.counts\\.tsv$", "", counts_names)
libsize_roots <- sub("\\.libsize\\.tsv$", "", libsize_names)

## Get only root names that have both file types
shared_roots <- intersect(libsize_roots, counts_roots)

## Loop through the shared root names and execute an Rscript call based on the two files
for (root in shared_roots) {

  counts_filename <- paste0(root, ".counts.tsv")
  libsize_filename <- paste0(root, ".libsize.tsv")

  ## shQuote() protects file names containing spaces or shell metacharacters
  command <- paste("Rscript score.R", shQuote(counts_filename), shQuote(libsize_filename))
  system(command)

}

Comments

3

Construct the second filename with ${counts%counts.tsv}, which removes the trailing counts.tsv from $counts.

#!/bin/bash
arti='/home/path/tofiles'
for counts in "${arti}"/*__counts.tsv ; do
    libsize="${counts%counts.tsv}libsize.tsv"
    Rscript score.R "${counts}" "${libsize}"
done
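
If a counts file might be missing its libsize partner, the same parameter expansion can be wrapped in a small guard. This is only a minimal sketch and not part of the answer above: the function name run_pairs and the RUNNER variable are hypothetical, introduced so the loop can be tried with echo before calling Rscript.

```shell
#!/bin/bash
# Sketch: run a command once per (counts, libsize) pair in a directory.
# run_pairs and RUNNER are hypothetical names used for illustration only.
run_pairs() {
    local dir=$1 counts libsize
    for counts in "$dir"/*__counts.tsv; do
        # If the glob matched nothing, bash leaves the pattern itself; skip it
        [ -e "$counts" ] || continue
        # Swap the trailing counts.tsv for libsize.tsv to get the partner file
        libsize="${counts%counts.tsv}libsize.tsv"
        if [ -f "$libsize" ]; then
            # RUNNER is deliberately unquoted so "Rscript score.R" splits into two words
            ${RUNNER:-Rscript score.R} "$counts" "$libsize"
        else
            echo "skipping $counts: no matching $libsize" >&2
        fi
    done
}

# Example call, assuming the directory from the question:
# run_pairs /home/path/tofiles
```

Setting RUNNER=echo before calling run_pairs prints each pair instead of running the R script, which is a cheap way to check that the loop runs once per sample.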

EDIT:
Trying to turn it into a one-liner is less safe. When the filenames contain no spaces or newlines, and sorting puts each counts file directly before its libsize file, you can risk an accident with

printf '%s\n' ${arti}/*counts.tsv ${arti}/*libsize.tsv | sort | xargs -n2 Rscript score.R

and when you feel really lucky (with no other files than those tsv files in $arti) make a bungee jump with

echo ${arti}/* | xargs -n2 Rscript score.R
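
Before trusting a one-liner like this, it may help to preview the pairs that xargs -n2 would build by putting echo in front of Rscript score.R. The following is a small sketch on a throwaway directory; demo_dir and the sample names sampleA and sampleB are made up for illustration.

```shell
#!/bin/bash
# Build a scratch directory with two hypothetical samples
demo_dir=$(mktemp -d)
touch "$demo_dir"/sampleA__counts.tsv "$demo_dir"/sampleA__libsize.tsv \
      "$demo_dir"/sampleB__counts.tsv "$demo_dir"/sampleB__libsize.tsv

# Globs expand in sorted order, so each counts file lands next to its libsize file;
# xargs -n2 then hands the names to the command two at a time
pairs=$(echo "$demo_dir"/* | xargs -n2 echo Rscript score.R)
echo "$pairs"
```

Each printed line is one Rscript call that would have run. If any line shows two counts files or two libsize files, the pairing is off and the explicit loop is the safer choice.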

2 Comments

Thanks, I have another solution posted below :)
Your solution is the same idea, but using both basename and awk is slower. In this case the performance won't matter; it becomes important when you want to loop through large files and do something for each line.
1

Have you tried list.files in base? This will allow you to use all files in the folder.

arti <- '/home/path/tofiles'
for (i in list.files(arti)) {
  # run your script on file i here
}

3 Comments

The files I need have two different extensions. Say I have files that end with counts.tsv and libsize.tsv; these files need to be selected separately for the Rscript. Hence, your solution won't work.
@user1017373: This is almost certainly going to be the right tool to use, though. Perhaps you'll need to separate the list somehow after you get it? Please clarify the question, it's not clear how with 10 files of each type, you want the script to run only 10 times. There's something you're not telling us...
@Aaron, thanks for the comment. Yes, for instance, I have 10 samples, each with a counts.tsv file and a matching libsize.tsv file. Therefore, in the end I need only 10 iterations, although the folder contains 20 files.
1

See whether the below helps.

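my_list <- list.files("./Data", full.names = TRUE)
counts <- grep("counts\\.tsv$", my_list, value = TRUE)
libsize <- grep("libsize\\.tsv$", my_list, value = TRUE)

# Files are paired by position, so both vectors must sort into the same sample order
for (i in seq_along(counts)) {
  system(paste("Rscript score.R", shQuote(counts[i]), shQuote(libsize[i])))
}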

2 Comments

This seems like a mix of bash and R and so wouldn't actually run; am I missing something?
The idea was to bring both the files simultaneously inside the for loop. Editing the answer.
0

Finally, I tried the following and it worked for me:

for sam in "$arti"/*__counts.tsv ; do
    filebase=$(basename "$sam")
    # Sample name is everything before the first "__" in the file name
    samples=$(printf '%s\n' "$filebase" | awk -F'__' '{print $1}')
    Rscript score.R "${samples}__counts.tsv" "${samples}__libsize.tsv"
done

For someone looking for something similar :)

Comments
