How to loop a variable range in cut command

Question

I have a file with two columns, and I want to use the values in the second column to set the range passed to the cut command, in order to select a range of characters from another file. The range I want starts at the position given by the value in the second column and includes the next 10 characters. An example follows.

My files look something like this:

File with 2 columns and no blank lines between lines (file1.txt):

NAME1 10
NAME2 25
NAME3 48
NAME4 66

File from which I want to extract the variable ranges of characters (just one very long line with no spaces and no bold font) (file2.txt):

GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC

Desired resulting file, one sequence per line (result.txt):

GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT

The resulting file would contain the characters from 10-20, 25-35, 48-58 and 66-76, each range on a new line. So the length of the range always stays at 10, but the start points differ, and those start points are set by the values in the second column of the first file.

I tried the command:

for i in $(awk '{print $2}' file1.txt);
do
        p1=$i;
        p2=`expr "$1" + 10`
        cut -c$p1-$2 file2.txt > result.txt;
done

I don't get any output or error message.

I also tried:

while read line; do
    set $line
    p2=`expr "$2" + 10`
    cut -c$2-$p2 file2.txt > result.txt;
done <file1.txt

This last command gives me an error message:

cut: invalid range with no endpoint: -
Try 'cut --help' for more information.
expr: non-integer argument

It is a very good question for a first one. It is clear and it shows effort.
– klutt, Commented Nov 7, 2017 at 18:27

Charles Duffy · Accepted Answer · 2017-11-07 18:47:58Z

4

There's no need for cut here; dd can do the job of indexing into a file and reading only the number of bytes you want. (Note that status=none is a GNUism; on other platforms you may need to leave it out and redirect stderr instead if you want to suppress the informational logging.)

while read -r name index _; do
  dd if=file2.txt bs=1 skip="$index" count=10 status=none
  printf '\n'
done <file1.txt >result.txt

This approach avoids excessive memory requirements (as present when reading the whole of file2 -- assuming it's large), and has bounded performance requirements (overhead is equal to starting one copy of dd per sequence to extract).
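For platforms where dd does not accept status=none, the same loop can be sketched with dd's stderr discarded instead. This is a minimal sketch, assuming a shortened copy of the sample data from the question; the scratch directory and the tmpdir variable are illustrative names, not part of the original answer. Note that 2>/dev/null also hides genuine dd error messages.

```shell
#!/bin/bash
# Sketch: the same dd loop, silencing dd's transfer statistics by
# redirecting stderr rather than using the GNU-only status=none.
tmpdir=$(mktemp -d)   # hypothetical scratch directory for this demo
printf 'NAME1 10\nNAME2 25\n' >"$tmpdir/file1.txt"
printf 'GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGC' >"$tmpdir/file2.txt"

while read -r name index _; do
  # skip is counted in blocks of bs bytes, so with bs=1 it is a byte offset
  dd if="$tmpdir/file2.txt" bs=1 skip="$index" count=10 2>/dev/null
  printf '\n'
done <"$tmpdir/file1.txt" >"$tmpdir/result.txt"

cat "$tmpdir/result.txt"
```

Because skip is a zero-based byte offset, skip=10 starts reading at the 11th character, which matches the desired output in the question.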

edited Nov 7, 2017 at 18:47

answered Nov 7, 2017 at 18:21

Charles Duffy

1 Comment

Fernanda Costa Over a year ago

Great suggestion! Thank you very much. It works perfectly.

Rahul Verma · Accepted Answer · 2017-11-07 18:28:26Z

3

Using awk

$ awk 'FNR==NR{a=$0; next} {print substr(a,$2+1,10)}' file2 file1
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
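The one-liner relies on two details that are easy to miss: FNR==NR is true only while awk is reading the first file named on the command line, and substr counts positions from 1, which is why the offset is written as $2+1. The following is a commented sketch of the same idea, assuming a shortened copy of the question's sample data; the variable name seq and the scratch directory are illustrative choices.

```shell
#!/bin/bash
# Sketch: the awk approach spelled out with comments.
tmpdir=$(mktemp -d)   # hypothetical scratch directory for this demo
printf 'NAME1 10\nNAME2 25\n' >"$tmpdir/file1"
printf 'GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGC\n' >"$tmpdir/file2"

awk '
  # First file (file2): FNR equals NR, so remember the sequence line.
  FNR == NR { seq = $0; next }
  # Second file (file1): column 2 is a zero-based offset; substr is 1-based.
  { print substr(seq, $2 + 1, 10) }
' "$tmpdir/file2" "$tmpdir/file1" >"$tmpdir/result.txt"

cat "$tmpdir/result.txt"
```

The order of the file arguments matters: the sequence file has to come first so that it is already stored when the offsets are read.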

answered Nov 7, 2017 at 18:28

Rahul Verma

3 Comments

Charles Duffy Over a year ago

Hmm. This stores all of file2 in memory, right? So it looks like it'd be a good solution if file1 is long (since awk loops much faster than a bash while read loop does), but not so much if file2 is long (beyond what can fit in RAM).

Walter A Over a year ago

@CharlesDuffy When file2 is long, the data=$(<file2.txt) in another solution is a problem too.

Charles Duffy Over a year ago

Yes, I agree -- that's why I commented that I like that solution "if file2.txt is small/short" (and its author made the limitation clear in the surrounding prose), and why I think my own solution has a niche where it's the best choice (if file2.txt is potentially too large to store in RAM).

janos · Accepted Answer · 2017-11-07 18:37:41Z

2

If file2.txt is not too large, then you can read it in memory, and use Bash sub-strings to extract the desired ranges:

data=$(<file2.txt)
while read -r name index _; do
  echo "${data:$index:10}"
done <file1.txt >result.txt

This will be much more efficient than running cut or another process for every single range definition.
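To make the indexing explicit: the offset in ${data:offset:length} is zero-based, whereas cut -c counts characters from 1. A small sketch, assuming a shortened copy of the sample sequence, that shows both forms selecting the same 10 characters (the variable names first and via_cut are illustrative):

```shell
#!/bin/bash
# Sketch: Bash substring expansion compared with the equivalent cut range.
data='GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGC'
index=10

# Zero-based offset, explicit length; no external process is started.
first=${data:index:10}

# cut is one-based and takes an inclusive end position.
via_cut=$(printf '%s\n' "$data" | cut -c"$((index + 1))-$((index + 10))")

printf '%s\n%s\n' "$first" "$via_cut"
```

Both lines print GATTCTTTTT, so an offset of 10 corresponds to cut -c11-20 rather than cut -c10-20.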

(Thanks to @CharlesDuffy for the tip to read data without a useless cat, and the while loop.)

edited Nov 7, 2017 at 18:37

answered Nov 7, 2017 at 18:26

janos

2 Comments

Charles Duffy Over a year ago

data=$(<file2.txt), to avoid the cost of running an external cat. I agree that this is the best answer if file2.txt is small/short.

Fernanda Costa Over a year ago

file2.txt is a full eukaryotic genome, so it's a large file, but your solution is great for small genomes, like prokaryotic ones. Thank you for your suggestion.

klutt · Accepted Answer · 2017-11-09 13:51:41Z

1

One way to solve it:

#!/bin/bash

while read line; do
    pos=$(echo "$line" | cut -f2 -d' ')
    x=$(head -c $(( $pos + 10 )) file2.txt | tail -c 10)
    echo "$x"
done < file1.txt > result.txt

It's not the solution an experienced bash hacker would use, but it is very good for someone who is new to bash. It uses tools that are very versatile, although somewhat slow if you need high performance. Shell scripting is commonly used by people who rarely write shell scripts but know a few commands and just want to get the job done. That's why I'm including this solution, even if the other answers are superior for more experienced people.

The first line of the loop body is pretty easy: it just extracts the number from the current line of file1.txt. The second line uses the very nice tools head and tail. Usually they are used with lines rather than characters. Here, head prints the first pos + 10 characters, and the result is piped into tail, which prints the last 10 of them.
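The head/tail step can be traced on a small input. This is a sketch only, assuming a hypothetical file letters.txt that holds the alphabet; the variable names prefix and window are also illustrative.

```shell
#!/bin/bash
# Sketch: tracing head -c followed by tail -c on a tiny file.
tmpdir=$(mktemp -d)   # hypothetical scratch directory for this demo
printf 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' >"$tmpdir/letters.txt"
pos=3

# Step 1: keep the first pos + 10 = 13 characters (A through M).
prefix=$(head -c "$((pos + 10))" "$tmpdir/letters.txt")

# Step 2: keep the last 10 of those, which drops the first pos characters.
window=$(printf '%s' "$prefix" | tail -c 10)

printf '%s\n%s\n' "$prefix" "$window"
```

With pos=3 this prints ABCDEFGHIJKLM and then DEFGHIJKLM. One caveat: if the file is shorter than pos + 10 characters, tail still returns the last 10 characters of whatever head produced, so the window starts earlier than pos.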

Thanks to @CharlesDuffy for improvements.

edited Nov 9, 2017 at 13:51

answered Nov 7, 2017 at 18:24

klutt

2 Comments

Charles Duffy Over a year ago

(I'd also suggest avoiding subshells in an inner loop, especially when they're trivially avoidable; every $( ... ) is a fork() and a wait()).

Fernanda Costa Over a year ago

Since I'm a bash newbie, I really appreciate your solution. It makes sense to me. Thank you!
