5

I have a file with two columns, and I want to use the values in the second column to set the range for the cut command, so that it selects a range of characters from another file. The range I want starts at the position given by the value in the second column and covers the next 10 characters. An example follows.

My files look like this:

File with 2 columns and no blank lines between lines (file1.txt):

NAME1 10
NAME2 25
NAME3 48
NAME4 66

File from which I want to extract the variable ranges of characters (just one very long line with no spaces) (file2.txt):

GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC

Desired resulting file, one sequence per line (result.txt):

GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT

The resulting file would have the characters from 10-20, 25-35, 48-58 and 66-76, each range on a new line. The range length always stays at 10, but the start points differ, and those start points are set by the values in the second column of the first file.

I tried the command:

for i in $(awk '{print $2}' file1.txt);
do
        p1=$i;
        p2=`expr "$1" + 10`
        cut -c$p1-$2 file2.txt > result.txt;
done

I don't get any output or error message.

I also tried:

while read line; do
    set $line
    p2=`expr "$2" + 10`
    cut -c$2-$p2 file2.txt > result.txt;
done <file1.txt

This last command gives me an error message:

cut: invalid range with no endpoint: -
Try 'cut --help' for more information.
expr: non-integer argument
2
  • It is a very good question for being the first. It is clear and it shows effort. Commented Nov 7, 2017 at 18:27
  • Remember to accept an answer. Commented Nov 7, 2017 at 19:20

4 Answers

4

There's no need for cut here; dd can do the job of indexing into a file and reading only the number of bytes you want. (Note that status=none is a GNUism; on other platforms you may need to leave it out and redirect stderr instead if you want to suppress informational logging.)

while read -r name index _; do
  dd if=file2.txt bs=1 skip="$index" count=10 status=none
  printf '\n'
done <file1.txt >result.txt

This approach avoids excessive memory requirements (as present when reading the whole of file2 -- assuming it's large), and has bounded performance requirements (overhead is equal to starting one copy of dd per sequence to extract).
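
For platforms where dd lacks status=none, the note above about redirecting stderr could be sketched roughly as follows. The demo_* file names and the sample data are made up for illustration; only the dd operands come from the answer.

```shell
#!/bin/sh
# Hypothetical sample inputs; substitute file1.txt and file2.txt.
printf 'NAME1 2\nNAME2 5\n' > demo_ranges.txt
printf 'ABCDEFGHIJKLMNOPQRSTUVWXYZ\n' > demo_seq.txt

while read -r name index _; do
  # With bs=1, skip= is a zero-based byte offset and count= is a byte count.
  # 2>/dev/null replaces status=none; note it also hides genuine dd errors.
  dd if=demo_seq.txt bs=1 skip="$index" count=10 2>/dev/null
  printf '\n'
done <demo_ranges.txt >demo_result.txt

cat demo_result.txt
```

Because skip= counts from zero, an index of 2 starts at the third character of the line.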


1 Comment

Great suggestion! Thank you very much. It works perfectly.
3

Using awk

$ awk 'FNR==NR{a=$0; next} {print substr(a,$2+1,10)}' file2 file1
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
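
The one-liner above can be unpacked into a commented sketch. The demo_* file names, the sample data and the variable name seq are illustrative and not part of the original answer; the logic is the same FNR==NR idiom.

```shell
#!/bin/sh
# Hypothetical sample inputs; substitute file1 and file2.
printf 'NAME1 2\nNAME2 5\n' > demo_ranges.txt
printf 'ABCDEFGHIJKLMNOPQRSTUVWXYZ\n' > demo_seq.txt

awk '
  # While reading the first file named on the command line, FNR equals NR:
  # remember its single line and move on.
  FNR == NR { seq = $0; next }
  # For the second file, $2 is the offset. substr() counts from 1,
  # so a zero-based offset needs the +1.
  { print substr(seq, $2 + 1, 10) }
' demo_seq.txt demo_ranges.txt > demo_result.txt

cat demo_result.txt
```

The order of the file arguments matters: the sequence file has to come first so that it is already stored when the offsets are read.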

3 Comments

Hmm. This stores all of file2 in memory, right? So it looks like it'd be a good solution if file1 is long (since awk loops much faster than a bash while read loop does), but not so much if file2 is long (beyond what can fit in RAM).
@CharlesDuffy When file2 is long, data=$(<file2.txt) in the other solution is a problem too.
Yes, I agree -- that's why I commented that I like that solution "if file2.txt is small/short" (and its author made the limitation clear in the surrounding prose), and why I think my own solution has a niche where it's the best choice (if file2.txt is potentially too large to store in RAM).
2

If file2.txt is not too large, then you can read it in memory, and use Bash sub-strings to extract the desired ranges:

data=$(<file2.txt)
while read -r name index _; do
  echo "${data:$index:10}"
done <file1.txt >result.txt

This will be much more efficient than running cut or another process for every single range definition.

(Thanks to @CharlesDuffy for the tip to read data without a useless cat, and the while loop.)
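
As a small illustration of the substring expansion used above (bash only, not plain sh), with a made-up sample string:

```shell
#!/bin/bash
# Made-up sample string, only for illustration.
data=ABCDEFGHIJKLMNOPQRSTUVWXYZ

# ${var:offset:length}: the offset counts from 0, so offset 5 starts at "F".
window=${data:5:10}
echo "$window"

# An offset close to the end yields the remaining characters, not an error.
tail_part=${data:20:10}
echo "$tail_part"
```

The zero-based offset is why the values from file1.txt can be used directly, without adding 1 as in the awk answer.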

2 Comments

data=$(<file2.txt), to avoid the cost of running an external cat. I agree that this is the best answer if file2.txt is small/short.
file2.txt is a full eukaryotic genome, so it's a large file, but your solution is great for small genomes, like prokaryotic ones. Thank you for your suggestion.
1

One way to solve it:

#!/bin/bash

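while read -r line; do
    # The second space-separated field is the start offset.
    pos=$(echo "$line" | cut -f2 -d' ')
    # Take the first pos+10 bytes, then keep only the last 10 of them.
    x=$(head -c "$(( pos + 10 ))" file2.txt | tail -c 10)
    echo "$x"
done < file1.txt > result.txt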

It's not the solution an experienced bash hacker would use, but it is very good for someone who is new to bash. It uses tools that are very versatile, although somewhat bad if you need high performance. Shell scripting is commonly used by people who rarely write shell scripts but know a few commands and just want to get the job done. That's why I'm including this solution, even if the other answers are superior for more experienced people.

The first line is pretty easy. It just extracts the numbers from file1.txt. The second line uses the very nice tools head and tail. Usually, they are used with lines instead of characters. Nevertheless, I print the first pos + 10 characters with head. The result is piped into tail which prints the last 10 characters.
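
To see the head/tail step in isolation, here is a sketch on a made-up 26-character string. The variable names seq, first and window are only for illustration.

```shell
#!/bin/sh
# Made-up sample string and offset, only for illustration.
seq=ABCDEFGHIJKLMNOPQRSTUVWXYZ
pos=5

# head -c keeps the first pos+10 = 15 bytes of its input.
first=$(printf '%s' "$seq" | head -c "$((pos + 10))")
echo "$first"

# tail -c 10 then keeps only the last 10 of those 15 bytes.
window=$(printf '%s' "$first" | tail -c 10)
echo "$window"
```

If the file holds fewer than pos + 10 bytes, tail still returns the last 10 bytes it receives, so near the end of the file the window no longer starts at pos.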

Thanks to @CharlesDuffy for improvements.

2 Comments

(I'd also suggest avoiding subshells in an inner loop, especially when they're trivially avoidable; every $( ... ) is a fork() and a wait()).
Since I'm a bash newbie, I really appreciate your solution. It makes sense to me. Thank you!
