Remove specific character strings from text file using sed, no change to output file?

Question

Edit: added block quote text

I have a tab delimited text file (acc.paired.txt) of illumina sample names (head):

SRR10598163_R1.fastq.gz  SRR8916417_R2.fastq.gz
SRR10598049_R1.fastq.gz  SRR10598163_R2.fastq.gz  SRR8916418_R1.fastq.gz
SRR10598049_R2.fastq.gz  SRR10598164_R1.fastq.gz  SRR8916418_R2.fastq.gz
SRR10598050_R1.fastq.gz  SRR10598164_R2.fastq.gz  SRR8916419_R1.fastq.gz
SRR10598050_R2.fastq.gz  SRR10598165_R1.fastq.gz  SRR8916419_R2.fastq.gz
SRR10598051_R1.fastq.gz  SRR10598165_R2.fastq.gz  SRR8916420_R1.fastq.gz
SRR10598051_R2.fastq.gz  SRR10598166_R1.fastq.gz  SRR8916420_R2.fastq.gz
SRR10598052_R1.fastq.gz  SRR10598166_R2.fastq.gz  SRR8916421_R1.fastq.gz
SRR10598052_R2.fastq.gz  SRR10598167_R1.fastq.gz  SRR8916421_R2.fastq.gz
SRR10598053_R1.fastq.gz  SRR10598167_R2.fastq.gz  SRR8916422_R1.fastq.gz
SRR10598053_R2.fastq.gz  SRR10598168_R1.fastq.gz  SRR8916422_R2.fastq.gz
SRR10598054_R1.fastq.gz  SRR10598168_R2.fastq.gz  SRR8916423_R1.fastq.gz

and I'd like to make two changes: 1) remove duplicate sample names and 2) remove all characters after the specific sample name. My goal output is a tab delimited text file which contains just the SRR### numbers (no _R#.fastq.gz) with no duplicates. Example goal output:

SRR10598163
SRR8916417
SRR10598049
SRR8916418
SRR10598164
SRR10598050
SRR8916419
SRR10598165
SRR10598051
SRR8916420
SRR10598166
SRR10598052
SRR8916421
SRR10598167
SRR10598053
SRR8916422
SRR10598054
SRR10598168
SRR8916423

I turned to sed to remove character patterns:

`sed 's| _R1.fastq.gz||g' acc.paired.txt > out.txt`

But out.txt had no changes.

TIA.

Please do not post images of text, but the text itself. It is much easier to work with that. Looks like you have unwanted whitespace in your sed-statement. — markgraf
– markgraf, Commented May 22, 2023 at 17:44
I made the edits as requested, thank you @ilkkachu and markgraf. Also, my text file was constructed by navigating into the directory containing the zipped files and using dir > acc.paired.txt — Geomicro
– Geomicro, Commented May 22, 2023 at 18:32
You got your solution, but if you still like to know why your sed command failed: You seem to have a whitespace before the _R, so it will not match. — Philippos
– Philippos, Commented May 23, 2023 at 5:48
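The mismatch described in the comments can be reproduced with a small sketch. The one-line sample below is an assumption modelled on the question's data, not the real file: the pattern with the stray leading space leaves the line untouched, while the same pattern without the space (and with the dots escaped) strips the suffix.

```shell
# Hypothetical one-line sample modelled on the question's data.
line='SRR10598049_R1.fastq.gz  SRR10598163_R2.fastq.gz'

# Pattern from the question: the leading space means it only matches
# " _R1.fastq.gz", which never occurs, so the line comes back unchanged.
unchanged=$(printf '%s\n' "$line" | sed 's| _R1.fastq.gz||g')

# Same substitution without the stray space; dots escaped so they match literally.
fixed=$(printf '%s\n' "$line" | sed 's|_R1\.fastq\.gz||g')

printf '%s\n' "$unchanged" "$fixed"
```

This only handles the _R1 suffix; the answers below cover _R2 and the duplicates as well.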

Gilles Quénot · Accepted Answer · 2023-05-22 19:28:22Z

4

Using grep and sort:

grep -oE '\bSR[^_]+' file | sort -u

SRR10598049
SRR10598050
SRR10598051
[...]

The regular expression matches as follows:

Node	Explanation
`\b`	a word boundary: the position between a word character (\w) and a non-word character
`SR`	the literal string 'SR'
`[^_]+`	any character except `_`, one or more times (greedy, so it matches as many as possible)
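As a quick check of how the pattern behaves, here is a minimal sketch run on two made-up lines in the question's format (the sample text is an assumption, and it relies on a grep that supports -o and \b, as GNU grep does):

```shell
# Two hypothetical tab-separated lines in the same format as acc.paired.txt.
sample=$(printf 'SRR10598049_R1.fastq.gz\tSRR10598163_R2.fastq.gz\nSRR10598049_R2.fastq.gz\tSRR10598164_R1.fastq.gz\n')

# -o prints each match on its own line; every match starts at "SR" and
# stops just before the first underscore. sort -u then drops duplicates.
ids=$(printf '%s\n' "$sample" | grep -oE '\bSR[^_]+' | sort -u)

printf '%s\n' "$ids"
```

Each accession appears once in the result, whether it came from the _R1 or the _R2 file name.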

edited May 22, 2023 at 19:28

answered May 22, 2023 at 18:50

Gilles Quénot


Ed Morton · Accepted Answer · 2023-05-23 19:07:15Z

4

Using GNU awk for multi-char RS plus \s and \S shorthand for [[:space:]] and [^[:space:]]:

$ awk -v RS='_\\S+\\s*' '!seen[$0]++' file
SRR10598163
SRR8916417
SRR10598049
SRR8916418
SRR10598164
SRR10598050
SRR8916419
SRR10598165
SRR10598051
SRR8916420
SRR10598166
SRR10598052
SRR8916421
SRR10598167
SRR10598053
SRR8916422
SRR10598168
SRR10598054
SRR8916423
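If GNU awk is not available, roughly the same result can be had with a portable sketch that loops over the whitespace-separated fields instead of relying on a multi-character RS. The sample input is made up, and the suffix pattern _R[0-9]+.fastq.gz is an assumption about the file names:

```shell
# Hypothetical sample; fields may be separated by tabs or spaces.
ordered=$(printf 'SRR10598163_R1.fastq.gz\tSRR8916417_R2.fastq.gz\nSRR10598049_R1.fastq.gz  SRR10598163_R2.fastq.gz\n' |
awk '{
  for (i = 1; i <= NF; i++) {
    id = $i
    sub(/_R[0-9]+\.fastq\.gz$/, "", id)   # strip the read/extension suffix
    if (!seen[id]++) print id             # print each ID only the first time
  }
}')

printf '%s\n' "$ordered"
```

As with the one-liner above, the IDs come out in first-seen order rather than sorted.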

edited May 23, 2023 at 19:07

answered May 23, 2023 at 18:59

Ed Morton


grayhatter (edited by Kusalananda) · Accepted Answer · 2023-05-22 19:45:48Z

2

With GNU sed, the command would look like this:

sed 's/\s\+/\n/g;s/_R[0-9]\.fastq\.gz//g' acc.paired.txt | grep . | sort | uniq > out.txt

You can also do it with awk:

awk '{gsub(/_R[0-9]\.fastq\.gz/,"\n"); gsub(/\n[[:space:]]+/,"\n"); gsub(/^[[:space:]]+|\n$/,""); print}' acc.paired.txt | sort | uniq > out.txt

The second and third gsub calls remove the whitespace around the names and the trailing newline.

edited May 22, 2023 at 19:45

Kusalananda♦


answered May 22, 2023 at 19:09

grayhatter


FYI sort | uniq = sort -u.
– Ed Morton, Commented May 23, 2023 at 19:11

ilkkachu · Accepted Answer · 2023-05-22 18:51:16Z

You could

change all spaces to newlines with tr
remove everything that matches _R1.fastq.gz, _R2.fastq.gz and so on with sed
remove empty lines with grep
and sort the output, removing duplicates with sort:

% < acc.paired.txt tr ' ' '\n'  | sed -e 's/_R.\.fastq\.gz//' | grep . | sort -u
SRR10598049
SRR10598050
SRR10598051
SRR10598052
[...]

Apart from the ordering, the output is the same as shown in your question.

In regexes, . matches any character, and a literal dot is matched with \.. grep . keeps only lines that contain at least one character, which drops the empty lines that tr creates from back-to-back spaces. This also assumes the suffixes run only from R1 to R9, not R11 or so.
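The question describes the file as tab delimited, and tr ' ' '\n' only converts spaces. If the columns might be separated by tabs or a mix of both, a variant along these lines may be more forgiving. It is a sketch on made-up input, not a tested replacement for the pipeline above:

```shell
# Hypothetical input with leading spaces, a tab and double spaces as separators.
ids=$(printf '  SRR10598163_R1.fastq.gz\tSRR8916417_R2.fastq.gz\nSRR10598049_R1.fastq.gz  SRR10598163_R2.fastq.gz\n' |
  tr -s '[:space:]' '\n' |                # any whitespace becomes a newline; -s squeezes runs
  sed -e 's/_R[0-9]\.fastq\.gz$//' |      # strip the _R1 to _R9 suffix at the end of each name
  grep . |                                # drop the empty line left by leading whitespace
  sort -u)                                # sort and remove duplicates

printf '%s\n' "$ids"
```

Because tr -s squeezes repeated newlines, grep . only has to remove the single empty line that leading whitespace can leave at the top.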
