3

I have a file with lines that look like this:

chr5    153584000   153599999   D16073_orphan_reads.fa;709[F18|R11] unkn    1   unkn    2509

chr7    153764000   153775999   D16073_orphan_reads.fa;710[F9|R21],14892_orphan_reads.fa;229[F19|R16]   unkn        1   unkn    2510

chr3    127848000   127871999   B15971_orphan_reads.fa;172[F35|R6],D16157-14_orphan_reads.fa;183[F6|R13],14892_orphan_reads.fa;229[F19|R16],USP19283_orphan_reads.fa;336[F10|R6],D15927-14_orphan_reads.fa;176[F11|R10],1007,1007   46  1007    1658

(...)

I want to write a regex that extracts the FASTA file name (.fa) from each line (sometimes there is more than one file per line).

I would like to end up with something like:

D16073_orphan_reads.fa

D16073_orphan_reads.fa, 14892_orphan_reads.fa

B15971_orphan_reads.fa, D16157-14_orphan_reads.fa, 14892_orphan_reads.fa, USP19283_orphan_reads.fa, D15927-14_orphan_reads.fa

I tried:

 pattern= re.search(".+.[.fa]", line)

The problem is that the file names are very irregular. The only clues are:

-end with .fa

-start after the comma

thanks

2 Answers

1

The regex ([\w-]+\.fa); used in an re.findall() call will accomplish this.

import re

data = '''chr5    153584000   153599999   D16073_orphan_reads.fa;709[F18|R11] unkn    1   unkn    2509

chr7    153764000   153775999   D16073_orphan_reads.fa;710[F9|R21],14892_orphan_reads.fa;229[F19|R16]   unkn        1   unkn    2510

chr3    127848000   127871999   B15971_orphan_reads.fa;172[F35|R6],D16157-14_orphan_reads.fa;183[F6|R13],14892_orphan_reads.fa;229[F19|R16],USP19283_orphan_reads.fa;336[F10|R6],D15927-14_orphan_reads.fa;176[F11|R10],1007,1007   46  1007    1658
'''

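for line in data.splitlines():
    # Capture each run of word characters or hyphens ending in ".fa" that is followed by ";"
    filenames = re.findall(r'([\w-]+\.fa);', line)
    if filenames:  # skip blank lines, which yield no matches
        print(', '.join(filenames))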

output:

D16073_orphan_reads.fa
D16073_orphan_reads.fa, 14892_orphan_reads.fa
B15971_orphan_reads.fa, D16157-14_orphan_reads.fa, 14892_orphan_reads.fa, USP19283_orphan_reads.fa, D15927-14_orphan_reads.fa
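
The same idea can be wrapped in small helpers so it works on any iterable of lines, such as an open file. This is only a sketch: the names FASTA_NAME, fasta_names and names_per_line are made up for illustration, and it assumes every file name is immediately followed by a semicolon, as in the sample lines.

```python
import re

# Assumption: each file name is followed by ';', as in the sample data.
# Inside a character class '|' is a literal pipe, so the class is [\w-].
FASTA_NAME = re.compile(r'([\w-]+\.fa);')


def fasta_names(line):
    """Return the .fa file names found in one line, in order of appearance."""
    return FASTA_NAME.findall(line)


def names_per_line(lines):
    """Yield one comma-separated string per line that contains a file name."""
    for line in lines:
        names = fasta_names(line)
        if names:  # lines without any .fa name are skipped
            yield ', '.join(names)


sample = ("chr7    153764000   153775999   "
          "D16073_orphan_reads.fa;710[F9|R21],14892_orphan_reads.fa;229[F19|R16]"
          "   unkn        1   unkn    2510")

for joined in names_per_line([sample, ""]):
    print(joined)
```

Because names_per_line only iterates over its argument, a file object opened with open() could be passed in directly instead of a list.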

2 Comments

The [\w|-] character class matches a single word character (\w), a literal pipe |, or a literal hyphen -. I think you intended to write [\w-].
@stribizhev you're right. Thanks for the correction! Updated the answer to remove the pipe.
0

Try this pattern: ((?=\w+)[\w-]+\.fa)

See the demo here: https://regex101.com/r/uJ0vD4/3

Explanation

(?=\w+) : a lookahead that checks, without consuming anything, that one or more word characters follow at the current position.

[\w-]+ : matches one or more word characters or hyphens; together with the next part, this is what is captured after the lookahead succeeds.

\.fa : matches the literal .fa that ends the file name.
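
To see what the lookahead actually contributes, the pattern can be compared with a version that omits it. The snippet below is a small sketch; the variable names and the string '-abc.fa' are made up for illustration. On the sample line both patterns find the same name, and the only difference is that the lookahead forces a match to begin with a word character rather than a hyphen.

```python
import re

with_lookahead = re.compile(r'((?=\w+)[\w-]+\.fa)')
without_lookahead = re.compile(r'([\w-]+\.fa)')

sample = ("chr5    153584000   153599999   "
          "D16073_orphan_reads.fa;709[F18|R11] unkn    1   unkn    2509")

# Both patterns agree on the sample line from the question.
print(with_lookahead.findall(sample))     # ['D16073_orphan_reads.fa']
print(without_lookahead.findall(sample))  # ['D16073_orphan_reads.fa']

# Hypothetical input with a leading hyphen: the lookahead rejects a match
# that starts on '-', so the match begins one character later.
made_up = '-abc.fa'
print(with_lookahead.findall(made_up))     # ['abc.fa']
print(without_lookahead.findall(made_up))  # ['-abc.fa']
```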

3 Comments

Have a look at what your regex will match: D160||||73_orphan_reads|||.fa, 153584000fa, etc.
stribizhev, thanks for the correction. So you mean the lookahead or the | isn't necessary?
The | is not necessary, and the dot must be escaped, at least. I do not know about the lookahead.
