Python, Regular expression for taking a file name in a string

Question

I have a file with lines that look like this:

chr5    153584000   153599999   D16073_orphan_reads.fa;709[F18|R11] unkn    1   unkn    2509

chr7    153764000   153775999   D16073_orphan_reads.fa;710[F9|R21],14892_orphan_reads.fa;229[F19|R16]   unkn        1   unkn    2510

chr3    127848000   127871999   B15971_orphan_reads.fa;172[F35|R6],D16157-14_orphan_reads.fa;183[F6|R13],14892_orphan_reads.fa;229[F19|R16],USP19283_orphan_reads.fa;336[F10|R6],D15927-14_orphan_reads.fa;176[F11|R10],1007,1007   46  1007    1658

(...)

I want to write a regex that extracts the FASTA file (.fa) names from each line (sometimes there is more than one file per line).

I would like to end up with something like:

D16073_orphan_reads.fa

D16073_orphan_reads.fa, 14892_orphan_reads.fa

B15971_orphan_reads.fa, D16157-14_orphan_reads.fa, 14892_orphan_reads.fa, USP19283_orphan_reads.fa, D15927-14_orphan_reads.fa

I tried:

 pattern= re.search(".+.[.fa]", line)

The problem is that the file names are very irregular. The only clues are:

- they end with .fa

- they start after a comma

thanks

Joe Young · Accepted Answer · 2015-09-20 15:25:24Z

1

The regex ([\w-]+\.fa); used in an re.findall() call will accomplish this.

import re

data = '''chr5    153584000   153599999   D16073_orphan_reads.fa;709[F18|R11] unkn    1   unkn    2509

chr7    153764000   153775999   D16073_orphan_reads.fa;710[F9|R21],14892_orphan_reads.fa;229[F19|R16]   unkn        1   unkn    2510

chr3    127848000   127871999   B15971_orphan_reads.fa;172[F35|R6],D16157-14_orphan_reads.fa;183[F6|R13],14892_orphan_reads.fa;229[F19|R16],USP19283_orphan_reads.fa;336[F10|R6],D15927-14_orphan_reads.fa;176[F11|R10],1007,1007   46  1007    1658
'''

for line in data.splitlines():
    # Raw string; [\w-] matches word characters and hyphens (no pipe needed)
    filenames = re.findall(r'([\w-]+\.fa);', line)
    if filenames:
        print(', '.join(filenames))

output:

D16073_orphan_reads.fa
D16073_orphan_reads.fa, 14892_orphan_reads.fa
B15971_orphan_reads.fa, D16157-14_orphan_reads.fa, 14892_orphan_reads.fa, USP19283_orphan_reads.fa, D15927-14_orphan_reads.fa
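If the data lives in a file rather than in a string, the same pattern can be applied line by line. This is only a minimal sketch: the helper name `fasta_names` and the shortened `sample` list are assumptions made for illustration, not part of the question.

```python
import re

# Same pattern as above, compiled once. The trailing ';' must be present
# but sits outside the group, so it is not part of the returned name.
FASTA_NAME = re.compile(r'([\w-]+\.fa);')

def fasta_names(lines):
    """Yield the list of .fa file names for each line that contains any."""
    for line in lines:
        names = FASTA_NAME.findall(line)
        if names:
            yield names

# Two lines copied from the question's data.
sample = [
    'chr5    153584000   153599999   D16073_orphan_reads.fa;709[F18|R11] unkn    1   unkn    2509',
    'chr7    153764000   153775999   D16073_orphan_reads.fa;710[F9|R21],14892_orphan_reads.fa;229[F19|R16]   unkn        1   unkn    2510',
]

for names in fasta_names(sample):
    print(', '.join(names))
```

Because iterating over an open file object yields its lines, a file handle obtained from `open()` can be passed to `fasta_names` in place of the list.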

edited Sep 20, 2015 at 15:25

answered Sep 20, 2015 at 15:07

Joe Young


2 Comments

Wiktor Stribiżew Over a year ago

The [\w|-] character class matches a single word character (\w), a literal pipe |, or a literal hyphen -. I think you intended to write [\w-].

Joe Young Over a year ago

@stribizhev you're right. Thanks for the correction! Updated the answer to remove the pipe
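The difference between the two character classes discussed in these comments can be checked with a short experiment. The test string `a|b-c` is made up for illustration:

```python
import re

text = 'a|b-c'

# Inside [...] the pipe is a literal character, not alternation,
# so this class also accepts '|'.
with_pipe = re.findall(r'[\w|-]+', text)

# Without the pipe, '|' ends the match and splits the string in two.
without_pipe = re.findall(r'[\w-]+', text)

print(with_pipe)     # ['a|b-c']
print(without_pipe)  # ['a', 'b-c']
```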

james jelo4kul · Accepted Answer · 2015-09-20 15:35:17Z

0

Try this pattern ((?=\w+)[\w-]+\.fa)

See demo here https://regex101.com/r/uJ0vD4/3

Explanation

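(?=\w+) : a lookahead that checks that one or more word characters follow at the current position, without consuming them. In effect, the match must start with a word character rather than a hyphen.

[\w-]+ : this is what is actually matched after the lookahead: one or more word characters or hyphens.

\.fa : a literal .fa, matched once the conditions above have been satisfied.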
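To see what the lookahead changes in practice, here is a small sketch comparing the pattern with and without it. The first input is taken from the question's data; the second, `-abc.fa;1`, is a made-up edge case.

```python
import re

with_lookahead = re.compile(r'((?=\w+)[\w-]+\.fa)')
without_lookahead = re.compile(r'([\w-]+\.fa)')

# Part of a line from the question: both patterns find the same names.
line = 'D16073_orphan_reads.fa;710[F9|R21],14892_orphan_reads.fa;229[F19|R16]'
print(with_lookahead.findall(line))
print(without_lookahead.findall(line))

# Made-up input with a leading hyphen: the lookahead rejects a match
# that would start on '-', so the match begins at 'a' instead.
odd = '-abc.fa;1'
print(with_lookahead.findall(odd))     # ['abc.fa']
print(without_lookahead.findall(odd))  # ['-abc.fa']
```

On the question's data the two patterns return the same file names. The lookahead only makes a difference when a candidate name begins with a hyphen.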

edited Sep 20, 2015 at 15:35

answered Sep 20, 2015 at 15:15

james jelo4kul


3 Comments

Wiktor Stribiżew Over a year ago

Have a look at what your regex will match: D160||||73_orphan_reads|||.fa, 153584000fa, etc.

james jelo4kul Over a year ago

stribizhev, thanks for the correction. So you mean the lookahead or the | isn't necessary?

Wiktor Stribiżew Over a year ago

| is not necessary, and the dot must be escaped, at least. I do not know about the lookahead.
