Keep substring from a long string in python?

Question

I have a fasta file with headers like this:

612407518| Streptomyces sp. MJ635-86F5 DNA, cremimycin biosynthetic gene cluster, complete sequence
84617315| Streptomyces achromogenes subsp. rubradiris complete rubradirin biosynthetic gene cluster, strain NRRL 3061
345134845| Streptomyces sp. SN-593 DNA, reveromycin biosynthetic gene cluster, complete sequence
323700993| Streptomyces autulyticus strain CGMCC 0516 geldanamycin polyketide biosynthetic gene cluster, complete sequence
15823967| Streptomyces avermitilis oligomycin biosynthetic gene cluster
1408941746| Streptomyces sp. strain OUC6819 rdm biosynthetic gene cluster, complete sequence
315937014| Uncultured organism CA37 glycopeptide biosynthetic gene cluster, complete sequence
29122977| Streptomyces cinnamonensis polyether antibiotic monensin biosynthetic gene cluster, partial sequence
257129259| Moorea producens 19L curacin A biosynthetic gene cluster, partial sequence
166159347| Streptomyces sahachiroi azinomycin B biosynthetic gene cluster, partial sequence

I want to keep only the word right before "biosynthetic gene cluster" in each header description, so that the result looks like this:

 612407518|cremimycin
 84617315|rubradirin
 345134845|reveromycin
 323700993|polyketide
 15823967|oligomycin
 1408941746|rdm
 315937014|glycopeptide
 29122977|monensin
 257129259|curacin A
 166159347|azinomycin B

Here's what I've tried on my original files with more than 200 headers:

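with open("test.txt") as f:
    for line in f:
        # Split "id| description" at the first pipe only.
        (id, name) = line.strip().split('|', 1)
        term_list = name.split()
        # Position of 'biosynthetic'; the wanted word sits just before it.
        term_index = term_list.index('biosynthetic')

        term = term_list[term_index - 1]

        header = id + '|' + term
        print(header)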

The result is good, although the last two headers in my example above yield this:

257129259|A
166159347|B

I'll work on the 2nd problem because my original data contain lots of these.

Thank you all for the comments.

I think a full regex would be like this: (\d*\|)(?:.*)\s(\w+)\s(?=bio), where group 1 is the number followed by | and group 2 is the word.
– MegaBluejay, commented Nov 15, 2018 at 20:17

holdenweb · Accepted Answer · 2018-11-15 20:14:55Z

2

A simpler solution than regex would be:

Split the string on "|", assigning the two components to variables id and s.
Split s into words.
Find the position of "biosynthetic" in the resulting list.
Verify that it is followed by "gene" and "cluster" (the latter may carry a trailing comma).
Print id followed by the word preceding "biosynthetic".

I've deliberately not written the code. If you try it and edit your attempt into the question, others will probably respond and tell you how to get it working (assuming you can't do that on your own).

Good luck!
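
For later readers who want something to compare their own attempt against, the five steps might be sketched roughly as follows. The function name shorten_header is made up for illustration, and tolerating a trailing comma on "cluster," is an assumption based on the sample headers in the question.

```python
def shorten_header(line):
    """Return 'id|word', where word is the one before 'biosynthetic gene cluster'."""
    # Step 1: split on the first "|" into the id and the description.
    seq_id, description = line.strip().split("|", 1)
    # Step 2: split the description into words.
    words = description.split()
    # Step 3: find the position of "biosynthetic" (ValueError if absent).
    i = words.index("biosynthetic")
    # Step 4: verify it is followed by "gene" and "cluster".
    # Assumption: a trailing comma, as in "cluster,", is acceptable.
    following = [w.rstrip(",") for w in words[i + 1:i + 3]]
    if i == 0 or following != ["gene", "cluster"]:
        raise ValueError(f"unexpected header format: {line!r}")
    # Step 5: the id followed by the word preceding "biosynthetic".
    return f"{seq_id}|{words[i - 1]}"


print(shorten_header(
    "15823967| Streptomyces avermitilis oligomycin biosynthetic gene cluster"
))
```

Like the attempt in the question, this keeps only a single word, so a name such as "curacin A" still comes out as "A".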

Charlie G · Accepted Answer · 2018-11-15 20:24:58Z

0

An answer that does not use regex. It raises ValueError if a header is not in the specified format (i.e. it must always contain "biosynthetic gene cluster", always use | to delimit the id, and always have a space before the desired word).

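# header is a single header line from the file
id = header[:header.index("|")+1]                 # the id, including the trailing "|"
end = header.index(" biosynthetic gene cluster")  # where the marker phrase starts
word = header[header[:end].rindex(" ")+1:end]     # last word before the marker
new_title = id + word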
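
As a usage sketch, the same slicing could be wrapped in a function and applied to several headers, skipping any that raise ValueError. The function name title_from_header and the sample list are only illustrative; the malformed entry is invented to show the error path.

```python
MARKER = " biosynthetic gene cluster"


def title_from_header(header):
    """Return 'id|word' using str.index/rindex only; raises ValueError on bad input."""
    seq_id = header[:header.index("|") + 1]          # id plus the "|"
    end = header.index(MARKER)                       # start of the marker phrase
    word = header[header[:end].rindex(" ") + 1:end]  # last word before it
    return seq_id + word


headers = [
    "29122977| Streptomyces cinnamonensis polyether antibiotic monensin "
    "biosynthetic gene cluster, partial sequence",
    "12345| a made-up header without the marker phrase",  # hypothetical bad line
]

for h in headers:
    try:
        print(title_from_header(h))
    except ValueError:
        print("skipped:", h)
```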

akim · Accepted Answer · 2018-11-15 20:26:17Z

0

You can use Python's str.split() method to get the numbers until the pipe delimiter.

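To grab the word that comes right before a given string, you'll probably want a regex with a positive lookahead, (?=...), which requires the string to follow without making it part of the match.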
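
A minimal sketch of the lookahead idea, assuming every header contains the literal phrase " biosynthetic gene cluster". The names WORD_BEFORE and shorten are invented for this example.

```python
import re

# \w+ matches one word; (?= biosynthetic gene cluster) is a lookahead that
# requires the phrase to follow but leaves it out of the match.
WORD_BEFORE = re.compile(r"\w+(?= biosynthetic gene cluster)")


def shorten(header):
    """Return 'id|word', or None if the phrase is not found."""
    seq_id = header.split("|")[0]        # the numbers before the pipe
    match = WORD_BEFORE.search(header)
    if match is None:
        return None
    return f"{seq_id}|{match.group(0)}"


print(shorten(
    "345134845| Streptomyces sp. SN-593 DNA, reveromycin biosynthetic "
    "gene cluster, complete sequence"
))
```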

vezunchik · Accepted Answer · 2018-11-15 20:29:41Z

0

Try a regex: reg = re.match(r'(\d+)\|.* (\w+) biosynthetic gene cluster', txt). You can then use reg.group(1) for the id and reg.group(2) for the word.
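
Here is a runnable sketch of that pattern, plus one possible variant for names such as "curacin A". The variant is an assumption, not part of the answer above: it treats a single capital letter directly before "biosynthetic" as part of the name, and it uses a lazy .*? so the optional letter is not swallowed by the greedy prefix. PATTERN, PATTERN_SUFFIX and extract are illustrative names.

```python
import re

# The pattern from the answer: group 1 is the id, group 2 the last word.
PATTERN = re.compile(r"(\d+)\|.* (\w+) biosynthetic gene cluster")

# Assumed variant: also keep one trailing capital letter ("curacin A").
PATTERN_SUFFIX = re.compile(
    r"(\d+)\|.*? (\w+(?: [A-Z])?) biosynthetic gene cluster"
)


def extract(header, pattern=PATTERN):
    """Return 'id|name', or None if the header does not match."""
    m = pattern.match(header)
    if m is None:
        return None
    return f"{m.group(1)}|{m.group(2)}"


h = ("257129259| Moorea producens 19L curacin A biosynthetic gene cluster, "
     "partial sequence")
print(extract(h))                  # keeps only the single letter
print(extract(h, PATTERN_SUFFIX))  # keeps the word and the letter
```

Whether the single-letter rule fits the real data (more than 200 headers) would need checking against the file itself.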

Or re.findall()
– soundstripe, commented over a year ago
