0

I have a fasta file with headers like this:

612407518| Streptomyces sp. MJ635-86F5 DNA, cremimycin biosynthetic gene cluster, complete sequence
84617315| Streptomyces achromogenes subsp. rubradiris complete rubradirin biosynthetic gene cluster, strain NRRL 3061
345134845| Streptomyces sp. SN-593 DNA, reveromycin biosynthetic gene cluster, complete sequence
323700993| Streptomyces autulyticus strain CGMCC 0516 geldanamycin polyketide biosynthetic gene cluster, complete sequence
15823967| Streptomyces avermitilis oligomycin biosynthetic gene cluster
1408941746| Streptomyces sp. strain OUC6819 rdm biosynthetic gene cluster, complete sequence
315937014| Uncultured organism CA37 glycopeptide biosynthetic gene cluster, complete sequence
29122977| Streptomyces cinnamonensis polyether antibiotic monensin biosynthetic gene cluster, partial sequence
257129259| Moorea producens 19L curacin A biosynthetic gene cluster, partial sequence
166159347| Streptomyces sahachiroi azinomycin B biosynthetic gene cluster, partial sequence

I want to keep only the word right before "biosynthetic gene cluster" in each header description, so that the results look like this:

 612407518|cremimycin
 84617315|rubradirin
 345134845|reveromycin
 323700993|polyketide
 15823967|oligomycin
 1408941746|rdm
 315937014|glycopeptide
 29122977|monensin
 257129259|curacin A
 166159347|azinomycin B

Here's what I've tried on my original files with more than 200 headers:

with open("test.txt") as f:
    for line in f:
        (id, name) = line.strip().split('|')
        term_list = name.split()
        term_index = term_list.index('biosynthetic') 

        term = term_list[int(term_index)-1]

        header = id + '|' + term
        print(header)

The result is good, although the last two headers in my example above yield this:

257129259|A
166159347|B

I'll work on the 2nd problem because my original data contain lots of these.

Thank you all for the comments.

3
  • What have you tried so far? Commented Nov 15, 2018 at 20:04
  • 2
    Do you need to use regex? Commented Nov 15, 2018 at 20:11
  • I think a full regex would be like this: (\d*\|)(?:.*)\s(\w+)\s(?=bio), and then group 1 is the number| and group 2 is the word Commented Nov 15, 2018 at 20:17

4 Answers

2

A simpler solution than regex would be:

  1. Split the string on "|", assigning the two components to variables id and s.
  2. Split s into words.
  3. Find the position of "biosynthetic" in the resulting list.
  4. Verify that it is followed by "gene" and "cluster".
  5. Print id followed by the word preceding "biosynthetic".

I've deliberately not written the code. If you try it and edit your attempt into the question, others will probably respond telling you how to get it working (assuming you can't do that on your own).

Good luck!


0

An answer that does not use regex. It raises ValueError if the header is not in the expected format (i.e. it must always contain "biosynthetic gene cluster", always use | to delimit the id, and always have a space before the desired word).

id = header[:header.index("|")+1] 
end = header.index(" biosynthetic gene cluster")
word = header[header[:end].rindex(" ")+1:end]
new_title = id + word
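
As a minimal sketch, the slicing above can be wrapped in a function and run over a few of the sample headers. The function name shorten_header and the marker variable are hypothetical, and unlike the snippet above this version strips whitespace around the id. A header without a pipe or without the marker still raises ValueError.

```python
def shorten_header(header):
    """Return 'id|word', where word directly precedes the marker."""
    marker = " biosynthetic gene cluster"
    pipe = header.index("|")                   # ValueError if there is no pipe
    end = header.index(marker, pipe)           # ValueError if the marker is missing
    start = header.rindex(" ", pipe, end) + 1  # character after the last space before the word
    return header[:pipe].strip() + "|" + header[start:end]


headers = [
    "612407518| Streptomyces sp. MJ635-86F5 DNA, cremimycin biosynthetic gene cluster, complete sequence",
    "15823967| Streptomyces avermitilis oligomycin biosynthetic gene cluster",
    "257129259| Moorea producens 19L curacin A biosynthetic gene cluster, partial sequence",
]
for h in headers:
    print(shorten_header(h))
```

Like the attempt in the question, this keeps only one word, so the curacin A header comes out as 257129259|A.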


0

You can use Python's str.split() method to get the number before the pipe delimiter.

To grab the word that comes right before a given string, you'll probably want a regex with a positive lookahead.
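
A minimal sketch of that combination, assuming every header contains exactly the text " biosynthetic gene cluster". The function name word_before_marker is hypothetical. The lookahead (?= ...) requires the marker to follow the word without including it in the match.

```python
import re


def word_before_marker(header):
    """Return 'id|word', or None when the marker is not found."""
    # maxsplit=1 separates the id from the rest of the description.
    seq_id, description = header.split("|", 1)
    # \w+ matches a word; the lookahead checks that the marker comes next.
    match = re.search(r"\w+(?= biosynthetic gene cluster)", description)
    if match is None:
        return None
    return seq_id.strip() + "|" + match.group(0)


header = ("84617315| Streptomyces achromogenes subsp. rubradiris complete "
          "rubradirin biosynthetic gene cluster, strain NRRL 3061")
print(word_before_marker(header))
```

Returning None for a header without the marker is a design choice of this sketch; raising an exception would work just as well.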


0

Try a regular expression: reg = re.match(r'(\d+)\|.* (\w+) biosynthetic gene cluster', txt). Then reg.group(1) holds the id and reg.group(2) holds the word (remember to import re first).
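
A runnable sketch built on that pattern. The names PATTERN and shorten are hypothetical. The lazy .*? and the optional single capital letter after the word are assumptions added here, not part of the pattern above, so that names like "curacin A" from the question are kept whole. They may need adjusting for other naming schemes.

```python
import re

# Group 1: the numeric id. Group 2: the word before the marker, optionally
# followed by one capital letter (an assumption covering names like "curacin A").
PATTERN = re.compile(r"(\d+)\|.*? (\w+(?: [A-Z])?) biosynthetic gene cluster")


def shorten(header):
    """Return 'id|name', or None when the header does not match."""
    m = PATTERN.match(header)
    if m is None:
        return None
    return m.group(1) + "|" + m.group(2)


headers = [
    "323700993| Streptomyces autulyticus strain CGMCC 0516 geldanamycin polyketide biosynthetic gene cluster, complete sequence",
    "1408941746| Streptomyces sp. strain OUC6819 rdm biosynthetic gene cluster, complete sequence",
    "257129259| Moorea producens 19L curacin A biosynthetic gene cluster, partial sequence",
    "166159347| Streptomyces sahachiroi azinomycin B biosynthetic gene cluster, partial sequence",
]
for h in headers:
    print(shorten(h))
```

On these four headers the sketch yields polyketide, rdm, curacin A and azinomycin B after their respective ids, which matches the desired output in the question.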

1 Comment

Or re.findall()
