
I want to split a string containing numbers, letters, and varying amounts of whitespace into specific components.

Consider the string

ATLANTYS2_I          -           3103 aRNH_profile         -            121   2.7e-35  118.7   0.0   1   1   2.7e-37   5.6e-35  117.7   0.0     2   120  1342  1458  1341  1459 0.98 Gypsy      Arabidopsis thaliana_+1

Now let the string be content[3]. I ran the following code:

import re 
result = re.split(r'\s{2,}', content[3])

which gave me

['ATLANTYS2_I',
 '-',
 '3103 aRNH_profile',
 '-',
 '121',
 '2.7e-35',
 '118.7',
 '0.0',
 '1',
 '1',
 '2.7e-37',
 '5.6e-35',
 '117.7',
 '0.0',
 '2',
 '120',
 '1342',
 '1458',
 '1341',
 '1459 0.98 Gypsy\tArabidopsis thaliana_+1']

I split the string on runs of two or more whitespace characters, but the last entry, '1459 0.98 Gypsy\tArabidopsis thaliana_+1', is still grouped as one. I thought of removing that last entry from the result, splitting it on single spaces, and appending the pieces back. However, this seems rather clunky to me.

Is there a way to split this elegantly so that I get the following result for the last entry: '1459', '0.98', 'Gypsy\tArabidopsis thaliana_+1'?

  • I think you need to split the last entry separately even if it means more code. Better to write explicit code than an "elegant" one-liner that you won't understand in a month. Commented Jan 15, 2018 at 10:14
  • I agree. How would you split the last entry so that I get the desired result? Commented Jan 15, 2018 at 10:20
  • That's a string you have defined, and you cannot access it like a list. Refer to this link for a better understanding: docs.python.org/2/library/string.html Commented Jan 15, 2018 at 10:27

2 Answers

You could use an alternation:

\s{2,}|\t+
# either two or more whitespace characters
# or at least one tab character


In Python:

import re

string = "ATLANTYS2_I          -           3103 aRNH_profile         -            121   2.7e-35  118.7   0.0   1   1   2.7e-37   5.6e-35  117.7   0.0     2   120  1342  1458  1341  1459 0.98 Gypsy\tArabidopsis thaliana_+1"

rx = re.compile(r'\s{2,}|\t+')
print(rx.split(string))

This yields:

['ATLANTYS2_I', '-', '3103 aRNH_profile', '-', '121', '2.7e-35', '118.7', '0.0', '1', '1', '2.7e-37', '5.6e-35', '117.7', '0.0', '2', '120', '1342', '1458', '1341', '1459 0.98 Gypsy', 'Arabidopsis thaliana_+1']

1 Comment

'1459 0.98 Gypsy' must be '1459', '0.98', 'Gypsy'

You can process the last element separately:

last_element = result.pop()  # remove last element from list
numbers, plant = last_element.split('\t')  # split on tab
result += numbers.split()  # split the first part on spaces and add it back
result.append(plant)  # add the second part back

Alternatively, you could probably use a regex to split that last element correctly.
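As a rough sketch of that regex idea (not tested against other lines of the file): limiting the number of splits with maxsplit keeps the space inside the species name intact. The variable names below are only illustrative, and the sketch assumes the last element always has the shape 'number number label<TAB>description'.

```python
import re

# Last element produced by re.split(r'\s{2,}', content[3]) in the question.
last_element = '1459 0.98 Gypsy\tArabidopsis thaliana_+1'

# Split on the first three whitespace runs only (space or tab), so the
# single space inside "Arabidopsis thaliana_+1" is left alone.
parts = re.split(r'\s+', last_element, maxsplit=3)
print(parts)

# Variant matching the output literally requested in the question:
# split on the first two spaces and keep the tab-joined tail together.
parts_keep_tab = last_element.split(' ', 2)
print(parts_keep_tab)
```

Either list can then replace the last element of result, for example with result[-1:] = parts.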
