This is a follow-up to a previous question of mine. I have now identified the problem more clearly and would appreciate some further suggestions :)
I have a string, resulting from some machine learning algorithm, which generally has the following structure:
- at the beginning and at the end, there can be some lines containing no characters other than whitespace;
- in between, there should be 2 lines, each containing a name (only the surname, or name and surname, or the first initial plus the surname...), followed by some numbers and (sometimes) other characters mixed in between the numbers;
- one of the names is generally preceded by a special, non-alphanumeric character (>, >>, @, ...).
Something like this:
Connery 3 5 7 @ 4
>> R. Moore 4 5 67| 5 [
I need to extract the 2 names and the numbers, and check whether one of the lines starts with the special character, so my output should be:
name_01 = 'Connery'
digits_01 = [3, 5, 7, 4]
name_02 = 'R. Moore'
digits_02 = [4, 5, 67, 5]
selected_line = 2  # anything indicating that it's the second line
In the linked original question, it was suggested that I use:
import re

inp = '''Connery 3 5 7 @ 4
>> R. Moore 4 5 67| 5 ['''
lines = inp.split('\n')
for line in lines:
    # \w+ grabs every run of word characters, so punctuation is dropped
    matches = re.findall(r'\w+', line)
    print(matches)
which produces a result pretty close to what I want:
['Connery', '3', '5', '7', '4']
['R', 'Moore', '4', '5', '67', '5']
But I need the first two strings in the second line ('R', 'Moore') to be grouped together, dot included (basically, group together all the characters before the digits begin). It also does not detect the special character at the start of the line. Should I somehow fix this output, or can I tackle the problem in a different way altogether?
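To make the target concrete, here is a minimal sketch of one different way to tackle it. It rests on assumptions not confirmed above: the name is everything before the first digit on the line, and the special character is any leading run of characters that are neither alphanumeric nor whitespace. The helper names parse_line and parse_text are hypothetical, chosen only for this illustration.

```python
import re

def parse_line(line):
    """Split one line into (name, digits, has_marker).

    Assumptions: the marker is a leading run of non-alphanumeric,
    non-whitespace characters, and the name ends at the first digit.
    """
    stripped = line.strip()
    marker_match = re.match(r'[^\w\s]+', stripped)
    marker = marker_match.group(0) if marker_match else ''
    body = stripped[len(marker):].strip()
    # Everything up to (not including) the first digit is the name.
    name_match = re.match(r'[^\d]*', body)
    name = name_match.group(0).strip()
    # Every run of digits after the name becomes one integer.
    digits = [int(d) for d in re.findall(r'\d+', body[name_match.end():])]
    return name, digits, bool(marker)

def parse_text(text):
    """Parse all non-blank lines; return (parsed_lines, selected_line).

    selected_line is the 1-based index of the first line that starts
    with a marker, or None if no line does.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    parsed = [parse_line(ln) for ln in lines]
    selected_line = next(
        (i for i, (_, _, marked) in enumerate(parsed, start=1) if marked),
        None,
    )
    return parsed, selected_line

inp = '''
   
Connery 3 5 7 @ 4
>> R. Moore 4 5 67| 5 [

'''
parsed, selected_line = parse_text(inp)
(name_01, digits_01, _), (name_02, digits_02, _) = parsed
print(name_01, digits_01)
print(name_02, digits_02)
print(selected_line)
```

Note that this sketch would break for a name that itself contains a digit, and it treats an underscore as part of the name rather than as a marker, since `\w` matches underscores.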