Python, parse multiple line string extracting characters and digits substring

Question

This is a follow-up to a previous question of mine. I have now identified the problem more clearly and would like some further suggestions :)

I have a string, resulting from some machine learning algorithm, which generally has the following structure:

at the beginning and at the end, there can be some lines not containing any characters (except for whitespaces);
in between, there should be 2 lines, each containing a name (either only the surname, or name and surname, or the initial letter from the name plus the surname...), followed by some numbers and (sometimes) other characters mixed in between the numbers;
one of the names is generally preceded by a special, non-alphanumeric character (>, >>, @, ...).

Something like this:

Connery  3 5 7 @  4
>> R. Moore 4 5 67| 5 [

I need to extract the 2 names and the numeric characters, and check whether one of the lines starts with the special character, so my output should be:

name_01 = 'Connery'
digits_01 = [3, 5, 7, 4]
name_02 = 'R. Moore'
digits_02 = [4, 5, 67, 5]
selected_line = 2  # (anything indicating that it's the second line)

In the linked original question, I was advised to use:

import re

inp = '''Connery  3 5 7 @  4
    >> R. Moore 4 5 67| 5 ['''
lines = inp.split('\n')
for line in lines:
    matches = re.findall(r'\w+', line)
    print(matches)

which produces a result pretty close to what I want:

['Connery', '3', '5', '7', '4']
['R', 'Moore', '4', '5', '67', '5']

However, I need the first two strings in the second line ('R', 'Moore') to be grouped together (basically, everything before the digits begin should form one group). It also skips the special character entirely. Should I somehow fix this output, or can I tackle the problem in a different way altogether?

Amadan · Accepted Answer · 2021-10-26 12:32:42Z

1

This is better done in several steps.

import re

# `inp` is the string from the question
# strip blank lines and whitespace at the start and end, then split into lines
lines = inp.strip().split('\n')
for line in lines:
    # for each line, identify the selection mark, the name, and the mess at the end
    # assuming names can't have numbers in them
    match = re.match(r'^(\W+)?([^\d]+?)\s*([^a-zA-Z]+)$', line.strip())
    if match:
        selected_raw, name, numbers_raw = match.groups()
        # now parse the unprocessed bits
        selected = selected_raw is not None  # True if the line starts with a marker
        numbers = re.findall(r'\d+', numbers_raw)  # digit runs, still as strings
        print(selected, name, numbers)

# output
False Connery ['3', '5', '7', '4']
True R. Moore ['4', '5', '67', '5']
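
The question asks for the digits as integers and for an indication of which line carries the marker. A minimal sketch of how the steps above could be wrapped up to produce that, assuming the same regex as in this answer. The names `parse_block` and `LINE_RE` are illustrative and not part of the original answer, and `selected_line` is counted among the non-blank lines only:

```python
import re

# Same pattern as in the answer above; names are assumed to contain no digits.
LINE_RE = re.compile(r'^(\W+)?([^\d]+?)\s*([^a-zA-Z]+)$')

def parse_block(text):
    """Return (names, digits, selected_line) for a multi-line string.

    selected_line is the 1-based index (among non-blank lines) of the first
    line that starts with a non-alphanumeric marker, or None if there is none.
    """
    names, digits, selected_line = [], [], None
    # Drop blank lines wherever they occur and trim each remaining line
    rows = [row.strip() for row in text.splitlines() if row.strip()]
    for index, row in enumerate(rows, start=1):
        match = LINE_RE.match(row)
        if not match:
            continue  # line does not look like "name followed by numbers"
        marker, name, numbers_raw = match.groups()
        names.append(name)
        digits.append([int(n) for n in re.findall(r'\d+', numbers_raw)])
        if marker is not None and selected_line is None:
            selected_line = index
    return names, digits, selected_line

inp = '''
Connery  3 5 7 @  4
    >> R. Moore 4 5 67| 5 [
'''
names, digits, selected_line = parse_block(inp)
print(names, digits, selected_line)
# prints ['Connery', 'R. Moore'] [[3, 5, 7, 4], [4, 5, 67, 5]] 2
```

Returning lists rather than `name_01`, `name_02` style variables keeps the sketch usable if the input ever has more or fewer than two lines.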



mozway · Accepted Answer · 2021-10-26 12:24:10Z

1

I am not sure which characters you expect, or which you want to keep or remove, but something like the following should work for the example:

import re

inp = '''Connery  3 5 7 @  4
    >> R. Moore 4 5 67| 5 ['''
lines = inp.split('\n')
for line in lines:
    # first alternative: a name made of letters and dots, with spaces allowed inside
    matches = re.findall(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+', line)
    print(matches)

output:

['Connery', '3', '5', '7', '4']
['R. Moore', '4', '5', '67', '5']

NB. I included a-z (lower and upper case) and the dot, with optional spaces in the middle: [a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.], but you should adapt the character classes to your real data.
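
The token list still mixes the name with the numbers as strings. A minimal sketch of one way to separate them, assuming the name always comes before the first number on a line. The helper `split_tokens` and the constant `TOKEN_RE` are hypothetical names introduced here for illustration:

```python
import re

# Pattern from the answer above: letters/dots with inner spaces, or a plain \w+ token
TOKEN_RE = re.compile(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+')

def split_tokens(line):
    """Split one line into (name, digits), assuming the name comes first."""
    tokens = TOKEN_RE.findall(line)
    # Leading tokens that are not purely decimal digits are treated as the name
    name_parts = []
    while tokens and not tokens[0].isdecimal():
        name_parts.append(tokens.pop(0))
    # Of the remaining tokens, keep only the numeric ones and convert them
    digits = [int(t) for t in tokens if t.isdecimal()]
    return ' '.join(name_parts), digits

print(split_tokens('Connery  3 5 7 @  4'))
# prints ('Connery', [3, 5, 7, 4])
print(split_tokens('    >> R. Moore 4 5 67| 5 ['))
# prints ('R. Moore', [4, 5, 67, 5])
```

Very short names (one or two letters) do not satisfy the first alternative and are matched by `\w+` instead, which is why the loop collects every leading non-numeric token rather than only the first one.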



user8563312 · Accepted Answer · 2021-10-26 12:28:30Z

1

This also captures the special characters. Keep in mind that they are hardcoded, so you have to add any missing ones to the [>@]+ part of the regex.

import re

# `lines` is the list built from `inp` in the question
for line in lines:
    matches = re.findall(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+|[>@]+', line)
    print(matches)
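
Because `[>@]+` matches anywhere in the line, a stray `@` between the numbers (as in the first example line) ends up in the token list too. A minimal sketch of how the selected line could be told apart, assuming the marker only counts when it is the first token. `is_selected`, `TOKEN_RE` and `MARKER_RE` are hypothetical names added for illustration:

```python
import re

# Pattern from the answer above, plus the same hardcoded marker set on its own
TOKEN_RE = re.compile(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+|[>@]+')
MARKER_RE = re.compile(r'[>@]+')

def is_selected(line):
    """True if the first token on the line is a marker such as '>>' or '@'."""
    tokens = TOKEN_RE.findall(line)
    return bool(tokens) and MARKER_RE.fullmatch(tokens[0]) is not None

lines = ['Connery  3 5 7 @  4', '    >> R. Moore 4 5 67| 5 [']
print(TOKEN_RE.findall(lines[0]))
# prints ['Connery', '3', '5', '7', '@', '4']
selected = [n for n, line in enumerate(lines, start=1) if is_selected(line)]
print(selected)
# prints [2]
```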

