python regex text extraction

Question

The text input looks something like this: West Team 4, Eastern 3\n

-------Update--------

The input is a txt file containing team names and scores, like football results. Each line has two names and two scores, so the whole file looks something like this:

West Team 4, Eastern 5
Nott Team 2, Eastern 3
West wood 1, Eathan 2
West Team 4, Eas 5

I am using with open to read the file line by line, so there will be a \n at the end of each line.

I would like to extract each line of text into something like:

['West Team', 'Eastern']

What I currently have in mind is to use a regex:

result = re.sub("[\n^\s$\d]", "", text).split(",")

This code results in:

['WestTeam','Eastern']

I'm sure that my regex is not correct. I want to remove '\n' and any number, including the space in front of the number, but not the space in the middle of the name.

Open to any suggestion that achieves this result; it doesn't necessarily have to use regex.

You really need to define the "rules" that describe your input and output data. Your input looks as though it may be comma-delimited, where each token (split by comma) ends with a number that you want to remove. If that's the case you really don't need RE.
– jackal, Commented Feb 9, 2022 at 10:35
Have you checked the solutions here? One of them does not require regex and seems to be just what you want, unless you want to clarify the requirements. Or do you want something like re.findall(r',?\s*(\D*[^\d\s])', text)?
– Wiktor Stribiżew, Commented Feb 10, 2022 at 10:39

JvdV · Accepted Answer · 2022-02-09 11:08:30Z

1

So many ways this can be done, but looking at your data you could use rstrip() quite nicely:

s = 'West Team 4, Eastern 3\n'
lst = [x.rstrip('\n 0123456789') for x in s.split(', ')]
print(lst)

Or maybe rather use:

from string import digits
s = 'West Team 4, Eastern 3\n'
lst = [x.rstrip(digits+'\n ') for x in s.split(', ')]
print(lst)

Both options print:

['West Team', 'Eastern']
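
For the whole file described in the question, the same rstrip() idea could be applied to every line. This is only a minimal sketch, assuming each line holds one "name score, name score" pair separated by ', '. The helper name extract_names and the file name scores.txt are hypothetical, and io.StringIO stands in for the real file:

```python
import io
from string import digits

def extract_names(line):
    """Split one 'Name 4, Other 5' line on ', ' and trim the trailing score."""
    return [part.rstrip(digits + '\n ') for part in line.split(', ')]

# io.StringIO stands in for open('scores.txt') (hypothetical file name);
# the content is the sample from the question.
sample = io.StringIO(
    "West Team 4, Eastern 5\n"
    "Nott Team 2, Eastern 3\n"
    "West wood 1, Eathan 2\n"
    "West Team 4, Eas 5\n"
)

# Skip blank lines so an empty trailing line does not produce [''].
names = [extract_names(line) for line in sample if line.strip()]
for pair in names:
    print(pair)
```

With a real file, the StringIO object would be replaced by the handle from with open(...). Note that rstrip() would also eat digits at the end of a team name that has no score after it.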

edited Feb 9, 2022 at 11:08

answered Feb 9, 2022 at 11:03

JvdV

Wiktor Stribiżew · Accepted Answer · 2022-02-10 10:46:20Z

1

You can use a non-regex approach to keep any letters/spaces after splitting with a comma:

text = "West Team 4, Eastern 3\n"
print( ["".join(c for c in x if c.isalpha() or c.isspace()).strip() for x in text.split(',')]  )
# => ['West Team', 'Eastern']

Or a regex approach to remove any chars other than ASCII letters and spaces matched with the [^a-zA-Z\s]+ pattern:

import re
rx = re.compile(r'[^a-zA-Z\s]+')
print( [rx.sub("", x).strip() for x in text.split(',')]  )
# => ['West Team', 'Eastern']

Another similar solution extracts chunks of non-digit chars that end with a char that is neither a digit nor whitespace, after an optional comma and whitespace:

print(re.findall(r',?\s*(\D*[^\d\s])', text))

In case there are consecutive non-letter chunks you can use

import re
text = "West Team 4, Eastern 3\n, test 23 99 test"
rx = re.compile(r'[^\W\d_]+')
print( [" ".join(rx.findall(x)) for x in text.split(',')]  )

See the Python demo yielding ['West Team', 'Eastern', 'test test']. The [^\W\d_]+ pattern matches any one or more Unicode letters.
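
The difference between the ASCII-only [^a-zA-Z\s]+ pattern and the Unicode-aware [^\W\d_]+ pattern only shows up with non-ASCII names. A small sketch to illustrate, using a made-up input that is not from the question:

```python
import re

# Made-up sample with non-ASCII letters (written with escapes: "São Paulo", "Köln").
text = "S\u00e3o Paulo 2, K\u00f6ln 1\n"

ascii_rx = re.compile(r'[^a-zA-Z\s]+')   # removes everything except ASCII letters and whitespace
letters_rx = re.compile(r'[^\W\d_]+')    # matches runs of Unicode letters

ascii_result = [ascii_rx.sub("", x).strip() for x in text.split(',')]
unicode_result = [" ".join(letters_rx.findall(x)) for x in text.split(',')]

print(ascii_result)    # accented letters are dropped along with the digits
print(unicode_result)  # accented letters are kept
```

If team names can contain accented or non-Latin letters, the second pattern is probably the safer choice.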

edited Feb 10, 2022 at 10:46

answered Feb 9, 2022 at 10:32

Wiktor Stribiżew

1 Comment

hc_dev Over a year ago

This assumes the input is a CSV. It splits into separate values and treats each of those strings with a cleaning: (a) filter only alpha and space characters (excludes numbers), then (b) trim or strip-off the leading/trailing whitespaces.

Tim Biegeleisen · Accepted Answer · 2022-02-09 10:31:16Z

0

Actually re.findall might work well here:

import re

inp = "West Team 4, Eastern 3\n"
# one or more space-separated words, followed by a space and the score
matches = re.findall(r'(\w+(?: \w+)*) \d+', inp)
print(matches)  # ['West Team', 'Eastern']

The split version, using re.split:

inp = "West Team 4, Eastern 3\n"
matches = [x for x in re.split(r'\s+\d+\s*,?\s*', inp) if x != '']
print(matches)  # ['West Team', 'Eastern']
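
As a quick check of the findall pattern against more than one line, here is a sketch that runs it over a few lines. The first two come from the question; the line with two-digit scores is made up for illustration, and the name name_rx is hypothetical:

```python
import re

# Same findall pattern as above, compiled once: words separated by single
# spaces, followed by a space and the score.
name_rx = re.compile(r'(\w+(?: \w+)*) \d+')

lines = [
    "West Team 4, Eastern 5\n",
    "West wood 1, Eathan 2\n",
    "West Team 10, Eas 12\n",  # made-up line with two-digit scores
]

results = [name_rx.findall(line) for line in lines]
print(results)
```

Because \w also matches digits, the pattern relies on backtracking to leave the final number out of the captured name.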

answered Feb 9, 2022 at 10:31

Tim Biegeleisen

Dolan · Accepted Answer · 2022-02-09 10:33:03Z

0

import re

text = 'West Team 4, Eastern 3\n'

# Remove newlines and digits; a raw string avoids the invalid "\d" escape warning.
result = re.sub(r"[\n\d]", "", text).split(",")

# REMOVE THE LEADING AND TRAILING SPACES:
result = [x.strip() for x in result]
print(result)
# result: ['West Team', 'Eastern']
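
A note on the character class from the question: inside [...], $ is a literal dollar sign, and ^ is a literal caret unless it is the first character of the class. Neither acts as an anchor there, so they make no difference for this input. A small sketch to illustrate (the "a^b$c1" string is a made-up test input):

```python
import re

# Inside [...], '$' is a literal dollar sign and '^' is a literal caret
# unless it is the first character of the class.
demo = re.sub(r"[\n^$\d]", "", "a^b$c1\n")
print(repr(demo))  # the caret, dollar sign, digit and newline are all removed

text = 'West Team 4, Eastern 3\n'
with_anchors = re.sub(r"[\n^$\d]", "", text)
without_anchors = re.sub(r"[\n\d]", "", text)
print(with_anchors == without_anchors)  # identical for this input
```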

answered Feb 9, 2022 at 10:33

Dolan

hc_dev · Accepted Answer · 2022-02-09 10:49:06Z

0

You want to:

remove '\n' and
any number including the space in front of the number
but not the space in the middle of the name.

Functions to use:

for constant parts you could just replace using str.replace().
for all dynamic matches we need a regex to substitute with empty-string using re.sub().
for surroundings we can even use str.strip() to remove leading and trailing whitespaces like \n.

Code

import re

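text = "West Team 4, Eastern 3\n"  # renamed so the built-in input() is not shadowed

cleaned = re.sub(r'\s+\d+', '', text)  # remove numbers (any length) with leading spaces
cleaned = cleaned.strip()  # remove surrounding whitespace like \n
print(cleaned)

output = [name.strip() for name in cleaned.split(",")]  # split on comma, trim each name
print(output)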

Prints:

West Team, Eastern
['West Team', 'Eastern']

edited Feb 9, 2022 at 10:49

answered Feb 9, 2022 at 10:32

hc_dev

1 Comment

bfontaine Over a year ago

OP also wants to split on the comma.

The fourth bird · Accepted Answer · 2022-02-09 10:50:32Z

0

You can remove the digits and replace any run of two or more whitespace chars (other than newlines) left behind with a single space.

Then split on a comma, do not keep empty values and trim the output:

import re

s = "West Team 4 , Eastern 3, test 23 99 test\n,"

res = [
    m.strip() for m in re.sub(r"[^\S\n]{2,}", " ", re.sub(r"\d+", "", s)).split(",") if m
]
print(res)

Output

['West Team', 'Eastern', 'test test']

answered Feb 9, 2022 at 10:50

The fourth bird

jackal · Accepted Answer · 2022-02-09 10:59:51Z

0

You haven't clearly defined the rules for getting the required output from your sample input. However, this will give what you've asked for but may not cover all eventualities:

in_string = 'West Team 4, Eastern 3\n'

result = [' '.join(t.split()[:-1]) for t in in_string.split(',')]

print(result)

Output:

['West Team', 'Eastern']
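
The list comprehension above drops the last whitespace-delimited token whatever it is. If only a numeric trailing token should be removed, one possible variant is sketched below. The helper name drop_trailing_score and the 'No Score' entry are made up for illustration:

```python
def drop_trailing_score(token):
    """Remove the last whitespace-separated word only if it is all digits."""
    words = token.split()
    if words and words[-1].isdigit():
        words = words[:-1]
    return ' '.join(words)

# The last entry is a made-up case without a trailing number.
in_string = 'West Team 4, Eastern 3, No Score\n'

result = [drop_trailing_score(t) for t in in_string.split(',')]
print(result)
```

Whether this guard is wanted depends on the rules for the input, which the question does not fully define.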

edited Feb 9, 2022 at 10:59

answered Feb 9, 2022 at 10:38

jackal

5 Comments

jackal Over a year ago

@JvdV because that would not produce the desired result

jackal Over a year ago

@JvdV No it doesn't. That produces ['West Team 4', 'Eastern 3']. Spot the difference

jackal Over a year ago

@JvdV No. It does not assume that. It assumes that there are strings delimited by comma and that each of those strings has an unwanted whitespace delimited token at the end of that string. You could replace '4' with 'four' and that would also be removed. Having said that, the OP hasn't fully defined the requirement which is why I've already said that this may not cover all eventualities.

jackal Over a year ago

@JvdV Well spotted. Fixed

JvdV Over a year ago

Jup you got my vote back. Nice solution!
