Python reformatting strings based on contents

Question

In a pandas dataframe I have rows with contents in the following format:

1) abc123-Target 4-ufs
2) abc123-target4-ufs
3) geo.4
4) j123T4

All of these should be simply: target 4

So far my cleaning procedure is as follows:

df["point_id"] = df["point_id"].str.lower()
df["point_id"] = df['point_id'].str.replace('^.*?(?=target)', '')

This returns:

1) target 4-ufs
2) target4-ufs
3) geo.14
4) geo.2
5) j123T4

What I believe I need is:

a. Remove anything after the last number in the string, this solves 1
b. If 'target' does not have a space after it add a space, this with the above solves 2
c. If the string ends in a point and a number of any length remove everything before the point (incl. point) and replace with 'target ', this solves 3 and 4
d. If the string ends with a 't' followed by a number of any length remove everything before 't' and replace with 'target ', this solves 5

I'm looking at regex and re but the following is not having effect (add space before the last number)

df["point_id"] = re.sub(r'\D+$', '', df["point_id"])

The fourth bird · Accepted Answer · 2021-07-14 11:34:04Z

1

Reading the rules, you might use 2 capture groups and check for the group values:

\btarget\s*(\d+)|.*[t.](\d+)$

\btarget\s*(\d+) Match target, optional whitespace chars and capture 1+ digits in group 1
| Or
.*[t.] Match 0+ characters followed by either t or a .
(\d+)$ Capture 1+ digits in group 2 at the end of the string

Regex demo | Python demo

Python example:

import re
import pandas as pd

pattern = r"\btarget\s*(\d+)|.*[t.](\d+)$"
strings = [
    "abc123-Target 4-ufs",
    "abc123-target4-ufs",
    "geo.4",
    "j123T4"
]

df = pd.DataFrame(strings, columns=["point_id"])

def change(s):
    m = re.search(pattern, s, re.IGNORECASE)
    return "target " + (m.group(2) if m.group(2) else m.group(1))

df["point_id"] = df["point_id"].apply(change)
print(df)

Output

   point_id
0  target 4
1  target 4
2  target 4
3  target 4

edited Jul 14, 2021 at 11:34

answered Jul 14, 2021 at 11:14

The fourth bird

165k16 gold badges61 silver badges75 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Wiktor Stribiżew · Accepted Answer · 2021-07-14 12:37:53Z

1

You can use

df = pd.DataFrame({'point_id':['abc123-Target 4-ufs','abc123-target4-ufs','geo.4','j123T4']})
df['point_id'] = df['point_id'].str.replace(r'(?i).*Target\s*(\d+).*', r'target \1', regex=True)
df.loc[df['point_id'].str.contains(r'(?i)\w[.t]\d+$'), 'point_id'] = 'target 4'
#    point_id
# 0  target 4
# 1  target 4
# 2  target 4
# 3  target 4

The regex is (?i)Target\s*\d+|\w+[.t]\d+$:

(?i) - case insensitive matching
.* - any 0+ chars other than line break chars, as many as possible
Target\s*(\d+).* - Target, zero or more whitespaces, and one or more digits captured into Group 1
.* - any 0+ chars other than line break chars, as many as possible

The second regex matches

(?i) - case insensitive matching
\w - a word char, then
[.t] - a . or t and then
\d+$ - one or more digits at the end of string.

The second regex is used as a mask, and the values in the point_id column are set to target 4 whenever the pattern matches the regex.

See regex #1 demo and regex #2 demo.

edited Jul 14, 2021 at 12:37

answered Jul 14, 2021 at 11:29

Wiktor Stribiżew

631k41 gold badges502 silver badges632 bronze badges

2 Comments

Spatial Digger Over a year ago

Unfortunately this answer over writes the target number with target 4 which was only an example. The answer by @The fourth bird gives a complete answer. But thank you for the explanation, appreciated.

Wiktor Stribiżew Over a year ago

@SpatialDigger I updated the answer. Solved in two lines of code.

Collectives™ on Stack Overflow

Python reformatting strings based on contents

2 Answers 2

Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related