0

In a pandas dataframe I have rows with contents in the following format:

1) abc123-Target 4-ufs
2) abc123-target4-ufs
3) geo.4
4) j123T4

All of these should be simply: target 4

So far my cleaning procedure is as follows:

df["point_id"] = df["point_id"].str.lower()
df["point_id"] = df['point_id'].str.replace('^.*?(?=target)', '')

This returns:

1) target 4-ufs
2) target4-ufs
3) geo.14
4) geo.2
5) j123T4

What I believe I need is:

a. Remove anything after the last number in the string, this solves 1
b. If 'target' does not have a space after it add a space, this with the above solves 2
c. If the string ends in a point and a number of any length remove everything before the point (incl. point) and replace with 'target ', this solves 3 and 4
d. If the string ends with a 't' followed by a number of any length remove everything before 't' and replace with 'target ', this solves 5

I'm looking at regex and re but the following is not having effect (add space before the last number)

df["point_id"] = re.sub(r'\D+$', '', df["point_id"])

2 Answers 2

1

Reading the rules, you might use 2 capture groups and check for the group values:

\btarget\s*(\d+)|.*[t.](\d+)$
  • \btarget\s*(\d+) Match target, optional whitespace chars and capture 1+ digits in group 1
  • | Or
  • .*[t.] Match 0+ characters followed by either t or a .
  • (\d+)$ Capture 1+ digits in group 2 at the end of the string

Regex demo | Python demo

Python example:

import re
import pandas as pd

pattern = r"\btarget\s*(\d+)|.*[t.](\d+)$"
strings = [
    "abc123-Target 4-ufs",
    "abc123-target4-ufs",
    "geo.4",
    "j123T4"
]

df = pd.DataFrame(strings, columns=["point_id"])

def change(s):
    m = re.search(pattern, s, re.IGNORECASE)
    return "target " + (m.group(2) if m.group(2) else m.group(1))

df["point_id"] = df["point_id"].apply(change)
print(df)

Output

   point_id
0  target 4
1  target 4
2  target 4
3  target 4
Sign up to request clarification or add additional context in comments.

Comments

1

You can use

df = pd.DataFrame({'point_id':['abc123-Target 4-ufs','abc123-target4-ufs','geo.4','j123T4']})
df['point_id'] = df['point_id'].str.replace(r'(?i).*Target\s*(\d+).*', r'target \1', regex=True)
df.loc[df['point_id'].str.contains(r'(?i)\w[.t]\d+$'), 'point_id'] = 'target 4'
#    point_id
# 0  target 4
# 1  target 4
# 2  target 4
# 3  target 4

The regex is (?i)Target\s*\d+|\w+[.t]\d+$:

  • (?i) - case insensitive matching
  • .* - any 0+ chars other than line break chars, as many as possible
  • Target\s*(\d+).* - Target, zero or more whitespaces, and one or more digits captured into Group 1
  • .* - any 0+ chars other than line break chars, as many as possible

The second regex matches

  • (?i) - case insensitive matching
  • \w - a word char, then
  • [.t] - a . or t and then
  • \d+$ - one or more digits at the end of string.

The second regex is used as a mask, and the values in the point_id column are set to target 4 whenever the pattern matches the regex.

See regex #1 demo and regex #2 demo.

2 Comments

Unfortunately this answer over writes the target number with target 4 which was only an example. The answer by @The fourth bird gives a complete answer. But thank you for the explanation, appreciated.
@SpatialDigger I updated the answer. Solved in two lines of code.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.