I am trying to apply a regex to the values in a dataframe.

For example, a value will be ia wt template - tdct-c15-c5.doc. The best logic I can think of is to take everything after the " - " up to the last digit in the string.

I am trying to trim it to tdct-c15-c5.

Any help would be appreciated.

  • (?<=- )[^ ]+(?=\.doc) Commented Feb 3, 2022 at 19:31
  • If every value you need to trim has the same structure as this example, I would recommend simply splitting the value on whitespace and taking the last element. Otherwise, I am not sure how a regex could help unless you give more information. Commented Feb 3, 2022 at 19:32
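
As a rough illustration of the lookaround pattern from the first comment, the sketch below applies it with Python's re module. The helper name extract_key_regex is hypothetical, and the pattern assumes the key contains no spaces and is followed by .doc; values that do not match return None.

```python
import re

# Pattern from the comment: non-space text preceded by "- " and followed by ".doc"
KEY_PATTERN = re.compile(r'(?<=- )[^ ]+(?=\.doc)')


def extract_key_regex(value):
    """Hypothetical helper: return the key, or None if the pattern does not match."""
    match = KEY_PATTERN.search(value)
    return match.group(0) if match else None


print(extract_key_regex('ia wt template - tdct-c15-c5.doc'))  # tdct-c15-c5
```

The lookbehind and lookahead keep the separator and the extension out of the match, so no capture group is needed.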

2 Answers

Components

To stay flexible, assume each input filename is made up of these chunks:

  1. a fixed extension .doc (denoting Word documents)
  2. some important key (here tdct-c15-c5)
  3. a separator: a hyphen, possibly surrounded by spaces (here " - ", with a space on each side)
  4. some prefix that does not matter for now (here ia wt template)

All four chunks appear in ia wt template - tdct-c15-c5.doc.

Decomposition steps

Chunks (1) and (3) in particular look like stable, fixed constants. So let's work with them:

  1. strip the extension (1) from the right, since it can be ignored
  2. split the remaining basename at the separator (3) into 2 parts: prefix (4) and key (2)

The last part, the key (2), is what we want to extract.

Implementation (pure Python only)

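def extract_key(filename):
    # removesuffix (Python 3.9+) drops the literal '.doc' suffix; rstrip('.doc')
    # would instead strip any trailing '.', 'd', 'o' or 'c' characters
    basename = filename.removesuffix('.doc')
    # split at the first ' - ' only; for optional spaces use
    # re.split(r' ?- ?', basename, maxsplit=1)
    prefix, key = basename.split(' - ', 1)
    return key


filename = 'ia wt template - tdct-c15-c5.doc'
print('extracted key:', extract_key(filename))

Prints:

extracted key: tdct-c15-c5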
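
A note on stripping the extension: str.rstrip('.doc') treats its argument as a set of characters rather than a suffix, so it gives the right result for tdct-c15-c5.doc only because the key ends in a digit. The sketch below shows the difference; strip_doc_suffix is a hypothetical helper that works on any Python 3 version.

```python
def strip_doc_suffix(name, suffix='.doc'):
    """Hypothetical helper: remove the suffix only if the name really ends with it."""
    return name[:-len(suffix)] if name.endswith(suffix) else name


# rstrip removes every trailing '.', 'd', 'o' and 'c', including the 'c' of 'abc'
print('abc.doc'.rstrip('.doc'))      # ab
print(strip_doc_suffix('abc.doc'))   # abc
```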

Applied to pandas

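Use the function inside apply(), as suggested by C.Nivis, on the column that holds the filenames (assumed here to be named filename). Calling apply() on the whole dataframe would pass entire columns to the function instead of single strings:

df['filename'].apply(extract_key)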
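
To store the extracted keys in a new column, a minimal sketch might look like the following. The data and the column names filename and key are assumptions made for illustration, and extract_key is restated so that the block runs on its own.

```python
import pandas as pd


def extract_key(filename):
    # Drop the literal '.doc' suffix if present (str.rstrip would strip characters instead)
    basename = filename[:-len('.doc')] if filename.endswith('.doc') else filename
    # Keep what follows the first ' - ' separator
    prefix, key = basename.split(' - ', 1)
    return key


# Hypothetical data; 'filename' and 'key' are assumed column names
df = pd.DataFrame({'filename': ['ia wt template - tdct-c15-c5.doc',
                                'other prefix - abcd-c01-c2.doc']})

# Series.apply calls extract_key once per value; no explicit loop is needed
df['key'] = df['filename'].apply(extract_key)
print(df)
```

A value without the " - " separator would raise a ValueError at the unpacking step, so unexpected inputs need handling before this is used on real data.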

2 Comments

Thanks. Say I have a dataframe column like df.filename: how do I loop through each row with your function and store the result in a new column?
@Jonnyboi, this assumed dataframe df.filename and the question of how to add a new column based on the extracted key are not part of your question. Would you like to add them there first? Please also research as instructed in How to Ask, e.g. search SO for "[pandas] new column based on other column".

I don't know whether a regex is the best option here. An apply is pretty readable:

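import pandas as pd

mystr = "ia wt template - tdct-c15-c5.doc"

# four identical rows, just to demonstrate
df = pd.DataFrame([[mystr] for i in range(4)], columns=['mystr'])

# take the last space-separated token, then drop the literal '.doc' suffix
# (removesuffix needs Python 3.9+; rstrip('.doc') would strip characters, not the suffix)
df.mystr.apply(lambda x: x.split(' ')[-1].removesuffix('.doc'))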
0    tdct-c15-c5
1    tdct-c15-c5
2    tdct-c15-c5
3    tdct-c15-c5
Name: mystr, dtype: object
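
If a regex is preferred after all, pandas offers a vectorized route through Series.str.extract. The sketch below is one possible pattern, not the only one: it assumes the key contains no spaces and that the value ends in .doc, and the sample data is hypothetical. Values that do not match come back as NaN.

```python
import pandas as pd

# Hypothetical data; the second value is included to show the no-match case
df = pd.DataFrame({'mystr': ['ia wt template - tdct-c15-c5.doc',
                             'no key here.txt']})

# Capture the run of non-space characters between "- " and a trailing ".doc"
keys = df['mystr'].str.extract(r'- (\S+)\.doc$', expand=False)
print(keys)
```

With expand=False and a single capture group, the result is a Series rather than a one-column dataframe, so it can be assigned directly to a new column.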
