Components
To stay flexible, assume your input filename(s) are made up of these chunks:
1. a fixed file extension: .doc (denoting Word files or documents)
2. some important key (here tdct-c15-c5)
3. the separator: a hyphen, possibly surrounded by spaces (here it is surrounded by spaces: " - ")
4. some prefix, which does not matter currently (here ia wt template)
All of this information is contained in ia wt template - tdct-c15-c5.doc.
Decomposition steps
Chunks (1) and (3) in particular look like stable, fixed constants.
So let's work with them:
- we can strip the extension (1) from the right and ignore it
- we can split the remaining basename at the separator (3) into two parts: prefix (4) and key (2)
The last part (2) is what we want to extract.
Implementation (pure Python only)
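import os

def extract_key(filename):
    # splitext removes only the extension; rstrip('.doc') would instead strip
    # any trailing '.', 'd', 'o' or 'c' characters, e.g. from a key ending in 'c'
    basename, _ext = os.path.splitext(filename)
    # split once at the last ' - '; for optional spaces a regex such as r' ?- ?' could be used
    prefix, key = basename.rsplit(' - ', 1)
    return key

filename = 'ia wt template - tdct-c15-c5.doc'
print('extracted key:', extract_key(filename))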
Prints:
extracted key: tdct-c15-c5
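If the spaces around the hyphen are not guaranteed, the lenient separator could be handled with a regular expression. The following is only a minimal sketch: the name extract_key_lenient is hypothetical, and it assumes the prefix itself never contains a hyphen, so that the first hyphen in the basename is always the separator.

```python
import os
import re

# Hypothetical lenient separator: a hyphen with an optional space on each side.
SEPARATOR = re.compile(r' ?- ?')

def extract_key_lenient(filename):
    # Drop the extension, whatever it is.
    basename, _ext = os.path.splitext(filename)
    # maxsplit=1 splits at the first hyphen only, so hyphens inside the key survive.
    parts = SEPARATOR.split(basename, maxsplit=1)
    if len(parts) != 2:
        raise ValueError('no separator found in %r' % filename)
    prefix, key = parts
    return key

for name in ('ia wt template - tdct-c15-c5.doc', 'ia wt template-tdct-c15-c5.doc'):
    print(extract_key_lenient(name))
```

Raising ValueError for a filename without a separator is a design choice of this sketch; returning None may suit a pandas column with irregular entries better.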
Applied to pandas
Use the function inside apply() on the column that holds the filenames, as suggested by C.Nivis:
df['filename'].apply(extract_key)  # 'filename' is an assumed column name
Alternatively, a single regular expression with lookarounds can match the key directly:
(?<=- )[^ ]+(?=\.doc)
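A short sketch of how this pattern might be used with Python's re module. The names KEY_PATTERN and extract_key_regex are hypothetical, and the pattern assumes exactly one space after the hyphen and a key without spaces.

```python
import re

# Lookbehind for "- ", then a run of non-space characters, then a lookahead for ".doc".
KEY_PATTERN = re.compile(r'(?<=- )[^ ]+(?=\.doc)')

def extract_key_regex(filename):
    match = KEY_PATTERN.search(filename)
    # Return None when the filename does not follow the expected layout.
    return match.group(0) if match else None

print(extract_key_regex('ia wt template - tdct-c15-c5.doc'))
```

Because the lookarounds consume nothing, the match is the key alone, without the separator or the extension.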