How to partial search for words using regex python

Question

I want to find all 'xlsx' files whose names contain 'feedback report' somewhere. The filter should be robust, so variants such as 'feedback_report', 'feedback report', and 'Feedback Report' should all return true.

Example file names :

ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx
ZL-SA_feedback report_012844.xlsx
ASARanem-SA_Feedback Report_012844.xlsx

A futile attempt below.

regex = re.compile(r"[a-zA-Z0-0]*[fF][eE][eE][dD][bB][aA][cC][kK]\s[rR][eE][pP][oO][rR][tT][a-zA-Z0-0]*.xlsx")

I'm not a Python developer, but .*feedback[\s_]report.*\.xlsx seems to be sufficient with the IGNORECASE option.
– 41686d6564, Commented Aug 30, 2018 at 18:44
Yes, you are absolutely correct, and it removes many of the permutations pointed out by everyone on this thread.
– technazi, Commented Aug 30, 2018 at 18:56
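
As a quick check of the pattern suggested in the comment above, here is a minimal sketch that applies it to the example file names from the question. The names `PATTERN` and `is_feedback_report`, and the extra negative example, are illustrative assumptions and not part of the original thread.

```python
import re

# Pattern from the comment, compiled case-insensitively.
PATTERN = re.compile(r".*feedback[\s_]report.*\.xlsx", re.IGNORECASE)


def is_feedback_report(filename):
    """Return True if the whole file name matches PATTERN (hypothetical helper)."""
    # fullmatch requires the name to end with ".xlsx", not just contain it.
    return PATTERN.fullmatch(filename) is not None


names = [
    "ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx",
    "ZL-SA_feedback report_012844.xlsx",
    "ASARanem-SA_Feedback Report_012844.xlsx",
    "some report.xlsx",  # made-up negative example
]

for name in names:
    print(name, is_feedback_report(name))
```

Using `fullmatch` rather than `search` is a design choice here: it anchors the pattern at both ends, so a name such as `feedback_report.xlsx.bak` would be rejected.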

Inder · Accepted Answer · 2018-08-30 18:56:52Z

3

This will work:

re.search(r"(feedback)(.*?|\s)(report)", string, re.IGNORECASE)

Tested it on the following input list with the code

import re

a = ["ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx",
     "ZL-SA_feedback report_012844.xlsx",
     "ASARanem-SA_Feedback Report_012844.xlsx",
     "some report",
     "feedback-report"]

# Raw string so that \s reaches the regex engine unchanged
for i in a:
    print(re.search(r"(feedback)(.*?|\s)(report)", i, re.IGNORECASE))

The output, which matches what the OP expects, is:

<_sre.SRE_Match object; span=(21, 36), match='FEEDBACK_REPORT'>
<_sre.SRE_Match object; span=(6, 21), match='feedback report'>
<_sre.SRE_Match object; span=(12, 27), match='Feedback Report'>
None
<_sre.SRE_Match object; span=(0, 15), match='feedback-report'>
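
One thing to keep in mind with the pattern above is that `.*?` accepts any run of characters between the two words. The sketch below compares it with a stricter variant that allows at most one space, underscore, or hyphen. The `strict` pattern and the sample file name are assumptions made for illustration, not something proposed in the answer.

```python
import re

# Pattern from the answer above: any run of characters may sit between the words.
loose = re.compile(r"(feedback)(.*?|\s)(report)", re.IGNORECASE)

# Assumed stricter variant: at most one space, underscore, or hyphen in between.
strict = re.compile(r"feedback[\s_-]?report", re.IGNORECASE)

name = "feedback on the annual report.xlsx"  # made-up file name

loose_hit = loose.search(name) is not None
strict_hit = strict.search(name) is not None
print(loose_hit, strict_hit)
```

Which behaviour is preferable depends on how varied the real file names are.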


Woody1193 · Accepted Answer · 2018-08-30 18:46:46Z

Your regex is nearly acceptable, but the beginning and ending portions will not match correctly because you have underscores in your examples. I'm not sure how representative these are of your actual data but to match what you have here you would need:

regex = re.compile(r"[a-zA-Z0-9_\-\s]*(feedback)[\s_\-](report)[a-zA-Z0-9_\-\s]*\.xlsx",
    flags=re.IGNORECASE)

Another thing you should probably be careful of is to make sure you're actually working with just the file name and not the file path because in that case you'd have to worry about \ and / characters. Also note that I'm only matching for the exact characters I noticed you were missing. You may want to try

regex = re.compile(r".*(feedback).*(report).*\.xlsx", flags=re.IGNORECASE)

but, again, I'm not sure what your data actually looks like. Hope this helps
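
To illustrate the point about file names versus file paths, here is a minimal sketch that strips the directory part before matching. `PATTERN` and `matches_file_name` are hypothetical names, and the use of `PureWindowsPath` (which splits on both `\` and `/`) is an assumption that suits mixed path styles; on POSIX systems a backslash can legally appear inside a file name, so this may not fit every data set.

```python
import re
from pathlib import PureWindowsPath

# Assumed pattern: the two words joined by one separator, then anything, then .xlsx
PATTERN = re.compile(r"feedback[\s_-]report.*\.xlsx$", re.IGNORECASE)


def matches_file_name(path):
    """Match only the last path component (hypothetical helper)."""
    # PureWindowsPath treats both "\" and "/" as separators.
    file_name = PureWindowsPath(path).name
    return PATTERN.search(file_name) is not None


# The directory name contains the words, but the file name does not.
print(matches_file_name(r"C:\feedback report\summary.xlsx"))
# The file name itself contains the words.
print(matches_file_name("/data/ZL-SA_feedback report_012844.xlsx"))
```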

nad_rom · Accepted Answer · 2018-08-30 18:50:44Z

1

First, make the match case-insensitive (or lowercase the file names) to reduce the number of variants you have to handle:

regex = re.compile(r'feedback.{0,3}report.*\.xlsx?', flags=re.IGNORECASE)

This looks for 'feedback', then up to three arbitrary characters, then 'report', then any further characters, followed by a dot and an xls or xlsx extension.

or just

filename = 'ZL-SA_feedback report_012844.xlsx'
matched = re.search(r'feedback.{0,3}report.*\.xlsx?', filename.lower())

You can also use Python's glob module to search for files with shell-style wildcards:

import glob
glob.glob('*[fF][eE][eE][dD][bB][aA][cC][kK]*[rR][eE][pP][oO][rR][tT]*.xlsx')
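
As an alternative to spelling out every letter in brackets for glob, one option is to list the directory and filter the names with the case-insensitive regex from this answer. This is only a sketch: `find_feedback_reports` is a hypothetical helper, the trailing `$` anchor is an addition, and the temporary directory with empty files exists purely to make the example self-contained.

```python
import re
import tempfile
from pathlib import Path

# Regex from the answer, with an assumed "$" so the extension must come last.
PATTERN = re.compile(r"feedback.{0,3}report.*\.xlsx?$", re.IGNORECASE)


def find_feedback_reports(directory):
    """Return the sorted names of files in `directory` that match PATTERN."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and PATTERN.search(p.name)
    )


# Create a throwaway directory with a few empty files to search.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["ZL-SA_feedback report_012844.xlsx",
                 "ASARanem-SA_Feedback Report_012844.xlsx",
                 "notes.txt"]:
        (Path(tmp) / name).touch()
    found = find_feedback_reports(tmp)
    print(found)
```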

edited Aug 30, 2018 at 18:50

answered Aug 30, 2018 at 18:45


N Chauhan · Accepted Answer · 2018-08-30 18:56:10Z

0

Could you use just string methods like the following?

'feedbackreport' in name.replace('_', '').replace(' ', '').lower()

And also

name.endswith('.xlsx')

Giving you something like:

fileList = [
    'ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx',
    'ZL-SA_feedback report_012844.xlsx',
    'ASARanem-SA_Feedback Report_012844.xlsx'
]

fileNames = [name for name in fileList
             if ('feedbackreport' in name.replace('_', '').replace(' ', '').lower()
                 and name.endswith('.xlsx'))]

If there are more characters that could cause problems such as - then you could also make a quick function to remove bad characters:

def remove_bad_chars(string, chars): 
    for char in chars:
        string = string.replace(char, '')
    return string

Amending the appropriate portion of the if statement to:

if 'feedbackreport' in remove_bad_chars(name, '.,?!\'-/:;()"\\~ _').lower()
# the string of bad characters includes a space and an underscore
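
The character-removal idea above can also be sketched with `str.translate`, which deletes all unwanted characters in one pass. The names `BAD_CHARS`, `DELETE_BAD_CHARS`, and `normalise` are hypothetical, and the underscore is added to the character set as an assumption because the question's file names contain it.

```python
# Characters to drop before comparing; the trailing "_" is an assumed addition.
BAD_CHARS = '.,?!\'-/:;()"\\~ _'

# Translation table that maps every bad character to None, i.e. deletes it.
DELETE_BAD_CHARS = str.maketrans('', '', BAD_CHARS)


def normalise(name):
    """Remove the bad characters and lowercase the result (hypothetical helper)."""
    return name.translate(DELETE_BAD_CHARS).lower()


file_list = [
    'ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx',
    'ZL-SA_feedback report_012844.xlsx',
    'ASARanem-SA_Feedback Report_012844.xlsx',
    'some report.xlsx',  # made-up negative example
]

matches = [name for name in file_list
           if 'feedbackreport' in normalise(name)
           and name.lower().endswith('.xlsx')]
print(matches)
```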

edited Aug 30, 2018 at 18:56

answered Aug 30, 2018 at 18:44


1 Comment

Woody1193 Over a year ago

This wouldn't work because strip only removes leading and trailing whitespace

technazi · Accepted Answer · 2018-08-30 18:57:34Z

0

I used this for my string based on all your suggestions. This works for me in 99% of the cases.

regex = re.compile(r"[a-zA-Z0-9_\-\s]*(feedback)(\s|_)(report)s?[a-zA-Z0-9_\-\s]*\.xlsx", flags=re.IGNORECASE)

answered Aug 30, 2018 at 18:57
