4

I want to find all 'xlsx' files whose names contain 'feedback report' somewhere. The filter should be robust: variants such as 'feedback_report', 'feedback report', and 'Feedback Report' should all match.

Example file names :

  1. ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx
  2. ZL-SA_feedback report_012844.xlsx
  3. ASARanem-SA_Feedback Report_012844.xlsx

My attempt so far, which does not work:

regex = re.compile(r"[a-zA-Z0-0]*[fF][eE][eE][dD][bB][aA][cC][kK]\s[rR][eE][pP][oO][rR][tT][a-zA-Z0-0]*.xlsx")
2
  • 1
    I'm not a Python developer, but .*feedback[\s_]report.*\.xlsx with the IGNORECASE option seems to be sufficient. Commented Aug 30, 2018 at 18:44
  • Yes, you are absolutely correct, and it removes a lot of the permutations pointed out by everyone in this thread. Commented Aug 30, 2018 at 18:56
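
For reference, the pattern from the first comment can be tried in Python roughly like this. It is only a sketch: the helper name is_feedback_report and the extra "some report.xlsx" entry are my own illustrative assumptions, and re.fullmatch is used so that the whole file name has to fit the pattern.

```python
import re

# Pattern from the comment: "feedback", one whitespace or underscore, "report",
# anywhere in a name that ends in ".xlsx". IGNORECASE covers every capitalisation.
FEEDBACK_RE = re.compile(r".*feedback[\s_]report.*\.xlsx", re.IGNORECASE)

def is_feedback_report(filename):
    """Return True if the file name looks like a feedback report workbook."""
    return FEEDBACK_RE.fullmatch(filename) is not None

names = [
    "ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx",
    "ZL-SA_feedback report_012844.xlsx",
    "ASARanem-SA_Feedback Report_012844.xlsx",
    "some report.xlsx",  # hypothetical non-matching name
]
matches = [n for n in names if is_feedback_report(n)]
```

With the three example names from the question, matches should contain exactly those three and leave out the last entry.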

5 Answers

3

This will work:

re.search(r"(feedback)(.*?|\s)(report)", string, re.IGNORECASE)

I tested it on the following input list:

import re

a = ["ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx",
     "ZL-SA_feedback report_012844.xlsx",
     "ASARanem-SA_Feedback Report_012844.xlsx",
     "some report",
     "feedback-report"]

for i in a:
    # Raw string so that \s reaches the regex engine unchanged
    print(re.search(r"(feedback)(.*?|\s)(report)", i, re.IGNORECASE))

The output, which matches what the OP expects, is:

<_sre.SRE_Match object; span=(21, 36), match='FEEDBACK_REPORT'>
<_sre.SRE_Match object; span=(6, 21), match='feedback report'>
<_sre.SRE_Match object; span=(12, 27), match='Feedback Report'>
None
<_sre.SRE_Match object; span=(0, 15), match='feedback-report'>

1

Your regex is nearly acceptable, but the beginning and ending portions will not match correctly because you have underscores in your examples. I'm not sure how representative these are of your actual data but to match what you have here you would need:

regex = re.compile(r"[a-zA-Z0-9_\-\s]*(feedback)[\s_\-](report)[a-zA-Z0-9_\-\s]*\.xlsx",
    flags=re.IGNORECASE)

Another thing you should probably be careful of is to make sure you're actually working with just the file name and not the file path because in that case you'd have to worry about \ and / characters. Also note that I'm only matching for the exact characters I noticed you were missing. You may want to try

regex = re.compile(r".*(feedback).*(report).*\.xlsx", flags=re.IGNORECASE)

but, again, I'm not sure what your data actually looks like. Hope this helps
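
To illustrate the point about paths, here is a rough sketch that strips the directories before matching. The helper names file_name_only and is_feedback_report and the two sample paths are assumptions made for this example only; splitting on both separators by hand is just one simple way to get the last path component.

```python
import re

# Loose pattern in the spirit of the answer: "feedback" ... "report" ... ".xlsx".
LOOSE_RE = re.compile(r".*feedback.*report.*\.xlsx", re.IGNORECASE)

def file_name_only(path):
    """Return the last path component, splitting on both / and \\."""
    return re.split(r"[\\/]", path)[-1]

def is_feedback_report(path):
    """Test only the file name, so folder names cannot cause a match."""
    return LOOSE_RE.fullmatch(file_name_only(path)) is not None

# Hypothetical paths for illustration.
paths = [
    r"C:\reports\ZL-SA_feedback report_012844.xlsx",
    "feedback/report/summary.xlsx",  # only the folders mention the words
]
results = [is_feedback_report(p) for p in paths]
```

The second path is the interesting one: matched as a whole string it would pass the loose pattern, but once reduced to summary.xlsx it does not.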


1

First of all, ignore case (or lowercase the file names) to reduce the number of variants you have to handle:

regex = re.compile(r'feedback.{0,3}report.*\.xlsx?', flags=re.IGNORECASE)

This looks for 'feedback', then up to 3 arbitrary characters, then 'report', then anything, followed by a dot and an xls or xlsx extension.

Or lowercase the name yourself before searching:

filename = 'ZL-SA_feedback report_012844.xlsx'
matched = re.search(r'feedback.{0,3}report.*\.xlsx?', filename.lower())

You can also use Python's glob module to find files with shell-style wildcards:

import glob
glob.glob('*[fF][eE][eE][dD][bB][aA][cC][kK]*[rR][eE][pP][oO][rR][tT]*.xlsx')
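
Typing those bracket pairs by hand is error-prone, so one option is to build them. The following is a small sketch, not part of the glob module itself: either_case is a hypothetical helper, and fnmatch.fnmatchcase is used only to check the generated pattern against a known name without touching the disk.

```python
import fnmatch
import glob

def either_case(word):
    """Expand 'ab' to '[aA][bB]' so a glob matches any capitalisation."""
    return "".join(
        f"[{ch.lower()}{ch.upper()}]" if ch.isalpha() else glob.escape(ch)
        for ch in word
    )

pattern = "*{}*{}*.{}".format(
    either_case("feedback"), either_case("report"), either_case("xlsx")
)

# glob.glob(pattern) would search the current working directory.
# fnmatch.fnmatchcase understands the same wildcards and compares
# case-sensitively, so it is a convenient way to try the pattern out.
ok = fnmatch.fnmatchcase("ZL-SA_feedback report_012844.xlsx", pattern)
```

The generated pattern also covers the extension, so names ending in .XLSX are picked up as well.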


0

Could you use just string methods like the following?

'feedbackreport' in name.replace('_', '').replace(' ', '').lower()

And also

name.endswith('.xlsx')

Giving you something like:

fileList = [
    'ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx',
    'ZL-SA_feedback report_012844.xlsx',
    'ASARanem-SA_Feedback Report_012844.xlsx'
]

fileNames = [name for name in fileList
             if ('feedbackreport' in name.replace('_', '').replace(' ', '').lower()
                 and name.endswith('.xlsx'))]

If there are more characters that could cause problems such as - then you could also make a quick function to remove bad characters:

def remove_bad_chars(string, chars): 
    for char in chars:
        string = string.replace(char, '')
    return string

Amending the appropriate portion of the if statement to:

if 'feedbackreport' in remove_bad_chars(name, '.,?!\'-_/:;()"\\~ ').lower()
# included a white space in the string of bad characters

1 Comment

This wouldn't work because strip only removes leading and trailing whitespace
0

I used this for my string based on all your suggestions. This works for me in 99% of the cases.

regex = re.compile(r"[a-zA-Z0-9_\-\s]*(feedback)(\s|_)(report)s?[a-zA-Z0-9_\-\s]*\.xlsx", flags=re.IGNORECASE)
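
For anyone wondering about the remaining 1%, here is a quick sketch that runs this kind of pattern over the example names plus two made-up ones (marked hypothetical below). It compares regex.match, which has to succeed from the first character, with regex.search, which may start anywhere in the name.

```python
import re

regex = re.compile(
    r"[a-zA-Z0-9_\-\s]*(feedback)(\s|_)(report)s?[a-zA-Z0-9_\-\s]*\.xlsx",
    flags=re.IGNORECASE,
)

samples = [
    "ZSS Project_JKIAL-SA_FEEDBACK_REPORT_Jan 29th 2015.xlsx",
    "ZL-SA_feedback report_012844.xlsx",
    "ASARanem-SA_Feedback Report_012844.xlsx",
    "Team (A)_feedback reports.xlsx",  # hypothetical: parentheses in the prefix
    "feedback-report.xlsx",            # hypothetical: hyphen between the words
]

# Map each name to (matched from the start, matched anywhere).
results = {
    name: (regex.match(name) is not None, regex.search(name) is not None)
    for name in samples
}

for name, (anchored, anywhere) in results.items():
    print(f"{name!r}: match={anchored}, search={anywhere}")
```

The three original names pass either way. The name with parentheses only passes with search, because the leading character class does not allow "(" or ")", and the hyphenated name fails both, because only whitespace or an underscore is accepted between the two words.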
