1

Suppose I have a file name like the one below and I want to extract part of it as a string in Python:

import re
fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
rgx = re.compile('\b_[A-Z]{2}\b')
print(re.findall(rgx, fn))

Expected output is ['DE'], but the actual output is [].

5
  • 2
    A _ is a word character, so \b won't work Commented May 21, 2019 at 6:27
  • Try _([A-Z]{2}) instead with a capturing group Commented May 21, 2019 at 6:28
  • rgx = re.compile('_([A-Z]{2})')? Commented May 21, 2019 at 6:28
  • What is the condition, so that you want to get 'DE' and not 'DC'? Commented May 21, 2019 at 6:38
  • Just re.findall(r'_([A-Z]{2})_', fn) will do, no need for any lookarounds Commented Jun 5, 2019 at 13:24
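To illustrate the comments above, here is a minimal sketch of why the original pattern returns an empty list. It assumes two separate problems are at play: without the r prefix, Python reads '\b' in a string literal as a backspace character, and even in a raw string \b cannot match before _ here, because every underscore in this file name is preceded by another word character.

```python
import re

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"

# Problem 1: in a non-raw string literal, '\b' is the backspace
# character (0x08), not a regex word boundary.
assert '\b' == '\x08'
non_raw = re.findall('\b_[A-Z]{2}\b', fn)

# Problem 2: with a raw string, \b is a word boundary, but "_" is itself
# a word character, so there is no boundary between "3" and "_".
raw_boundary = re.findall(r'\b_[A-Z]{2}\b', fn)

# Dropping \b and capturing only the letters between two underscores works.
fixed = re.findall(r'_([A-Z]{2})_', fn)

print(non_raw, raw_boundary, fixed)  # [] [] ['DE']
```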

6 Answers

2

You could use

(?<=_)[A-Z]+(?=_)

This uses lookarounds on both sides, so the underscores are required but are not part of the match; see a demo on regex101.com. For tighter results, you'd need to provide more sample inputs, though.
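As a rough sketch of how the lookaround pattern behaves in Python, the snippet below applies it to the file name from the question and then to a second, hypothetical file name (not from the question) in which two codes share an underscore. In that case a pattern that consumes the underscores finds only the first code, while the lookaround version finds both.

```python
import re

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"

# Lookbehind and lookahead require the underscores without consuming them.
lookaround = re.compile(r'(?<=_)[A-Z]+(?=_)')
from_question = lookaround.findall(fn)
print(from_question)  # ['DE']

# Hypothetical name with two adjacent codes sharing one underscore.
adjacent = "report_AB_CD_final.xlsx"

# This pattern consumes the trailing "_", so "CD" has no leading "_" left.
consuming = re.compile(r'_([A-Z]+)_')
consumed_result = consuming.findall(adjacent)
lookaround_result = lookaround.findall(adjacent)

print(consumed_result)    # ['AB']
print(lookaround_result)  # ['AB', 'CD']
```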


1

Use _([A-Z]{2})

Example:

import re
fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
rgx = re.compile(r'_([A-Z]{2})')
print(rgx.findall(fn))  # The compiled pattern has its own findall method.

Output:

['DE']
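One caveat worth checking, sketched here with a hypothetical file name that is not from the question: without a closing underscore, _([A-Z]{2}) also picks up the first two letters of a longer uppercase run. Adding the trailing _ to the pattern avoids that.

```python
import re

# Hypothetical file name containing a three-letter uppercase token.
name = "DC_QnA_USA_DE_duplicates.xlsx"

# No closing underscore: "_US" matches as the start of "_USA".
open_ended = re.findall(r'_([A-Z]{2})', name)

# Closing underscore required: only exact two-letter tokens match.
anchored = re.findall(r'_([A-Z]{2})_', name)

print(open_ended)  # ['US', 'DE']
print(anchored)    # ['DE']
```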


1

Your desired output seems to be DE, which is bounded by an _ on the left and on the right. This expression might also work:

# -*- coding: UTF-8 -*-
import re

string = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
expression = r'_([A-Z]+)_'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else: 
    print('🙀 Sorry! No matches!')

Output

YAAAY! "DE" is a match 💚💚💚

Or you can add a {2} quantifier if you want exactly two letters:

# -*- coding: UTF-8 -*-
import re

string = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
expression = r'_([A-Z]{2})_'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else: 
    print('🙀 Sorry! No matches!')


1

Try this pattern: \_([^\_]+)\_[^\_\.]+\.xlsx

Explanation:

\_ - match _ literally

[^\_]+ - negated character class with the + quantifier: match one or more characters other than _

[^\_\.]+ - same as above, but this time match characters other than _ and .

\.xlsx - match .xlsx literally

The idea is to match the last _something_ pattern before the .xlsx extension.
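A minimal sketch of this pattern in Python follows. It assumes the same file name as in the question and drops the backslashes before _ and inside the character class, since they are not needed there. The second file name is hypothetical and only shows that this pattern captures any token, not just uppercase letters.

```python
import re

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"

# Same pattern as above, without the redundant escapes.
pattern = re.compile(r'_([^_]+)_[^_.]+\.xlsx')

match = pattern.search(fn)
token = match.group(1) if match else None
print(token)  # DE

# Hypothetical name: the captured token is not restricted to uppercase.
other = pattern.search("a_b_c.xlsx")
other_token = other.group(1) if other else None
print(other_token)  # b
```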


0

Another re solution:

import re
fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
rgx = re.compile(r'_([A-Z]{1,})_')  # {1,} is equivalent to +
print(rgx.findall(fn))  # ['DE']


0

You could use a regular expression (the re module) for that, as already shown; however, this can also be done without any imports, in the following way:

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
out = [i for i in fn.split('_')[1:] if len(i)==2 and i.isalpha() and i.isupper()]
print(out) # ['DE']

Explanation: I split fn at _, discard the first element, and keep only the elements that are strings of length 2 consisting entirely of uppercase letters.
