Extract word from string using regex word boundaries in Python

Question

Suppose I have a file name like this and I want to extract part of it as a string in Python:

import re
fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
rgx = re.compile('\b_[A-Z]{2}\b')
print(re.findall(rgx, fn))

Expected output is ['DE'], but the actual output is [].

What is the condition, so that you want to get 'DE' and not 'DC'?
– user11116003, Commented May 21, 2019 at 6:38
Just re.findall(r'_([A-Z]{2})_', fn) will do, no need for any lookarounds.
– Wiktor Stribiżew, Commented Jun 5, 2019 at 13:24
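
For context on why the pattern in the question returns an empty list, here is a minimal sketch using the same file name. It relies on two facts about Python: in a non-raw string literal '\b' is the backspace character, and the underscore counts as a word character for \b. The variable names are only illustrative.

```python
import re

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"

# Without the r prefix, '\b' is the backspace character (\x08), not a word boundary.
is_backspace = '\b' == '\x08'
print(is_backspace)  # True

# Even as a raw string, \b cannot match before "_" here: "_" is a word
# character, and so is the letter or digit in front of each underscore.
raw_boundary = re.findall(r'\b_[A-Z]{2}\b', fn)
print(raw_boundary)  # []

# Matching the underscores literally and capturing the letters between them.
captured = re.findall(r'_([A-Z]{2})_', fn)
print(captured)  # ['DE']
```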

Jan · Accepted Answer · 2019-05-21 06:29:51Z

2

You could use

(?<=_)[A-Z]+(?=_)

This makes use of lookarounds on both sides, see a demo on regex101.com. For tighter results, you'd need to specify more sample inputs though.
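
A short sketch of how the lookaround pattern behaves in Python. The second file name is hypothetical and is only there to show one case where lookarounds and a capturing group may differ: a capturing pattern consumes the trailing underscore, so a code that immediately follows is skipped.

```python
import re

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"

# The lookbehind and lookahead assert the underscores without consuming them,
# so findall returns only the letters in between.
lookaround = re.compile(r'(?<=_)[A-Z]+(?=_)')
result = lookaround.findall(fn)
print(result)  # ['DE']

# Hypothetical name with two adjacent codes sharing one underscore.
adjacent = "X_AB_CD_Y.xlsx"
with_lookarounds = lookaround.findall(adjacent)
with_group = re.findall(r'_([A-Z]+)_', adjacent)
print(with_lookarounds)  # ['AB', 'CD']
print(with_group)        # ['AB']
```

For the file name in the question both approaches give the same result, so the choice only matters if codes can sit directly next to each other.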


Rakesh · Accepted Answer · 2019-05-21 06:29:13Z

1

Use _([A-Z]{2})

Ex:

import re
fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
rgx = re.compile(r'_([A-Z]{2})')
print(rgx.findall(fn))  # You can call findall directly on the compiled pattern.

Output:

['DE']


Emma Marcier · Accepted Answer · 2019-05-21 06:34:06Z

Your desired output seems to be DE, which is bounded by an underscore on the left and on the right. This expression might also work:

# -*- coding: UTF-8 -*-
import re

string = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
expression = r'_([A-Z]+)_'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else: 
    print('🙀 Sorry! No matches!')

Output

YAAAY! "DE" is a match 💚💚💚

Or you can add a {2} quantifier, if you prefer:

# -*- coding: UTF-8 -*-
import re

string = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
expression = r'_([A-Z]{2})_'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else: 
    print('🙀 Sorry! No matches!')


Michał Turczyn · Accepted Answer · 2019-05-21 06:41:52Z

1

Try pattern: \_([^\_]+)\_[^\_\.]+\.xlsx

Explanation:

\_ - matches _ literally

[^\_]+ - negated character class with the + quantifier: matches one or more characters other than _

[^\_\.]+ - same as above, but this time matches characters other than _ and .

\.xlsx - matches .xlsx literally


The idea is to match the last _something_ pattern before the .xlsx extension.
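
A minimal sketch of this pattern in Python, using the file name from the question. The backslashes before _ and the one before . inside the character class are dropped here because they are not required; the variable names are only illustrative.

```python
import re

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"

# _([^_]+)_   captures the token between two underscores
# [^_.]+      then a final token containing neither "_" nor "."
# \.xlsx      followed by the literal extension
pattern = re.compile(r'_([^_]+)_[^_.]+\.xlsx')

match = pattern.search(fn)
last_token = match.group(1) if match else None
print(last_token)  # DE
```

Unlike the [A-Z]-based patterns, this one does not check that the captured token is uppercase or two letters long; it only relies on the token's position before the last part of the name.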


U13-Forward · Accepted Answer · 2019-05-21 06:40:47Z

0

Another re solution:

import re
fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
rgx = re.compile(r'_([A-Z]{1,})_')
print(rgx.findall(fn))


Daweo · Accepted Answer · 2019-05-21 07:11:41Z

0

You could use a regular expression (the re module) for that, as already shown; however, this can also be done without any imports, in the following way:

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
out = [i for i in fn.split('_')[1:] if len(i)==2 and i.isalpha() and i.isupper()]
print(out) # ['DE']

Explanation: I split fn at _, discard the first element, and then filter the remaining elements so that only strings of length 2 that consist entirely of uppercase letters remain.
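
As a sketch of how the split-based filter compares with the regex answers, the snippet below wraps it in a hypothetical helper, codes_by_split, and tries it on a made-up name where the code is the last token. The re import is only there for the comparison.

```python
import re

def codes_by_split(name):
    # Same filter as the one-liner above, wrapped in a hypothetical helper.
    return [part for part in name.split('_')[1:]
            if len(part) == 2 and part.isalpha() and part.isupper()]

fn = "DC_QnA_bo_v.15.12.3_DE_duplicates.xlsx"
print(codes_by_split(fn))  # ['DE']

# Hypothetical name where the code is the last token, with no trailing underscore.
other = "report_FR"
print(codes_by_split(other))               # ['FR']
print(re.findall(r'_([A-Z]{2})_', other))  # []
```

The split version only needs an underscore before the token, while the _([A-Z]{2})_ pattern needs one on both sides, so the two can disagree on names like the second one.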
