Regular expression for matching non-whitespace in Python

Question

I want to use re.search to extract the first set of non-whitespace characters. I have the following pseudoscript that recreates my problem:

#!/usr/bin/env python2.7
import re

line = "STARC-1.1.1.5             ConsCase    WARNING    Warning"
m = re.search('^[^\S]*?',line)
if m:
    print m.group(0)

It seems to be printing the whitespace instead of STARC-1.1.1.5.

So far as I understand it, this regular expression is saying: at the start of the line, find a set of non-whitespace characters, and don't be greedy.

I was pretty sure this would work; the documentation says I can use \S inside [] to match non-whitespace, so I'm not sure where the issue is.

Now, I know, I know this probably looks weird: why aren't I using some other function to do this? Well, there's more than one way to skin a cat, and I'm still getting the hang of regular expressions in Python, so I'd like to know how I can use re.search to extract this field in this fashion.

@melpomene re is greedy. It won't split on an empty string here.
– e4c5, Commented Jan 5, 2017 at 12:05
@e4c5 I tried that and got FutureWarning: split() requires a non-empty pattern match. With \s+ I didn't get a warning.
– melpomene, Commented Jan 5, 2017 at 12:10
@melpomene I also tried it in Python 2.7 with IPython and got the desired result.
– e4c5, Commented Jan 5, 2017 at 12:11
My test was with 3.5.2. I also got the desired result in both cases, but only \s+ didn't trigger a warning in re.py:203.
– melpomene, Commented Jan 5, 2017 at 12:12

Wiktor Stribiżew · Accepted Answer · 2017-01-05 12:06:35Z

17

The [^\S] is a negated character class that is equal to \s (whitespace pattern). The *? is a lazy quantifier that matches zero or more characters, but as few as possible, and when used at the end of the pattern never actually matches any characters.
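A minimal sketch of why the original pattern returns an empty string, using the line from the question (the variable names lazy and fixed are only illustrative):

```python
import re

line = "STARC-1.1.1.5             ConsCase    WARNING    Warning"

# [^\S] is equivalent to \s, and the lazy *? at the end of the pattern
# is already satisfied by zero characters, so the match is empty.
lazy = re.search(r'^[^\S]*?', line)
print(repr(lazy.group(0)))  # ''

# \S+ needs at least one non-whitespace character and is greedy.
fixed = re.search(r'^\S+', line)
print(fixed.group(0))  # STARC-1.1.1.5
```

Note that the first call still succeeds (m is not None); it simply matches zero characters at position 0.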

Replace your m = re.search('^[^\S]*?',line) line with

m = re.match(r'\S+',line)

or, if you also want to allow an empty-string match:

m = re.match(r'\S*',line)

The re.match method anchors the pattern at the start of the string. With re.search, you need to keep the ^ anchor at the start of the pattern:

m = re.search(r'^\S+',line)
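To illustrate the anchoring difference, here is a small sketch on a hypothetical line that starts with spaces (the indented string is an assumption, not part of the question's data):

```python
import re

# Hypothetical input: the first field is preceded by whitespace.
indented = "   STARC-1.1.1.5   ConsCase"

anchored_match = re.match(r'\S+', indented)     # must match at index 0
anchored_search = re.search(r'^\S+', indented)  # ^ pins the match to index 0
free_search = re.search(r'\S+', indented)       # first run found anywhere

print(anchored_match)        # None
print(anchored_search)       # None
print(free_search.group(0))  # STARC-1.1.1.5
```

Which behavior is wanted depends on whether leading whitespace should count as "no first field" or should be skipped.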

See the Python demo:

import re
line = "STARC-1.1.1.5             ConsCase    WARNING    Warning"
m = re.search(r'^\S+', line)  # raw string keeps \S intact for the regex engine
if m:
    print(m.group(0))
# => STARC-1.1.1.5

However, in this case you can simply use split():

res = line.split() 
print(res[0])
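One caveat with the split() approach is that res[0] raises IndexError when the line is empty or contains only whitespace. A hedged sketch of a guarded variant (the names fields, first and blank_fields are illustrative):

```python
line = "STARC-1.1.1.5             ConsCase    WARNING    Warning"

# maxsplit=1 stops after the first field instead of splitting the whole line.
fields = line.split(None, 1)
first = fields[0] if fields else None  # guard: a blank line yields an empty list
print(first)  # STARC-1.1.1.5

# A whitespace-only line produces no fields at all.
blank_fields = "   ".split(None, 1)
print(blank_fields)  # []
```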

See another Python demo.

edited Jan 5, 2017 at 12:06

answered Jan 5, 2017 at 12:00

Wiktor Stribiżew



melpomene · Answer · 2017-01-05 12:01:49Z

8

\s matches a whitespace character.

\S matches a non-whitespace character.

[...] matches a character in the set ....

[^...] matches a character not in the set ....

[^\S] matches a character that is not a non-whitespace character, i.e. it matches a whitespace character.
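The equivalence of [^\S] and \s can be checked directly. This is only a quick sketch over a handful of sample characters (the samples string is an arbitrary choice), and re.fullmatch requires Python 3.4 or later:

```python
import re
import string

# Every ASCII whitespace character plus a few non-whitespace ones.
samples = string.whitespace + "aZ9-._"

# Both classes should agree on each individual character.
agree = all(
    bool(re.fullmatch(r'[^\S]', ch)) == bool(re.fullmatch(r'\s', ch))
    for ch in samples
)
print(agree)  # True

found = re.findall(r'[^\S]', "a b\tc\n")
print(found)  # [' ', '\t', '\n']
```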

answered Jan 5, 2017 at 12:01

melpomene


NEO MED · Answer · 2020-07-31 13:36:25Z

0

import re
line = "STARC-1.1.1.5             ConsCase    WARNING    Warning"
m = re.search('S.+[0-9]',line)
print(m.group(0))

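re.search returns a match object, so build the pattern from the letters and digits you expect and print the matched text with m.group(0), as in the code above. If you print only the variable m, you get the match object's representation rather than the matched string. Hope this answers your question.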

m = re.search('[A-Z].+[0-9]',line)

Changing the pattern to start with [A-Z] makes the match begin at an uppercase letter from A to Z. Conversely, if you change it to lowercase letters, as in

m = re.search('[a-z].+[0-9]',line)

the match begins at a lowercase letter instead. Sometimes you also need to include symbols in the pattern, either to match the symbols themselves or to match up to the characters before a symbol.

answered Jul 31, 2020 at 13:36

NEO MED


2 Comments

Sergey Shubin Over a year ago

It looks like the author of the question wanted to extract the first set of any non-whitespace characters. This solution assumes that all the extracted strings begin with alphabetic characters and end with numeric characters. Though the author's example matches this pattern, the question is about any non-whitespace characters.

Ivo Mori Over a year ago

Instead of posting a second, extended answer it'd be better to simply edit your first answer to include the additional information. Also note that the original comment from @Toto: (Re)read the question: I want to use re.search to extract the first set of non-whitespace characters. still applies. Your suggested regular expression matches OP's example STARC-1.1.1.5, but it doesn't match the first set of non-whitespace characters in general.
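The limitation raised in these comments can be checked with a short sketch. Both input lines below are hypothetical and chosen only to show where the letter-to-digit pattern and \S+ disagree:

```python
import re

# Hypothetical line: first field neither starts with a letter nor ends with a digit.
other = "#rule_x   ConsCase   WARNING"
narrow = re.search(r'[A-Z].+[0-9]', other)
general = re.search(r'^\S+', other)
print(narrow)            # None (no digit follows any uppercase letter)
print(general.group(0))  # #rule_x

# Hypothetical line with a digit in a later field: the greedy .+ runs past
# the first field up to the last digit on the line.
later_digit = "STARC-1.1.1.5   Case2   WARNING"
overreach = re.search(r'[A-Z].+[0-9]', later_digit)
print(overreach.group(0))  # STARC-1.1.1.5   Case2
```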

NEO MED · Answer · 2020-08-02 06:26:36Z

0

Replace your re.search call as below. \S matches a non-whitespace character, and + repeats it one or more times. re.search scans the string from the first character, so the first match it returns is the first run of non-whitespace characters.

import re
line = "STARC-1.1.1.5             ConsCase    WARNING    Warning"
m = re.search(r'\S+', line)
print(m.group(0))
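Because re.search returns None when nothing matches, calling m.group(0) unconditionally raises AttributeError on an empty or whitespace-only line. A minimal sketch of a guarded helper follows; first_token is a hypothetical name, not something from the re module:

```python
import re

def first_token(text):
    """Return the first run of non-whitespace characters, or None if there is none."""
    m = re.search(r'\S+', text)
    return m.group(0) if m else None

print(first_token("STARC-1.1.1.5   ConsCase"))  # STARC-1.1.1.5
print(first_token("   \t  "))                   # None
```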

edited Aug 2, 2020 at 6:26

answered Jul 26, 2020 at 11:58

NEO MED


2 Comments

Toto Over a year ago

Code without explanation is useless. And what question are you answering?

Toto Over a year ago

(Re)read the question: I want to use re.search to extract the first set of non-whitespace characters.
