Python - Check multiple white spaces in string

Question

I am using this function to check if a string contains multiple white spaces:

def check_multiple_white_spaces(text):
    return "  " in text

It usually works fine, but not in the following code:

from bs4 import BeautifulSoup
from string import punctuation

text = "<p>Hello &nbsp; &nbsp; &nbsp;world!!</p>\r\n\r"

text = BeautifulSoup(text, 'html.parser').text
text = ''.join(ch for ch in text if ch not in set(punctuation))
text = text.lower().replace('\n', ' ').replace('\t', '').replace('\r', '')

print check_multiple_white_spaces(text)

The final value of the text variable prints as hello world, but I don't understand why check_multiple_white_spaces returns False instead of True.

How can I fix this?

Have a look at what print(repr(text)) shows... after you've run it through the soup
– Jon Clements, Commented Sep 22, 2017 at 8:47

Martin Evans · Accepted Answer · 2017-09-22 08:58:45Z

3

If you print the contents of text using repr(), you will see that it does not contain two consecutive spaces:

'hello \xa0 \xa0 \xa0world '

As a result, your function correctly returns False. You can fix this by converting each no-break space into a regular space:

text = text.replace(u'\xa0', u' ')
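Putting that replacement together with the original check, as a minimal sketch. It starts from the string value reported by repr() rather than re-running BeautifulSoup, and the variable names before, normalized and after are only illustrative:

```python
def check_multiple_white_spaces(text):
    # True only when two literal U+0020 spaces sit side by side.
    return "  " in text

# Value of text after the question's cleanup steps, as shown by repr().
text = u'hello \xa0 \xa0 \xa0world '

# Every space is separated from the next by a no-break space.
before = check_multiple_white_spaces(text)

# Map each no-break space (U+00A0) to an ordinary space, then check again.
normalized = text.replace(u'\xa0', u' ')
after = check_multiple_white_spaces(normalized)

print(before)  # False
print(after)   # True
```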

edited Sep 22, 2017 at 8:58

answered Sep 22, 2017 at 8:48


Sraw · Accepted Answer · 2017-09-22 08:48:14Z

1

First, your function check_multiple_white_spaces only looks for two consecutive literal space characters, so it does not detect runs that include other whitespace characters.

You should use re.search(r"\s{2,}", text).
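A rough sketch of the regex approach, under stated assumptions: the helper name has_whitespace_run is hypothetical, and the re.UNICODE flag is included because on Python 2 the \s class only covers Unicode whitespace when that flag is set (on Python 3 it is already the default for str patterns):

```python
import re

# \s{2,} matches a run of two or more whitespace characters of any kind.
# re.UNICODE lets \s match Unicode whitespace such as U+00A0 on Python 2;
# on Python 3 the flag is redundant for str patterns but harmless.
MULTI_WS = re.compile(r"\s{2,}", re.UNICODE)

def has_whitespace_run(text):
    return MULTI_WS.search(text) is not None

print(has_whitespace_run(u"hello \xa0 \xa0 \xa0world "))  # True
print(has_whitespace_run(u"hello world"))                 # False
print(has_whitespace_run(u"tab\t\nnewline"))              # True
```

Note that this treats tabs and newlines as whitespace too, which is broader than the original two-space test.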

Second, if you print text, you will find you need to unescape text.

See this answer: How do I unescape HTML entities in a string in Python 3.1?

answered Sep 22, 2017 at 8:48


2 Comments

Wiktor Stribiżew Over a year ago

This is a Python 2.x question. You need to pass re.UNICODE to re.search so that \s matches all Unicode whitespace characters.

Sraw Over a year ago

@WiktorStribiżew you are right, I migrated to Python 3 a long time ago. Sorry about that.

Vivek Sable · Accepted Answer · 2017-09-22 09:00:47Z

0

There are no two consecutive spaces in the text variable, which is why check_multiple_white_spaces returns False.

>>> text
u'hello \xa0 \xa0 \xa0world '
>>> print text
hello      world

\xa0 is the no-break space (NBSP), also called a non-breaking or hard space. The code point of a regular space is 32 and that of the no-break space is 160:

(u' ', 32)
(u'\xa0', 160)
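One way to inspect the two characters, as a small sketch using only the standard library. The list name rows is illustrative; ord(), unicodedata.name() and str.isspace() are standard calls:

```python
import unicodedata

# Collect code point, official Unicode name, and isspace() result
# for a regular space and a no-break space.
rows = []
for ch in (u' ', u'\xa0'):
    rows.append((ord(ch), unicodedata.name(ch), ch.isspace()))

for row in rows:
    print(row)
```

Both characters count as whitespace for isspace(), yet they are different code points, so a literal two-space substring test does not see them as equal.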

The character \xa0 is a NO-BREAK SPACE, and the closest ASCII equivalent would of course be a regular space.

Use the unidecode module to convert all non-ASCII characters to their closest ASCII equivalent.

Demo:

>>> import unidecode
>>> unidecode.unidecode(text)
'hello      world '
>>> "  " in unidecode.unidecode(text)
True
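If installing a third-party module is not an option, a standard-library alternative (a different technique from unidecode, offered here only as a sketch) is Unicode NFKC normalization, which replaces the no-break space with its compatibility equivalent, a regular space. Be aware that NFKC also rewrites other compatibility characters, such as ligatures and full-width letters, which may or may not be wanted. The variable names folded and has_double_space are illustrative:

```python
import unicodedata

# Value of text after the question's cleanup steps.
text = u'hello \xa0 \xa0 \xa0world '

# NFKC maps U+00A0 (NO-BREAK SPACE) to U+0020 (SPACE).
folded = unicodedata.normalize('NFKC', text)
has_double_space = "  " in folded

print(repr(folded))
print(has_double_space)  # True
```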

edited Sep 22, 2017 at 9:00

answered Sep 22, 2017 at 8:54

