
I am using this function to check whether a string contains multiple consecutive spaces:

def check_multiple_white_spaces(text):
    return "  " in text

It usually works fine, but not in the following code:

from bs4 import BeautifulSoup
from string import punctuation

text = "<p>Hello &nbsp; &nbsp; &nbsp;world!!</p>\r\n\r"

text = BeautifulSoup(text, 'html.parser').text
text = ''.join(ch for ch in text if ch not in set(punctuation))
text = text.lower().replace('\n', ' ').replace('\t', '').replace('\r', '')

print check_multiple_white_spaces(text)

The final value of the text variable prints as hello      world , but I don't know why check_multiple_white_spaces returns False instead of True.

How can I fix this?

    Have a look at what print(repr(text)) shows... after you've run it through the soup Commented Sep 22, 2017 at 8:47

3 Answers

3

If you print the contents of text using repr(), you will see that it does not contain two consecutive spaces:

'hello \xa0 \xa0 \xa0world '

As a result, your function correctly returns False. You can fix this by converting each no-break space (\xa0) into a regular space:

text = text.replace(u'\xa0', u' ')
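As a minimal sketch of that fix, the snippet below starts from the repr() value shown above instead of re-running BeautifulSoup, replaces each no-break space with a regular space, and then applies the original check. The names raw, cleaned, before and after are illustrative only, and print() is called in a form that works on both Python 2 and Python 3.

```python
def check_multiple_white_spaces(text):
    # True when the string contains two consecutive ASCII spaces
    return "  " in text

# Value of text after the BeautifulSoup and punctuation steps, as shown by repr()
raw = u'hello \xa0 \xa0 \xa0world '

# Replace every no-break space (U+00A0) with a regular space (U+0020)
cleaned = raw.replace(u'\xa0', u' ')

before = check_multiple_white_spaces(raw)     # spaces and NBSPs alternate
after = check_multiple_white_spaces(cleaned)  # the run is now all ASCII spaces
print(before)
print(after)
```

The check fails on raw because regular spaces and no-break spaces alternate, so no two regular spaces are ever adjacent.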

1

First, your function check_multiple_white_spaces only looks for two consecutive ASCII space characters, so it misses runs that include other whitespace characters such as tabs or no-break spaces.

You should use re.search(r"\s{2,}", text).

Second, if you print text, you will find you need to unescape text.

See this answer.

How do I unescape HTML entities in a string in Python 3.1?
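A sketch of the regular-expression check suggested above, assuming text is a Unicode string. The helper name has_whitespace_run is hypothetical. On Python 3, \s already matches Unicode whitespace in str patterns, so re.UNICODE is redundant there; it is included because Python 2 needs it for \s to match characters such as \xa0.

```python
import re

# Two or more consecutive whitespace characters of any kind
WHITESPACE_RUN = re.compile(r"\s{2,}", re.UNICODE)

def has_whitespace_run(text):
    # search() returns a match object or None
    return WHITESPACE_RUN.search(text) is not None

with_nbsp = has_whitespace_run(u'hello \xa0 \xa0 \xa0world ')
single_spaces = has_whitespace_run(u'hello world')
print(with_nbsp)
print(single_spaces)
```

Unlike the substring test, this treats a regular space followed by a no-break space as a run, so no prior replacement of \xa0 is needed.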

2 Comments

This is a Python 2.x question. You need to pass re.UNICODE to re.search so that \s matches all Unicode whitespace characters.
@WiktorStribiżew You are right; I migrated to Python 3 a long time ago. Sorry about that.
0

There are no consecutive spaces in the text variable, which is why check_multiple_white_spaces returns False.

>>> text
u'hello \xa0 \xa0 \xa0world '
>>> print text
hello      world 

\xa0 is the no-break space (NBSP), also called a non-breaking space or hard space. The code point of a regular space is 32 and that of the no-break space is 160:

(u' ', 32)
(u'\xa0', 160)
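The pairs above can be reproduced with ord(). The sketch below also looks up the official Unicode names with the standard unicodedata module, which is an addition not found in the original answer; the name rows is illustrative.

```python
import unicodedata

# Collect (code point, Unicode name) for a regular space and a no-break space
rows = [(ord(ch), unicodedata.name(ch)) for ch in (u' ', u'\xa0')]

for code_point, name in rows:
    print(code_point, name)
```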

The character \xa0 is a NO-BREAK SPACE, and the closest ASCII equivalent would of course be a regular space.

Use the third-party unidecode module to convert all non-ASCII characters to their closest ASCII equivalents.

Demo:

>>> import unidecode
>>> unidecode.unidecode(text)
'hello      world '
>>> "  " in unidecode.unidecode(text)
True
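If adding a third-party dependency is not an option, one standard-library alternative (a swapped-in technique, not the unidecode approach shown above) is Unicode compatibility normalization. This is a sketch under the assumption that only compatibility characters such as the no-break space need folding; NFKC also rewrites other characters, for example ligatures and full-width forms, and it does not strip accents the way unidecode does.

```python
import unicodedata

text = u'hello \xa0 \xa0 \xa0world '

# NFKC applies compatibility mappings; U+00A0 maps to U+0020
normalized = unicodedata.normalize('NFKC', text)

has_double_space = u"  " in normalized
print(repr(normalized))
print(has_double_space)
```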
