I need to split a string of text into its component words, so I'm using a regex to split it on spaces (in a TypeScript file, btw).

splitIntoWords(text: string) : Array<string> {
    const separator = ' ';
    const words = text.split(new RegExp(separator, 'g'));
    return words;
}

This mostly works, but I've noticed that I regularly get words in the array that still contain spaces. If I copy the text into the Chrome console and call split(' ') on it, I get the correct number of words, but when I use the variable (even in the console) it invariably fails in some cases. I can't work out what the difference is. This is an example of my text:

"Le coronavirus en France : la décrue se poursuit en réanimation, la reprise économique au cœur des préoccupations. La mise en œuvre du plan de déconfinement élaboré par le gouvernement doit encore faire l’objet, jeudi, d’un « travail de concertation et d’adaptation aux réalités de terrain » avec les responsables et les élus locaux."

The regex never manages to split the substring "économique au" into two components, for instance. Does anyone know why this is happening?

  • Have you tried text.split(/\s/g)? Commented Apr 29, 2020 at 23:04
  • I cannot reproduce your problem. Your code works just fine for me. Commented Apr 30, 2020 at 7:03

1 Answer

It sounds like the whitespace is occasionally not a plain space (U+0020) but some other whitespace character, for example a non-breaking space (U+00A0). You can split on all whitespace by using \s as the separator instead, which matches any whitespace character, including spaces, tabs, line breaks and non-breaking spaces.

const text = "Le coronavirus en France : la décrue se poursuit en réanimation, la reprise économique au cœur des préoccupations. La mise en œuvre du plan de déconfinement élaboré par le gouvernement doit encore faire l’objet, jeudi, d’un « travail de concertation et d’adaptation aux réalités de terrain » avec les responsables et les élus locaux.";
const words = text.split(/\s/);
console.log(words);
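
To illustrate why a literal ' ' can miss some gaps, here is a minimal sketch. It assumes the problematic text contains a non-breaking space (U+00A0); the question does not confirm which character is actually involved. The sample string and the helper name describeWhitespace are hypothetical and only there for illustration.

```typescript
// Hypothetical sample: the gap between "économique" and "au" is a
// non-breaking space (U+00A0), not a plain space (U+0020).
const sample: string = "reprise économique\u00A0au cœur";

// Splitting on a literal space leaves "économique\u00A0au" in one piece.
const bySpace: string[] = sample.split(" ");

// \s also matches U+00A0; the + treats a run of whitespace as one separator.
const byWhitespace: string[] = sample.split(/\s+/);

// Hypothetical helper: report every whitespace character as a code point label,
// which makes "invisible" differences between spaces visible.
function describeWhitespace(text: string): string[] {
  const found: string[] = text.match(/\s/g) || [];
  return found.map((ch) => {
    const hex = ch.charCodeAt(0).toString(16).toUpperCase();
    return "U+" + ("0000" + hex).slice(-4);
  });
}

console.log(bySpace.length);      // 3
console.log(byWhitespace.length); // 4
console.log(describeWhitespace(sample)); // [ 'U+0020', 'U+00A0', 'U+0020' ]
```

Running describeWhitespace on the real variable (rather than on text copied and pasted from the page) should show whether anything other than U+0020 is present.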

Another option would be to use match instead of split, and match non-whitespace characters.

const text = "Le coronavirus en France : la décrue se poursuit en réanimation, la reprise économique au cœur des préoccupations. La mise en œuvre du plan de déconfinement élaboré par le gouvernement doit encore faire l’objet, jeudi, d’un « travail de concertation et d’adaptation aux réalités de terrain » avec les responsables et les élus locaux.";
const words = text.match(/\S+/g);
console.log(words);
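
One detail worth keeping in mind with the match approach: match returns null rather than an empty array when nothing matches. A minimal sketch of a wrapper that handles this is shown below; it reuses the question's splitIntoWords name as a standalone function, which is an assumption about how the code might be organised.

```typescript
// Hypothetical standalone version of the question's splitIntoWords method.
function splitIntoWords(text: string): string[] {
  // \S+ matches each run of non-whitespace characters.
  // match returns null when there is no match, so fall back to [].
  return text.match(/\S+/g) || [];
}

console.log(splitIntoWords("  la  reprise\u00A0économique "));
console.log(splitIntoWords("   "));
```

Unlike split(/\s/), this does not produce empty strings for leading, trailing or repeated whitespace.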

4 Comments

Thanks - this seems to be the prob. I didn't realise there could be different spaces under different circumstances (variable vs. literal text).
This doesn't seem to be the problem. The whitespace between économique and au in the OP's question is just a boring old bog-standard U+0020 SPACE 7-bit ASCII character. In fact, I cannot reproduce the OP's problem at all: if I copy&paste the OP's code and the sample text (and I am very careful to preserve all whitespace exactly as it is in the question), then the OP's code works just fine.
@JörgWMittag It works though. I tried it out. I don't know why there should be a difference between trying this as a literal string ("économique au").split(), and using a variable (text.split()), but there is.
I tried everything until I found this solution; both approaches worked well.
