Create array of words from a string of text

Question

I would like to split a text into single words using PHP. Do you have any idea how to achieve this?

My approach:

function tokenizer($text) {
    $text = trim(strtolower($text));
    $punctuation = '/[^a-z0-9äöüß-]/';
    $result = preg_split($punctuation, $text, -1, PREG_SPLIT_NO_EMPTY);
    for ($i = 0; $i < count($result); $i++) {
        $result[$i] = trim($result[$i]);
    }
    return $result; // contains the single words
}
$text = 'This is an example text, it contains commas and full-stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
print_r(tokenizer($text));

Is this a good approach? Do you have any idea for improvement?

Thanks in advance!

moinudin · Accepted Answer · 2009-04-26 11:20:31Z

31

Use the class \p{P}, which matches any Unicode punctuation character, combined with the \s whitespace class. The u modifier makes PCRE treat the pattern and the subject as UTF-8, so multibyte letters such as ä, ö, ü and ß are not broken apart.

$result = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u', $text, -1, PREG_SPLIT_NO_EMPTY);

This splits on a run of one or more whitespace characters and also swallows any punctuation directly surrounding it. It also matches punctuation at the very beginning or end of the string. As a result, an apostrophe inside a word such as "don't" is kept, while the quotes and exclamation mark in "he said 'ouch!'" are stripped.
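
To make the behaviour described above concrete, here is a minimal sketch. The function name tokenize() and the sample sentence are made up for illustration, and the u modifier assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper, not part of the original answer.
function tokenize(string $text): array
{
    // Split on whitespace plus any punctuation touching it, and on
    // punctuation at the very start or end of the string.
    $pattern = '/(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$)/u';
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

$words = tokenize("He said 'ouch!' and I don't mind, Fri3nd. Größe zählt!");
print_r($words);
// He, said, ouch, and, I, don't, mind, Fri3nd, Größe, zählt
```

Punctuation inside a word (the apostrophe in "don't", or a hyphen in "full-stops") is left alone because it does not touch whitespace or a string boundary.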

edited Apr 26, 2009 at 11:20

answered Apr 26, 2009 at 10:24

moinudin

10 Comments

Peter Perháč Over a year ago

+1, not sure, tho, how this will deal with äöüß. Does regex normally classify äöüß as word characters?

caw Over a year ago

Thank you. This would probably work for English texts, but I also want to extract German umlauts (ä, ö, ü), the "ß" and numbers in a string. The "\W" wouldn't extract "Fri3nd", would it?

moinudin Over a year ago

Seems it does not, but updated answer with something similar that works.

moinudin Over a year ago

Updated answer works with perl (which php regex are based on):

$ echo "äöüß, test" | perl -e 'while (<>) { if (/([\p{P}\s]+)/) { print "$1\n"; } }'
,

Eugene Yokota Over a year ago

Should one split don't into don and t?

Community · Accepted Answer · 2023-11-17 20:38:18Z

14

Tokenize - strtok.

<?php
$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
// Double quotes are required so that \n and \t are interpreted as newline and tab.
$delim = " \n\t,.!?:;";

$tok = strtok($text, $delim);

while ($tok !== false) {
    echo "Word=$tok<br />";
    $tok = strtok($delim);
}
?>

edited Nov 17, 2023 at 20:38

CommunityBot

answered Apr 26, 2009 at 10:23

Eugene Yokota

4 Comments

moinudin Over a year ago

This won't work if you get a : or ; or any other punctuation character you haven't accounted for.

Eugene Yokota Over a year ago

@marcog, I added : and ;. Doesn't {P} catch apostrophe and hyphen?

moinudin Over a year ago

What about cases such as quoting? My updated answer distinguishes between these cases.

Oleksiy M. Over a year ago

Excellent idea. Added +1. The only thing is that there should be double quotes around $delim = " \n\t,.!?:;"; With the single quotes it does not work correctly, it splits by the letter n too.

Community · Accepted Answer · 2017-05-23 12:10:03Z

3

I would first convert the string to lowercase before splitting it up. That would make the i modifier and the array processing afterwards unnecessary. Additionally, I would use the \W shorthand for non-word characters and add a + quantifier.

$text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
$result = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

Edit: Use the Unicode character properties instead of \W, as marcog suggested. Something like [\p{P}\p{Z}] (punctuation and separator characters) would cover the characters more specifically than \W.
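
A minimal sketch of that suggestion follows. Three things here are my assumptions and not part of the answer: the u modifier (needed for UTF-8 input), the extra \s in the class (\p{Z} covers space separators but not tab or newline), and mb_strtolower() from the mbstring extension in place of strtolower(), which does not lowercase multibyte letters such as Ä.

```php
<?php
$text = "Straße, Fri3nd und\tÄpfel... Über-Größe!";

// Lowercase first, using a multibyte-aware function (requires ext-mbstring).
$lower = mb_strtolower($text, 'UTF-8');

// Split on runs of punctuation, Unicode separators and other whitespace.
$result = preg_split('/[\p{P}\p{Z}\s]+/u', $lower, -1, PREG_SPLIT_NO_EMPTY);

print_r($result);
// straße, fri3nd, und, äpfel, über, größe
```

Unlike a pattern that only strips punctuation next to whitespace, this class also splits on hyphens and apostrophes inside words, so "don't" would become "don" and "t".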

edited May 23, 2017 at 12:10

CommunityBot

answered Apr 26, 2009 at 10:35

Gumbo

2 Comments

caw Over a year ago

Thanks, the idea to perform strtolower() before is very good. I'll use this.

mickmackusa Over a year ago

What purpose does strtolower() serve if you are splitting with \W? Do you want to add a u pattern modifier? A note to researchers... \W will not match an underscore.

farzad · Accepted Answer · 2009-04-26 10:29:45Z

1

You can also use PHP's strtok() function to fetch tokens from a large string. You can use it like this:

 $result = array();
 // your original string
 $text = 'This is an example text, it contains commas and full stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
 // pass strtok() your string and a delimiter that specifies how tokens are separated; here words are separated by a space
 $word = strtok($text,' ');
 while ( $word !== false ) {
     $result[] = $word;
     $word = strtok(' ');
 }

See the PHP documentation for strtok() for more.

answered Apr 26, 2009 at 10:29

farzad

2 Comments

roopunk Over a year ago

What is the difference between this and explode(' ', $text)?

farzad Over a year ago

The code sample in the question is a tokenizer, my answer was implying that PHP has a string tokenizer built-in. Also explode() will return all of the words of the text at once, but using strtok() the caller has the choice to stop searching for words in the text, as soon as a desired condition is met. Other than this, I can't think of any other difference.
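
The early-exit point above could look roughly like this. It is only a sketch: the helper name first_word_longer_than() and its stopping condition are invented for illustration.

```php
<?php
// Hypothetical helper: returns the first word longer than $minLength,
// or null if there is none.
function first_word_longer_than(string $text, int $minLength): ?string
{
    $word = strtok($text, ' ');
    while ($word !== false) {
        if (strlen($word) > $minLength) {
            return $word; // stop here; the rest of the text is never tokenized
        }
        $word = strtok(' ');
    }
    return null;
}

$found = first_word_longer_than('This is an example text', 4);
echo $found; // example
```

One more practical difference: strtok() skips empty tokens produced by consecutive delimiters, whereas explode() returns them as empty strings.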

Alix Axel · Accepted Answer · 2009-04-26 10:36:02Z

1

Do:

str_word_count($text, 1);

Or if you need Unicode support:

// $search defaults to '' because passing null to preg_quote() is deprecated as of PHP 8.1
function str_word_count_Helper($string, $format = 0, $search = '')
{
    $result = array();
    $matches = array();

    if (preg_match_all('~[\p{L}\p{Mn}\p{Pd}\'\x{2019}' . preg_quote($search, '~') . ']+~u', $string, $matches) > 0)
    {
        $result = $matches[0];
    }

    if ($format == 0)
    {
        return count($result);
    }

    return $result;
}

edited Apr 26, 2009 at 10:36

answered Apr 26, 2009 at 10:24

Alix Axel

4 Comments

caw Over a year ago

Thanks but this wouldn't work. "Fri3nd" wouldn't be extracted but it should.

David Thomas Over a year ago

I don't understand why "Fri3nd" should be extracted. Removed from the array, broken down into "Fri3" and "nd" (or similar)? O.o

Alix Axel Over a year ago

If you want to consider numbers as words just do str_word_count_Helper($string, 1, '0123456789');

mickmackusa Over a year ago

Native PHP functions that allow double-dot range syntax demonstrates that str_word_count($string, 1, '0..9') will do.
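
A short sketch of the native variant mentioned in the comment above, using a made-up sample sentence. Note that str_word_count() is byte-based and locale-dependent, so this is only meant for plain ASCII text.

```php
<?php
$string = 'My Fri3nd said hi';

// Without extra characters, digits end a word.
$plain = str_word_count($string, 1);

// '0..9' is a range in the third argument: digits now count as word characters.
$withDigits = str_word_count($string, 1, '0..9');

print_r($plain);      // My, Fri, nd, said, hi
print_r($withDigits); // My, Fri3nd, said, hi
```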

jfgrang · Accepted Answer · 2012-10-10 00:23:46Z

1

You can also use the explode() function: http://php.net/manual/en/function.explode.php

$words = explode(" ", $sentence);

answered Oct 10, 2012 at 0:23

jfgrang

1 Comment

LucScu Over a year ago

This does not work with two or more consecutive spaces. You have to loop over explode(" ", $sentence) with a foreach and skip empty entries with if ($word == "") continue; to avoid empty words.
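
To illustrate the consecutive-spaces problem, here is a small sketch with a made-up sentence. It swaps the foreach filter for preg_split() with PREG_SPLIT_NO_EMPTY, which is one possible alternative and not what the comment itself proposes.

```php
<?php
$sentence = 'two  spaces   and more';

// explode() yields an empty string for every extra space.
$withEmpty = explode(' ', $sentence);

// Splitting on runs of whitespace and dropping empty pieces avoids that.
$words = preg_split('/\s+/', $sentence, -1, PREG_SPLIT_NO_EMPTY);

echo count($withEmpty), "\n"; // 7
print_r($words);              // two, spaces, and, more
```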
