6

We need to generate a unique URL from the title of a book, where the title can contain any character. How can we search and replace all the 'invalid' characters so that a valid and neat-looking URL is generated?

For instance:

"The Great Book of PHP"

www.mysite.com/book/12345/the-great-book-of-php

"The Greatest !@#$ Book of PHP"

www.mysite.com/book/12345/the-greatest-book-of-php

"Funny title     "

www.mysite.com/book/12345/funny-title

8 Answers

17

Ah, slugification

// This function expects the input to be UTF-8 encoded.
function slugify($text)
{
    // Replace each run of characters that are neither letters nor digits with a -
    $text = preg_replace('/[^\\pL\d]+/u', '-', $text);

    // Trim out extra -'s
    $text = trim($text, '-');

    // Convert letters that we have left to the closest ASCII representation
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);

    // Make text lowercase
    $text = strtolower($text);

    // Strip out anything we haven't been able to convert
    $text = preg_replace('/[^-\w]+/', '', $text);

    return $text;
}

This works fairly well. It first uses the Unicode properties of each character to decide whether it is a letter (\pL) or a digit (\d), and replaces every run of anything else with a -. It then trims leading and trailing -'s, transliterates what is left to ASCII, lowercases it, and finally strips anything that still could not be converted. (Fabrik's test returns "arvizturo-tukorfurogep")

I also tend to add a list of stop words ("the", "of", "or", "a", etc.) so that those are removed from the slug. Don't filter on word length, though, or you strip out things like "php".
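A minimal sketch of that stop-word step, assuming it runs on a slug that has already been built. The function name remove_stop_words and the word list are made up for illustration; pick the list that suits your titles.

```php
<?php
// Hypothetical helper: drops stop words from an already-built slug.
// The default word list is an illustrative assumption, not a definitive one.
function remove_stop_words(string $slug, array $stopWords = ['the', 'of', 'or', 'a', 'an', 'and']): string
{
    // Split the slug on hyphens, skipping empty pieces.
    $words = preg_split('/-+/', $slug, -1, PREG_SPLIT_NO_EMPTY);

    // Keep only words that are not in the stop list (matched by value, not by length).
    $kept = array_values(array_filter($words, function ($word) use ($stopWords) {
        return !in_array($word, $stopWords, true);
    }));

    // If every word was a stop word, fall back to the original slug
    // rather than return an empty string.
    return $kept ? implode('-', $kept) : $slug;
}

echo remove_stop_words('the-great-book-of-php'), "\n"; // great-book-php
```

Matching whole words against a list is what keeps short but meaningful words such as "php" in the slug.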


5 Comments

Simple yet brilliant! +++ ;) (Now wondering what's that hocus-pocus inside WP source :o)
The Unicode matching only works on PHP 5.1+ and iconv might not be installed on some servers - they have to cater for everyone.
If I may suggest an edit, I've added $text = utf8_encode($text); at the first line. Without this conversion, a string such as Mon titre français returned blank, whereas now it becomes mon-titre-francais.
@PubliDesign Then your internal encoding is not set to UTF-8. You can enforce this by using mb_internal_encoding('UTF-8') or setting responsible INI values. Your string is working out-of-the-box with @Mez's code.
@althaus, The original code doesn't force the string to be UTF-8, which may result in weird unwanted characters (ex: ? in a black triangle). Having tried this string with the added $text = utf8_encode($text);, I have had great results after several tests.
7

If “invalid” means non-alphanumeric, you can do this:

function foo($str) {
    return trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($str)), '-');
}

This will turn $str into lowercase, replace any sequence of one or more non-alphanumeric characters by one hyphen, and then remove leading and trailing hyphens.

var_dump(foo("The Great Book of PHP") === 'the-great-book-of-php');
var_dump(foo("The Greatest !@#$ Book of PHP") === 'the-greatest-book-of-php');
var_dump(foo("Funny title     ") === 'funny-title');

5 Comments

Fails too. Sorry. Please read the question: "the title can contain any character"
@fabrik: So what’s wrong? Didn’t you test the examples? They all yield true.
@fabrik: “If ‘invalid’ means non-alphanumeric […]” – matt_tm didn’t say anything about what invalid means. I just assumed that he means non-alphanumeric.
@Gumbo: Thank you for at least trying to understand what I'm talking about. It's not only Hungarian characters: take a book about Citroën and there you go, accented characters in an international brand's name. Yes, OP didn't specify what is invalid and what is not, but he did state "the title can contain any character". (And, because we are talking about books, there's a chance of accented characters.)
Hi - sorry to barge in on your conversation, and yes, non-English characters should be accounted for as well... It's not a strict requirement that the 'visible' title be absolutely the same as the actual title, but it MUST be a valid URL...
2

You can use a simple regular expression for this purpose:

<?php
    function safeurl( $v )
    {
        $v = strtolower( $v );
        $v = preg_replace( "/[^a-z0-9]+/", "-", $v );
        $v = trim( $v, "-" );
        return $v;
    }
    echo "<br>www.mysite.com/book/12345/" . safeurl( "The Great Book of PHP" );
    echo "<br>www.mysite.com/book/12345/" . safeurl( "The Greatest !@#$ Book of PHP" );
    echo "<br>www.mysite.com/book/12345/" . safeurl( "  Funny title  " );
    echo "<br>www.mysite.com/book/12345/" . safeurl( "!!Even Funnier title!!" );
?>

6 Comments

Sorry, Salman. I've tried your script with a Hungarian sentence which contains all of our vowels and it fails: ideone.com/WDcV8
@fabrik: No one said anything about Hungarian. I'd -1 your comment if I could.
Does the question mention Hungarian?
From the question: "where the title can contain any character".
This fails for leading or trailing invalid characters except whitespace.
1

If you want to allow only letters, digits and underscore (usual word characters) you can do:

$str = strtolower(preg_replace(array('/\W/','/-+/','/^-|-$/'),array('-','-',''),$str));

It first replaces any non-word character (\W) with a -.
Next it replaces any run of consecutive -'s with a single -.
Finally it deletes any leading or trailing -.
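To make the three passes easier to follow, here is a sketch that applies the same patterns one statement at a time. The function name slug_steps is made up for illustration.

```php
<?php
// Hypothetical name; same three replacements as the one-liner, applied one at a time.
function slug_steps(string $str): string
{
    $str = preg_replace('/\W/', '-', $str);    // each non-word character becomes "-"
    $str = preg_replace('/-+/', '-', $str);    // collapse runs of "-" into one
    $str = preg_replace('/^-|-$/', '', $str);  // drop a leading or trailing "-"
    return strtolower($str);
}

echo slug_steps('The Greatest !@#$ Book of PHP'), "\n"; // the-greatest-book-of-php
echo slug_steps('Funny title     '), "\n";              // funny-title
echo slug_steps('snake_case title'), "\n";              // snake_case-title
```

Note that the underscore counts as a word character, so it survives into the slug, as the last example shows.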

Working link

2 Comments

Go ahead and downvote Gumbo too. I bet you're having a bad day.
@Salman: Please understand it's not an easy preg_replace: core.trac.wordpress.org/browser/tags/3.0.1/wp-includes/…
1

This code comes from CodeIgniter's url helper. It should do the trick.

    function url_title($str, $separator = 'dash', $lowercase = FALSE)
    {
        if ($separator == 'dash')
        {
            $search     = '_';
            $replace    = '-';
        }
        else
        {
            $search     = '-';
            $replace    = '_';
        }

        $trans = array(
                        '&\#\d+?;'              => '',
                        '&\S+?;'                => '',
                        '\s+'                   => $replace,
                        '[^a-z0-9\-\._]'        => '',
                        $replace.'+'            => $replace,
                        $replace.'$'            => $replace,
                        '^'.$replace            => $replace,
                        '\.+$'                  => ''
                      );

        $str = strip_tags($str);

        foreach ($trans as $key => $val)
        {
            $str = preg_replace("#".$key."#i", $val, $str);
        }

        if ($lowercase === TRUE)
        {
            $str = strtolower($str);
        }

        return trim(stripslashes($str));
    }


0

Replace the special characters with spaces, and then replace the spaces with "-". Perhaps with str_replace?
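A rough sketch of that idea, under two assumptions: the list of "special" characters below is only an example, and the second step uses preg_split/implode instead of a plain str_replace so that runs of spaces do not turn into runs of hyphens. The function name is hypothetical.

```php
<?php
// Hypothetical helper; the list of "special" characters is an assumption
// and would need to match whatever the titles actually contain.
function slug_with_str_replace(string $title): string
{
    $special = ['!', '@', '#', '$', '%', '&', '*', '(', ')', '?', ',', '.', ':', ';', "'", '"'];

    // Step 1: special characters become spaces.
    $title = str_replace($special, ' ', $title);

    // Step 2: split on whitespace (dropping empty pieces) and rejoin with "-".
    $words = preg_split('/\s+/', $title, -1, PREG_SPLIT_NO_EMPTY);

    return strtolower(implode('-', $words));
}

echo slug_with_str_replace('The Greatest !@#$ Book of PHP'), "\n"; // the-greatest-book-of-php
echo slug_with_str_replace('Funny title     '), "\n";              // funny-title
```

A fixed list can never cover "any character", so anything not listed passes through unchanged; the regex-based answers avoid having to enumerate characters at all.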

1 Comment

Please explain how you define special characters.
0

Use a regex replace to turn every run of non-letter characters into a hyphen. For example:

preg_replace('/[^a-zA-Z]+/', '-', $input)


0
<?php
$input = "  The Great Book's of PHP  ";
$output = trim(preg_replace(array("`'`", "`[^a-z]+`"),  array("", "-"), strtolower($input)), "-");
echo $output; // the-great-books-of-php

This trims leading and trailing dashes and doesn't turn "it's raining" into "it-s-raining", as most solutions tend to do.

5 Comments

@Gumbo: I find it preferable. Easier to read, no? Otherwise you read it like "it ess raining" and that's just weird.
“It’s” and “its” have a different meaning. The preferable variant would be to use its expanded (unambiguous) variant, so “it is” or “it has”.
@Gumbo: It's a URL. It's supposed to be short and concise.. if anything I'd strip out words like "is" and "has" too. No one is going to be looking for grammatical errors in a URL. And if they can't figure out "its-raining" actually means "it is raining" because there's no apostrophe....then... they need to go back to school.
@Mark: What about constructs with words that are ambiguous like its-meaning?
@Gumbo: When do you ever say "it is meaning"? And who cares? They can visit the website and read the actual title on the actual page in all its unicode glory.
