I have been trying to find an effective URL parser; PHP's own parse_url does not return the subdomain or extension. On php.net, a number of users contributed to this function:

function parseUrl($url) {
    $r  = "^(?:(?P<scheme>\w+)://)?";
    $r .= "(?:(?P<login>\w+):(?P<pass>\w+)@)?";
    $r .= "(?P<host>(?:(?P<subdomain>[-\w\.]+)\.)?" . "(?P<domain>[-\w]+\.(?P<extension>\w+)))";
    $r .= "(?::(?P<port>\d+))?";
    $r .= "(?P<path>[\w/]*/(?P<file>\w+(?:\.\w+)?)?)?";
    $r .= "(?:\?(?P<arg>[\w=&]+))?";
    $r .= "(?:#(?P<anchor>\w+))?";
    $r = "!$r!";                                                // Delimiters

    preg_match ( $r, $url, $out );

    return $out;
}

Unfortunately, it fails on paths containing a '-', and I can't for the life of me work out how to amend it to accept '-' in the path.

Thanks

2 Answers

Try this:

function parseUrl($url) {
    $r  = "^(?:(?P<scheme>\w+)://)?";
    $r .= "(?:(?P<login>\w+):(?P<pass>\w+)@)?";
    $r .= "(?P<host>(?:(?P<subdomain>[-\w\.]+)\.)?" . "(?P<domain>[-\w]+\.(?P<extension>\w+)))";
    $r .= "(?::(?P<port>\d+))?";
    $r .= "(?P<path>[\w/-]*/(?P<file>[\w-]+(?:\.\w+)?)?)?";    // '-' now allowed in path and file
    $r .= "(?:\?(?P<arg>[\w=&]+))?";
    $r .= "(?:#(?P<anchor>\w+))?";
    $r = "!$r!";

    preg_match ( $r, $url, $out );

    return $out;
}

I added dashes to the character classes for the path and the file.
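
To see the effect of that change in isolation, here is a minimal sketch that tests only the path sub-pattern before and after the fix. The `$` anchor and the sample path are my own assumptions, added so that a partial match counts as a failure; the full function above is not end-anchored.

```php
<?php
// Path sub-patterns only; the rest of the URL regex is unchanged.
// Single-quoted strings keep the backslashes exactly as written.
$before = '!^(?P<path>[\w/]*/(?P<file>\w+(?:\.\w+)?)?)?$!';
$after  = '!^(?P<path>[\w/-]*/(?P<file>[\w-]+(?:\.\w+)?)?)?$!';

$path = '/my-dir/some-file.html';   // hypothetical test path

// \w covers letters, digits and underscore only, so '-' stops the match.
$matchesBefore = preg_match($before, $path, $m1);   // 0

// With '-' in both character classes the whole path is consumed.
$matchesAfter = preg_match($after, $path, $m2);     // 1

echo $matchesBefore, ' ', $matchesAfter, ' ', $m2['file'], "\n";
// prints: 0 1 some-file.html
```

Placing the '-' last inside the brackets keeps it a literal dash rather than a range operator.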

1 Comment

It works exactly as I wanted... thank you for actually answering the question :)

It's much easier to use the existing parse_url function and then extract the subdomain from the 'host' index.

Example:

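// example.com stands in for the real host
$url = 'http://username:password@sub.example.com/path?arg=value#anchor';
$urlInfo = parse_url($url);
$host = $urlInfo['host'];                           // 'sub.example.com'
$subdomain = substr($host, 0, strpos($host, '.'));  // text before the first dot
$tld = substr($host, strrpos($host, '.') + 1);      // text after the last dot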
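
The same idea can be wrapped in a small function. This is only a sketch: `splitHost` and its array keys are hypothetical names, and unlike the snippet above it treats everything before the last two labels as the subdomain, so a bare `example.com` yields an empty subdomain rather than `example`.

```php
<?php
// Hypothetical helper (not part of PHP): split a URL's host into
// subdomain, domain and extension using parse_url plus string functions.
function splitHost(string $url): ?array
{
    // Returns null when the host is absent and false for malformed URLs.
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }

    $labels = explode('.', $host);
    if (count($labels) < 2) {
        // Single-label host such as "localhost": nothing to split.
        return ['subdomain' => '', 'domain' => $host, 'extension' => ''];
    }

    $extension = array_pop($labels);   // last label
    $name      = array_pop($labels);   // label before the extension

    return [
        'subdomain' => implode('.', $labels),   // whatever is left, may be ''
        'domain'    => $name . '.' . $extension,
        'extension' => $extension,
    ];
}

$parts = splitHost('http://username:password@www.shop.example.com/path?arg=value#anchor');
// $parts['subdomain'] is 'www.shop', $parts['domain'] is 'example.com'
```

The path, query and fragment are still available from parse_url itself, so no regex is needed for those parts.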

8 Comments

Can you suggest how I might go about that?
@Mark - to do what? What are you trying to achieve?
I am looking to get the subdomain, TLD, domain, path and arguments of a URL. parse_url does not provide the subdomain or TLD.
That will fail for TLDs such as .co.uk.
Well, technically .co.uk is not a TLD; only .uk is. If I understand it right, you will have to keep a manual list for those cases anyway, since there are lots of other "quasi-TLDs" like .co.at, .co.in and so on.
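
For what it's worth, the manual-list idea from the last comment could look roughly like this. It is a sketch under stated assumptions: `MULTI_LABEL_SUFFIXES` and `splitHostWithSuffixes` are hypothetical names, and the list holds only the three suffixes mentioned above, so a real application would need a much longer, maintained list (the Public Suffix List is one such source).

```php
<?php
// Hypothetical, deliberately incomplete list of two-label suffixes.
const MULTI_LABEL_SUFFIXES = ['co.uk', 'co.at', 'co.in'];

// Hypothetical helper: split a host name, treating listed two-label
// suffixes such as "co.uk" as a single extension.
function splitHostWithSuffixes(string $host, array $suffixes = MULTI_LABEL_SUFFIXES): array
{
    $labels = explode('.', strtolower($host));
    if (count($labels) < 2) {
        return ['subdomain' => '', 'domain' => $host, 'extension' => ''];
    }

    // Use a two-label suffix only if it is listed and a name label remains.
    $suffixLength = 1;
    $lastTwo = implode('.', array_slice($labels, -2));
    if (count($labels) >= 3 && in_array($lastTwo, $suffixes, true)) {
        $suffixLength = 2;
    }

    // array_splice removes the suffix labels from $labels and returns them.
    $extension = implode('.', array_splice($labels, -$suffixLength));
    $name      = array_pop($labels);

    return [
        'subdomain' => implode('.', $labels),
        'domain'    => $name . '.' . $extension,
        'extension' => $extension,
    ];
}

$uk = splitHostWithSuffixes('www.example.co.uk');
// $uk['domain'] is 'example.co.uk', $uk['extension'] is 'co.uk'
```

Hosts whose suffix is not in the list fall back to the plain "last label" rule.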