
I need a function to extract just the name from a URL.

For example, when the input is www.google.com, I want the output to be google.

www.facebook.com -> facebook

After a few searches I found the function parse_url($url, PHP_URL_HOST). With this function, when I input www.google.com/blahblah/blahblah, I get the output www.google.com.
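
Here is a minimal reproduction of that behaviour; note that parse_url() only detects a host when the URL carries a scheme (without one, the whole string is treated as the path):

echo parse_url('http://www.google.com/blahblah/blahblah', PHP_URL_HOST);
// prints: www.google.com -- the host still carries the www. prefix and .com suffix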

  • Why? The last part often isn't useful without the first part. For example, suppose you had "jash.jacob.com"; why would you want just the "jacob"? Commented Aug 29, 2013 at 15:38
  • What about when you put in www.mail.google.com? What should the output be in that case? Commented Aug 29, 2013 at 15:38
  • www is just a subdomain. However, the subdomain may have a significant impact on the outcome of the address, so stripping it away is a bad idea. The TLD is also required (google.de and google.com -> different outcomes). Commented Aug 29, 2013 at 15:40
  • After using parse_url, have you tried removing the www. and .com? It is just a few string operations; use the documentation, do not be afraid to make mistakes, and you will figure it out quickly. Commented Aug 29, 2013 at 15:43
  • I don't think what you are trying to do is easily achievable, as there is nothing to stop me buying mydomain.co.uk and setting up a server hosted at im.going.to.mess.your.function.up.with.mydomain.co.uk. But if you detail what problem this function is solving, there is more than likely a solution out there for it. Commented Aug 29, 2013 at 15:52

3 Answers


There is only one halfway reliable way to do this, I think, and you'll need to create a class for it. Personally I use something like namespace\Domain extends namespace\URI; a Domain is essentially a subset of a URI, so technically I create two classes.

Your Domain class will probably need a static class member to hold the list of valid TLDs; this may as well live in the URI class, as you may want to reuse it with other sub-classes.

namespace My;

class URI {

  protected static $tldList;
  private static $_tldRepository = 'http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1';

  protected $uri;

  public function __construct($sURI = "") {
    if(!self::$tldList) {

      //static method to load the TLD list from Mozilla
      //  and parse it into an array, which sets self::$tldList
      self::loadTLDList();
    }

    //if the URI has been passed in - set it
    if($sURI) $this->setURI($sURI);
  }

  public function setURI($sURI) {
    $this->uri = $sURI; //needs validation and sanity checks of course
  }

  public function getURI() {
    return $this->uri;
  }


  //other methods ...

}

In reality I make a copy of the TLD list in a file on the server and use that, updating it only every six months; this avoids the overhead of fetching the full TLD list from Mozilla the first time a URI object is created on any page.
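
A minimal sketch of what that loader might look like inside the URI class; the cache file path and refresh window are illustrative assumptions, and the parsing ignores the wildcard (*.) and exception (!) rules that the real effective_tld_names.dat contains:

  //sketch of the static loader referenced in the constructor
  protected static function loadTLDList() {
    $cacheFile = __DIR__ . '/effective_tld_names.dat';
    $sixMonths = 182 * 24 * 60 * 60;

    //refresh the local copy only when it's missing or stale
    //(copy() from a URL needs allow_url_fopen enabled)
    if(!is_file($cacheFile) || time() - filemtime($cacheFile) > $sixMonths) {
      copy(self::$_tldRepository, $cacheFile);
    }

    self::$tldList = array();
    foreach(file($cacheFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
      if(substr($line, 0, 2) === '//') continue; //skip comment lines
      self::$tldList[$line] = true;              //keyed for fast lookups
    }
  }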

Now you may have a Domain sub-class that extends \My\URI and allows you to break the URI down into component parts. There might be a method to remove the TLD, based on the TLD list you've loaded into parent::$tldList from mxr.mozilla.org. Once you've taken out the valid TLD, whatever is just to the left of it (between the last . and the TLD) should be the domain; anything left of that would be sub-domains.

You can have methods to extract that data as required as well.
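
A rough sketch of that sub-class, assuming self::$tldList is keyed by suffix as in the loader sketch above; the method name getDomainName() is mine, not part of the original design, and it too ignores the wildcard and exception rules of the real list:

namespace My;

class Domain extends URI {

  //find the longest suffix present in the TLD list; the label just
  //left of it is the domain, anything further left is sub-domains
  public function getDomainName() {
    //assumes the stored URI includes a scheme so parse_url() finds a host
    $host = parse_url($this->getURI(), PHP_URL_HOST);
    $labels = explode('.', $host);

    //try the longest candidate suffix first ("co.uk" before "uk")
    for($i = 1; $i < count($labels); $i++) {
      $suffix = implode('.', array_slice($labels, $i));
      if(isset(self::$tldList[$suffix])) {
        return $labels[$i - 1];
      }
    }
    return null; //no known TLD found
  }
}

With a host like www.google.co.uk this walks through google.co.uk (no match) and co.uk (match), then returns the label to its left: google.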


4 Comments

The only problem I see with this is that second-level domains will not be filtered out, e.g. content.met.police.uk; police.uk is a second-level domain according to the wiki en.wikipedia.org/wiki/.uk, and there are thousands of domains that will fit this pattern.
Hmmm, good point; but unless you're going to maintain a list of all second-level domains yourself (which would be impractical) I can't really see a way around it :\ I mainly use this as part of my email validation so that I can get a domain name to throw at getmxrr(); it works well in practice, but yeah, it's not 100%.
^ But then, in practice, I never actually strip the TLD, although the class allows me to attempt it by using this method in conjunction with a couple of (really long) RegExps stored as class constants.
That's what I mean by my comment: I don't think what is being asked is easily achievable, and depending on the problem there may be a better solution than what the OP is asking for. I think your answer is a good punt, but like I have said, it's going to fall down on thousands of domains. Without knowing the OP's intent it's difficult to recommend any answer currently. Tres, well done on the best answer here though! :-)

This does what you ask, though I agree with the comments about stripping the TLD.

preg_match("/([^\.\/]+)\.[a-z\.]{2,6}$/i", "http://www.google.com", $match);
echo $match[1];

It essentially matches the part before the TLD. I believe the longest common TLDs come in at around 6 characters, and the {2,6} window also allows for dots, so it can absorb a short multi-part suffix. The TLD part isn't foolproof, but it works for most inputs.
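
To illustrate with my own trace (not part of the original answer): the {2,6} window happens to absorb a short two-part suffix like co.uk, but a longer one such as police.uk slips past it and the match lands one label too deep:

preg_match("/([^\.\/]+)\.[a-z\.]{2,6}$/i", "http://www.bbc.co.uk", $match);
echo $match[1]; // "bbc" -- ".co.uk" fits inside the 6-character window

preg_match("/([^\.\/]+)\.[a-z\.]{2,6}$/i", "http://content.met.police.uk", $match);
echo $match[1]; // "police" -- "met" was wanted, but "police.uk" is too long for the window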



Regex and parse_url() aren't the solution for you.

You need a package that uses the Public Suffix List; only this way can you correctly extract domains with second- and third-level TLDs (co.uk, a.bg, b.bg, etc.) and multilevel subdomains.

I recommend TLD Extract. Here is an example of the code:

$extract = new LayerShifter\TLDExtract\Extract();

$result = $extract->parse('www.google.com/blahblah/blahblah');
$result->getHostname(); // will return (string) 'google'
$result->getRegistrableDomain(); // will return (string) 'google.com'
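
If you install it via Composer, the package name should be layershifter/tld-extract, matching the vendor namespace in the example (something like composer require layershifter/tld-extract, though check Packagist for the current name and version).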

