
I want to extract the website name from a URL. For example, https://plus.google.com/in/test.html should give the output "plus google".

Some more test cases:

  1. WWW.OH.MADISON.STORES.ADVANCEAUTOPARTS.COM/AUTO_PARTS_MADISON_OH_7402.HTML

Output: OH MADISON STORES ADVANCEAUTOPARTS

  2. WWW.LQ.COM/LQ/PROPERTIES/PROPERTYPROFILE.DO?PROPID=6054

Output: LQ

  3. WWW.LOCATIONS.DENNYS.COM

Output: LOCATIONS DENNYS

  4. WV.WESTON.STORES.ADVANCEAUTOPARTS.COM

Output: WV WESTON STORES ADVANCEAUTOPARTS

  5. WOODYANDERSONFORDFAYETTEVILLE.NET/

Output: WOODYANDERSONFORDFAYETTEVILLE

  6. WILMINGTONMAYFAIRETOWNCENTER.HGI.COM

Output: WILMINGTONMAYFAIRETOWNCENTER HGI

  7. WHITEHOUSEBLACKMARKET.COM/

Output: WHITEHOUSEBLACKMARKET

  8. WINGATEHOTELS.COM

Output: WINGATEHOTELS

string = str(input("Enter the url "))
new_list = list(string)
count=0
flag=0

if 'w' in new_list:
    index1 = new_list.index('w')
    new_list.pop(index1)
    count += 1
if 'w' in new_list:
    index2 = new_list.index('w')
    if index2 != -1 and index2 == index1:
        new_list.pop(index2)
        count += 1
if 'w' in new_list:
    index3= new_list.index('w')
    if index3!= -1 and index3== index2 and new_list[index3+1]=='.':
        new_list.pop(index3)
        count+=1      
        flag = 1
if flag == 0:
    start = string.find('/')
    start += 2
    end = string.rfind('.')

    new_string=string[start:end]
    print(new_string)
elif flag == 1:
    start = string.find('.')
    start = start + 1
    end = string.rfind('.')

    new_string=string[start:end]
    print(new_string)

The above works for some testcases but not all. Please help me with it.

Thanks

2 Answers


This is something you could build upon, using urllib.parse.urlparse:

from urllib.parse import urlparse

tests = ('https://plus.google.com/in/test.html',
         ('WWW.OH.MADISON.STORES.ADVANCEAUTOPARTS.COM/'
          'AUTO_PARTS_MADISON_OH_7402.HTML'),
         'WWW.LQ.COM/LQ/PROPERTIES/PROPERTYPROFILE.DO?PROPID=6054')

def extract(url):
    # without a scheme, urlparse would treat the host as part of the path
    if '://' not in url:
        url = 'http://' + url
    parsed = urlparse(url).netloc
    split = parsed.split('.')[:-1] # get rid of TLD
    if split[0].lower() == 'www':
        split = split[1:]
    ret = ' '.join(split)
    return ret

for url in tests:
    print(extract(url))
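
The same approach can be checked against the rest of the test cases from the question. The following is only a minimal sketch: `site_words` is a hypothetical name, and it assumes that any input without `://` is a bare host (optionally followed by a path), that the last dot-separated label is the TLD, and that prefixing `//` is enough for `urlsplit` to treat the first component as the network location. It reads `netloc` so that the original letter case is kept.

```python
from urllib.parse import urlsplit

def site_words(url):
    """Return the host labels of url, minus a leading 'www' and the last label."""
    # Assumption: input without '://' is a bare host[/path]; a leading '//'
    # tells urlsplit that what follows is the network location.
    if '://' not in url:
        url = '//' + url
    labels = urlsplit(url).netloc.split('.')
    if labels[0].lower() == 'www':
        labels = labels[1:]
    # Drop the last label, assumed here to be the TLD.
    return ' '.join(labels[:-1])

for url in ('https://plus.google.com/in/test.html',
            'WWW.LOCATIONS.DENNYS.COM',
            'WOODYANDERSONFORDFAYETTEVILLE.NET/',
            'WINGATEHOTELS.COM'):
    print(site_words(url))
```

Multi-part suffixes such as `.co.uk` are not handled by this sketch; only the final label is removed.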

The function extracts the part of the URL between the double slash and the next single slash; the rest is clean-up.

def stripURL(url, TwoSlashes, OneSlash):
    # Return the text between the first TwoSlashes and the next OneSlash,
    # or "" when either delimiter is missing.
    try:
        start = url.index(TwoSlashes) + len(TwoSlashes)
        end = url.index(OneSlash, start)
        return url[start:end]
    except ValueError:
        return ""

url = input("URL : ")
url = url.replace("www.", "")
Strip = stripURL(url, "//", "/")
# Strip everything from the last period onward (the TLD)
Stripped = Strip[:Strip.rfind(".")]
# Replace any remaining periods in the name with spaces
Stripped = Stripped.replace(".", " ")
print(Stripped)

3 Comments

When I run the above code with an input such as "www.google.com", there is no output.
The example assumes the URL includes http://, since I used the slashes to delimit the host. Copy a URL from the address bar and it works.
Yeah, now it works fine, but I want the code to support URLs without https:// or http://. Anyway, thanks for helping.
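
One way the slicing idea above might be extended to accept URLs without http:// or https://, as asked in the last comment, is sketched below. It is not the answerer's code: `strip_url` is a hypothetical helper, and it assumes the scheme (when present) ends at the first `://`, the host ends at the first `/`, `?` or `#`, and the last label of the host is the TLD.

```python
def strip_url(url):
    # Keep what follows '://' when a scheme is present, else the whole input.
    _, sep, rest = url.partition('://')
    if not sep:
        rest = url
    # The host ends at the first '/', '?' or '#', whichever comes first.
    for delimiter in '/?#':
        rest = rest.split(delimiter, 1)[0]
    # Remove a leading 'www.' regardless of case.
    if rest.lower().startswith('www.'):
        rest = rest[4:]
    # Cut the last label (assumed to be the TLD) and space out the rest.
    name, _, _ = rest.rpartition('.')
    return name.replace('.', ' ')

print(strip_url('www.google.com'))
print(strip_url('https://plus.google.com/in/test.html'))
```

Using `partition` and `rpartition` avoids the `ValueError` handling, because both return empty strings when the separator is absent.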
