I have the following example of messy data consisting of different strings. I want to convert the URLs to a uniform website format that includes the protocol and a given path. Excluding the other values is less important.

The following is a pandas Series:

0                                 None
1   http://fakeurl.com/example/fakeurl
2    https://www.qwer.com/example/qwer
3                                 None
4                test.com/example/test
5                                 None
6                            123135123
7                            nourlhere
8                                  lol
9                             hello.tv
10                              nolink
11                  ihavenowebsite.com

In my code I first reduce every URL to the plain domain.com plus its path, if it has one, and then use a regular expression to add the protocol. With a second regular expression I want to add a path to the URLs that have none, following the pattern https://www.example.com/example/example, so the end of the path should repeat the domain name.

Code:

import pandas as pd

def change_by_regexp(dfc, regexp, string):
    # Overwrite, in place, every entry that matches regexp from its start
    dfc[dfc.str.match(regexp)] = string
        
example = pd.Series(['None', 'http://fakeurl.com/example/fakeurl', 'https://www.qwer.com/example/qwer', 'None', 'test.com/example/test', 'None', '123135123', 'nourlhere', 'lol', 'hello.tv', 'nolink', 'ihavenowebsite.com'])
example = example.map(lambda x: x.replace('https://www.', ''))
example = example.map(lambda x: x.replace('www.', ''))
example = example.map(lambda x: x.replace('https://', ''))
example = example.map(lambda x: x.replace('http://', ''))

change_by_regexp(example, r'([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.[a-zA-Z0-9()]{1,6}\b','http://www.' + example)
change_by_regexp(example, r'^((http[s]?|ftp):\/)?\/?([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.[a-zA-Z0-9()]{1,6}\b$', example + '/example/')
print(example)

Output:

0                                       None
1     http://www.fakeurl.com/example/fakeurl
2           http://www.qwer.com/example/qwer
3                                       None
4           http://www.test.com/example/test
5                                       None
6                                  123135123
7                                  nourlhere
8                                        lol
9               http://www.hello.tv/example/
10                                    nolink
11    http://www.ihavenowebsite.com/example/
dtype: object

Is there a way to take the hostname and append it to the end of the path? Could another regex that searches for the hostname and returns it do this? I simply couldn't find a good solution yet. To reach my...

Expected Output:

0                                                   None
1                 http://www.fakeurl.com/example/fakeurl
2                       http://www.qwer.com/example/qwer
3                                                   None
4                       http://www.test.com/example/test
5                                                   None
6                                              123135123
7                                              nourlhere
8                                                    lol
9                      http://www.hello.tv/example/hello
10                                                nolink
11  http://www.ihavenowebsite.com/example/ihavenowebsite
dtype: object

2 Answers

I refactored your code a little to make it more readable and used urllib.parse for the final part.

import re
import urllib.parse

import numpy as np
import pandas as pd

example = pd.Series(['None', 'http://fakeurl.com/example/fakeurl', 'https://www.qwer.com/example/qwer', 'None', 'test.com/example/test', 'None', '123135123', 'nourlhere', 'lol', 'hello.tv', 'nolink', 'ihavenowebsite.com'])

re1 = r'([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.[a-zA-Z0-9()]{1,6}\b'
re2 = r'^((http[s]?|ftp):\/)?\/?([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.[a-zA-Z0-9()]{1,6}\b$'
re3 = r'www\.([\w]*)'

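def modurl(s):
    u = urllib.parse.urlparse(s)
    # Leave non-URLs and URLs that already have a full path untouched
    if u.netloc == "" or u.path != "/example":
        return s
    # Append the host name (the part after "www.") to the path
    return f"{s}/{re.findall(re3, u.netloc)[0]}"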

example = (example
 .map(lambda x: x.replace('https://www.', ''))
 .map(lambda x: x.replace('www.', ''))
 .map(lambda x: x.replace('https://', ''))
 .map(lambda x: x.replace('http://', ''))
 .map(lambda x: np.where(bool(re.search(re1, x)), "http://www."+x, x))
 .map(lambda x: np.where(bool(re.search(re2, x)), x+"/example", x))
 .map(lambda x: modurl(x))
)

print(example.to_string())

Output:

0                                                  None
1                http://www.fakeurl.com/example/fakeurl
2                      http://www.qwer.com/example/qwer
3                                                  None
4                      http://www.test.com/example/test
5                                                  None
6                                             123135123
7                                             nourlhere
8                                                   lol
9                     http://www.hello.tv/example/hello
10                                               nolink
11    http://www.ihavenowebsite.com/example/ihavenow...
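
As a quick illustration of why modurl tests netloc and path, the sketch below shows what urllib.parse.urlparse returns for one of the intermediate strings and for a value without a scheme. The sample strings come from the series above; the variable names are only for illustration.

```python
from urllib.parse import urlparse

# With a scheme, the host lands in netloc and the rest in path.
with_scheme = urlparse("http://www.hello.tv/example")
print(with_scheme.netloc)  # www.hello.tv
print(with_scheme.path)    # /example

# Without a scheme there is no netloc: the whole string is treated as a path.
no_scheme = urlparse("hello.tv")
print(repr(no_scheme.netloc))  # ''
print(no_scheme.path)          # hello.tv
```

This is why the protocol has to be added before modurl runs: a bare domain has an empty netloc and would be returned unchanged.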

import pandas as pd

example = pd.Series(['None', 'http://fakeurl.com/example/fakeurl', 'https://www.qwer.com/example/qwer',
                     'None', 'test.com/example/test',
                     'None', '123135123', 'nourlhere', 'lol', 'hello.tv', 'nolink', 'ihavenowebsite.com'])

example = pd.DataFrame(example.rename('main'))

# Strip the protocol and the "www." prefix (literal matches, not regexes)
example['path'] = (example['main']
    .str.replace('http://', '', regex=False)
    .str.replace('www.', '', regex=False)
    .str.replace('https://', '', regex=False))

# Keep only rows that look like an address (contain a dot) and extract the host name
has_dot = example['path'].str.contains('.', regex=False)
example.loc[has_dot, 'host'] = example.loc[has_dot, 'path'].str.split('.').str[0]

example['host'] = '/example/' + example['host']

# Append /example/<host> to rows that do not already contain an /example/ path
no_path = ~example['main'].str.contains('/example/', regex=False)
example.loc[no_path, 'main'] = example.loc[no_path, 'main'] + example.loc[no_path, 'host']

# Rows without a host became NaN above; restore their original value
example.loc[example['main'].isna(), 'main'] = example.loc[example['main'].isna(), 'path']
example = example[['main']]
print(example)

Output:

                                         main
0                                        None
1          http://fakeurl.com/example/fakeurl
2           https://www.qwer.com/example/qwer
3                                        None
4                       test.com/example/test
5                                        None
6                                   123135123
7                                   nourlhere
8                                         lol
9                      hello.tv/example/hello
10                                     nolink
11  ihavenowebsite.com/example/ihavenowebsite
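
For comparison, here is a minimal sketch that produces the expected output from the question in one pass, using Series.str.extract with named groups. It is not the approach of either answer: the pattern URL_RE and the helper name normalize are assumptions made for illustration, and the pattern only recognises simple host.tld[/path] values like those in the sample.

```python
import pandas as pd

# Optional scheme, optional "www.", host, TLD of 2-6 letters, optional path.
# Illustrative pattern only; it is not a general URL validator.
URL_RE = (r'^(?:https?://)?(?:www\.)?'
          r'(?P<host>[A-Za-z0-9-]+)\.(?P<tld>[A-Za-z]{2,6})(?P<path>/.*)?$')

def normalize(urls: pd.Series) -> pd.Series:
    """Hypothetical helper: rewrite URL-like values, leave the rest untouched."""
    parts = urls.str.extract(URL_RE)
    is_url = parts['host'].notna()
    # Default path repeats the host name: /example/<host>
    path = parts['path'].fillna('/example/' + parts['host'])
    rebuilt = 'http://www.' + parts['host'] + '.' + parts['tld'] + path
    # Keep the original value wherever the pattern did not match
    return urls.where(~is_url, rebuilt)

example = pd.Series(['None', 'http://fakeurl.com/example/fakeurl',
                     'https://www.qwer.com/example/qwer', 'None',
                     'test.com/example/test', 'None', '123135123', 'nourlhere',
                     'lol', 'hello.tv', 'nolink', 'ihavenowebsite.com'])
result = normalize(example)
print(result.to_string())
```

Capturing the host in its own group means it can be reused both in the rebuilt domain and as the last path segment, so no second regex pass is needed.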
