I have the following example of messy data consisting of different strings. I want to convert the URLs to a uniform website format that includes the protocol and a given path. Excluding the other values is less important.
The following is a pandas Series:
0 None
1 http://fakeurl.com/example/fakeurl
2 https://www.qwer.com/example/qwer
3 None
4 test.com/example/test
5 None
6 123135123
7 nourlhere
8 lol
9 hello.tv
10 nolink
11 ihavenowebsite.com
In my code I first reduce every URL to the plain domain.com plus its path, if it has one, and then use a regular expression to add the protocol. With a second regular expression I want to add a path to the URLs that have none, following the pattern https://www.example.com/example/example, so the end of the path should repeat the domain name.
Code:
import pandas as pd

def change_by_regexp(dfc, regexp, string):
    # Overwrite, in place, every entry matching regexp with the aligned value from string
    dfc[dfc.str.match(regexp)] = string

example = pd.Series(['None', 'http://fakeurl.com/example/fakeurl', 'https://www.qwer.com/example/qwer', 'None', 'test.com/example/test', 'None', '123135123', 'nourlhere', 'lol', 'hello.tv', 'nolink', 'ihavenowebsite.com'])
# Strip the protocol and "www." so only domain.com + path remains
example = example.map(lambda x: x.replace('https://www.', ''))
example = example.map(lambda x: x.replace('www.', ''))
example = example.map(lambda x: x.replace('https://', ''))
example = example.map(lambda x: x.replace('http://', ''))
# Add the protocol to every entry that starts with something like domain.tld
change_by_regexp(example, r'([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.[a-zA-Z0-9()]{1,6}\b','http://www.' + example)
# Append '/example/' to the entries that consist of a host only (no path)
change_by_regexp(example, r'^((http[s]?|ftp):\/)?\/?([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.[a-zA-Z0-9()]{1,6}\b$', example + '/example/')
print(example)
Output:
0 None
1 http://www.fakeurl.com/example/fakeurl
2 http://www.qwer.com/example/qwer
3 None
4 http://www.test.com/example/test
5 None
6 123135123
7 nourlhere
8 lol
9 http://www.hello.tv/example/
10 nolink
11 http://www.ihavenowebsite.com/example/
dtype: object
Is there a way to take the hostname and append it to the end of the path? Would it be possible with another regex that searches for the hostname and returns it? I couldn't find a good solution yet. To reach my...
Expected Output:
0 None
1 http://www.fakeurl.com/example/fakeurl
2 http://www.qwer.com/example/qwer
3 None
4 http://www.test.com/example/test
5 None
6 123135123
7 nourlhere
8 lol
9 http://www.hello.tv/example/hello
10 nolink
11 http://www.ihavenowebsite.com/example/ihavenowebsite
dtype: object
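One direction that might work is a capture group with a backreference in Series.str.replace, so the hostname matched by the regex is reused at the end of the path. The following is only a sketch under my own assumptions: normalize_urls is a hypothetical helper name, the simplified patterns only cover single-label hosts like hello.tv (not sub.domain.com), and the TLD is assumed to be 2 to 6 letters.

```python
import pandas as pd

def normalize_urls(s):
    # Hypothetical helper: returns a new Series, the input is left unchanged.
    # 1) Drop an optional http(s):// prefix and an optional leading "www."
    stripped = s.str.replace(r'^(?:https?://)?(?:www\.)?', '', regex=True)
    # 2) For a bare host (name.tld, no path), append /example/<name>.
    #    \1 is the captured hostname, \2 the captured TLD.
    bare_host = r'^([-a-zA-Z0-9]+)\.([a-zA-Z]{2,6})$'
    with_path = stripped.str.replace(bare_host, r'\1.\2/example/\1', regex=True)
    # 3) Only entries that look like host.tld[/path] get the protocol prefix;
    #    everything else keeps its original value.
    url_like = with_path.str.match(r'^[-a-zA-Z0-9.]+\.[a-zA-Z]{2,6}(?:/.*)?$')
    return s.where(~url_like, 'http://www.' + with_path)

example = pd.Series(['None', 'http://fakeurl.com/example/fakeurl',
                     'https://www.qwer.com/example/qwer', 'None',
                     'test.com/example/test', 'None', '123135123',
                     'nourlhere', 'lol', 'hello.tv', 'nolink',
                     'ihavenowebsite.com'])
result = normalize_urls(example)
print(result)
```

On the sample data this should turn hello.tv into http://www.hello.tv/example/hello while leaving non-URL strings such as nolink untouched. Hosts with more than one dot before the path would need a different bare_host pattern.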