Adding a string by a regex from a regex

Question

I have the following example of messy data consisting of different strings. I want to convert the URLs to a uniform website format that includes the protocol and a given path. Excluding the other strings is less important.

The following is a pandas Series:

0                                 None
1   http://fakeurl.com/example/fakeurl
2    https://www.qwer.com/example/qwer
3                                 None
4                test.com/example/test
5                                 None
6                            123135123
7                            nourlhere
8                                  lol
9                             hello.tv
10                              nolink
11                  ihavenowebsite.com

In my code I first reduce all URLs to the plain domain.com plus the path, if they have one, and then use a regular expression to add the protocol. With a second regular expression I want to add a path to those without one, following the pattern https://www.example.com/example/example, so the end of the path should repeat the domain name.

Code:

import pandas as pd

def change_by_regexp(dfc, regexp, string):
    # Overwrite, in place, the rows whose text matches regexp from its start
    dfc[dfc.str.match(regexp)] = string

example = pd.Series(['None', 'http://fakeurl.com/example/fakeurl', 'https://www.qwer.com/example/qwer', 'None', 'test.com/example/test', 'None', '123135123', 'nourlhere', 'lol', 'hello.tv', 'nolink', 'ihavenowebsite.com'])
example = example.map(lambda x: x.replace('https://www.', ''))
example = example.map(lambda x: x.replace('www.', ''))
example = example.map(lambda x: x.replace('https://', ''))
example = example.map(lambda x: x.replace('http://', ''))

change_by_regexp(example, r'([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.[a-zA-Z0-9()]{1,6}\b','http://www.' + example)
change_by_regexp(example, r'^((http[s]?|ftp):\/)?\/?([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.[a-zA-Z0-9()]{1,6}\b$', example + '/example/')
print(example)

Output:

0                                       None
1     http://www.fakeurl.com/example/fakeurl
2           http://www.qwer.com/example/qwer
3                                       None
4           http://www.test.com/example/test
5                                       None
6                                  123135123
7                                  nourlhere
8                                        lol
9               http://www.hello.tv/example/
10                                    nolink
11    http://www.ihavenowebsite.com/example/
dtype: object

Is there a way to take the hostname and append it to the end of the path? Could another regex search for the hostname and return it? I couldn't find a good solution yet. To reach my...

Expected Output:

0                                                   None
1                 http://www.fakeurl.com/example/fakeurl
2                       http://www.qwer.com/example/qwer
3                                                   None
4                       http://www.test.com/example/test
5                                                   None
6                                              123135123
7                                              nourlhere
8                                                    lol
9                      http://www.hello.tv/example/hello
10                                                nolink
11  http://www.ihavenowebsite.com/example/ihavenowebsite
dtype: object

Rob Raymond · Accepted Answer · 2020-08-25 10:36:25Z

I refactored your code a little to make it more readable, and I use urllib.parse for the final step.

import re
import urllib.parse

import pandas as pd

example = pd.Series(['None', 'http://fakeurl.com/example/fakeurl', 'https://www.qwer.com/example/qwer', 'None', 'test.com/example/test', 'None', '123135123', 'nourlhere', 'lol', 'hello.tv', 'nolink', 'ihavenowebsite.com'])

re1 = r'([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.[a-zA-Z0-9()]{1,6}\b'
re2 = r'^((http[s]?|ftp):\/)?\/?([-a-zA-Z0-9\u0080-\u024F@:%._\+~#=]{1,256})\.[a-zA-Z0-9()]{1,6}\b$'
re3 = r'www\.(\w*)'

def modurl(s):
    # Append the host name only to parsed URLs whose path is exactly "/example"
    u = urllib.parse.urlparse(s)
    if u.netloc == "" or u.path != "/example":
        return s
    return f"{s}/{re.findall(re3, u.netloc)[0]}"

example = (example
 .map(lambda x: x.replace('https://www.', ''))
 .map(lambda x: x.replace('www.', ''))
 .map(lambda x: x.replace('https://', ''))
 .map(lambda x: x.replace('http://', ''))
 # add the protocol to anything that looks like a domain
 .map(lambda x: "http://www." + x if re.search(re1, x) else x)
 # add the path to bare domains that have none yet
 .map(lambda x: x + "/example" if re.search(re2, x) else x)
 .map(modurl)
)

print(example.to_string())

Output:

0                                                  None
1                http://www.fakeurl.com/example/fakeurl
2                      http://www.qwer.com/example/qwer
3                                                  None
4                      http://www.test.com/example/test
5                                                  None
6                                             123135123
7                                             nourlhere
8                                                   lol
9                     http://www.hello.tv/example/hello
10                                               nolink
11    http://www.ihavenowebsite.com/example/ihavenow...
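
The two guards in modurl rely on how urlparse splits a string. A minimal sketch of that behaviour, using sample strings taken from the data above (the variable names are only for illustration):

```python
from urllib.parse import urlparse

# With a scheme and "//", the host goes into netloc and the rest into path.
with_scheme = urlparse("http://www.hello.tv/example")
print(with_scheme.netloc, with_scheme.path)

# Without "//", urlparse treats the whole string as a path and leaves netloc empty.
without_scheme = urlparse("hello.tv/example")
print(repr(without_scheme.netloc), without_scheme.path)
```

This is why strings such as nolink or 123135123 fall through the netloc check untouched, and why the protocol has to be added before modurl runs.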

user8560167 · Accepted Answer · 2020-08-25 10:15:53Z

import pandas as pd

example = pd.Series(['None', 'http://fakeurl.com/example/fakeurl', 'https://www.qwer.com/example/qwer',
                     'None', 'test.com/example/test',
                     'None', '123135123', 'nourlhere', 'lol', 'hello.tv', 'nolink', 'ihavenowebsite.com'])

example = pd.DataFrame(example.rename('main'))

# strip the protocol and the "www." prefix (literal matches, not regex)
example['path'] = (example['main']
    .str.replace('http://', '', regex=False)
    .str.replace('www.', '', regex=False)
    .str.replace('https://', '', regex=False))

# take only rows with an address (they contain a dot) and extract the host name
has_dot = example['path'].str.contains('.', regex=False)
example.loc[has_dot, 'host'] = example.loc[has_dot, 'path'].str.split('.').str[0]

example['host'] = '/example/' + example['host']

# add the path where there is no /example/host_name yet
no_path = ~example['main'].str.contains('/example/', regex=False)
example.loc[no_path, 'main'] = example.loc[no_path, 'main'] + example.loc[no_path, 'host']

# rows without a host became NaN above; restore their original text
example.loc[example['main'].isna(), 'main'] = example.loc[example['main'].isna(), 'path']
example = example[['main']]
print(example)

Output:

                                         main
0                                        None
1          http://fakeurl.com/example/fakeurl
2           https://www.qwer.com/example/qwer
3                                        None
4                       test.com/example/test
5                                        None
6                                   123135123
7                                   nourlhere
8                                         lol
9                      hello.tv/example/hello
10                                     nolink
11  ihavenowebsite.com/example/ihavenowebsite
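
On the original question of whether another regex can find the hostname and return it: a capture group plus a backreference in the replacement can do both in one substitution. Below is a minimal sketch using only the standard re module. The pattern, the append_host helper and the sample list are illustrative assumptions rather than part of either answer, and the pattern expects the http://www. form that the question's first step produces.

```python
import re

# Hypothetical pattern: group 1 is everything up to "/example",
# group 2 is the host label right after "www.".
PATTERN = re.compile(r"^(https?://www\.([\w-]+)\.[\w.-]+/example)/?$")

def append_host(url):
    """Append the captured host label to URLs whose path is exactly /example."""
    # \1 re-inserts the matched prefix, \2 repeats the host label after it.
    return PATTERN.sub(r"\1/\2", url)

samples = [
    "http://www.hello.tv/example/",
    "http://www.fakeurl.com/example/fakeurl",
    "nolink",
]
results = [append_host(s) for s in samples]
print(results)
```

Strings that do not match the pattern come back unchanged, so URLs that already have a full path and the non-URL entries are left alone. On the Series from the question this could be applied with example.map(append_host).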
