Regex: Extract specific value from URL

Question

I am having some trouble extracting a string from a URL using the re library.

Here's an example:

http://www.example.it/[email protected]&direction=vente.aspx%3pid%xx123%63abcd

I have a DataFrame and I want to add a column derived from a value in another column. In this example, df['URL_REG'] should contain '123'.

df['URL_REG'] = df['URL'].map(lambda x : re.findall(r'[REGEX]+', x)[0])

The structure of the URL can change, but the part I want always comes between 'direction=vente.aspx%3pid%' and the next '%'.

MaxU - stand with Ukraine · Accepted Answer · 2017-06-05 10:44:37Z

Use the vectorized Series.str.extract() method:

In [50]: df['URL_REG'] = df.URL.str.extract(r'direction=vente.aspx\%3pid\%([^\%]+)\%*',
                                            expand=False)

In [51]: df
Out[51]:
                                                 URL URL_REG
0  http://www.example.it/remoteconnexion.aspx?u=x...   xx123

UPDATE:

"I want only the '123' part instead of 'xx123', where 'xx' is a hexadecimal number"

In [53]: df['URL_REG'] = df.URL.str.extract(r'direction=vente.aspx\%3pid\%\w{2}(\d+)\%*', 
                                            expand=False)

In [54]: df
Out[54]:
                                                 URL URL_REG
0  http://www.example.it/remoteconnexion.aspx?u=x...     123
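The question's original lambda indexes `re.findall(...)[0]`, which raises an IndexError for any URL that does not match. As a minimal sketch using only the standard `re` module, the helper below returns None instead. It assumes the two characters after `pid%` are always hexadecimal, as described in the update. The name `extract_id` and the sample URLs are hypothetical and not taken from the original data.

```python
import re

# Assumed pattern: two hex characters follow 'pid%', then the digits to keep.
# The dot in 'vente.aspx' is escaped so it only matches a literal dot.
ID_PATTERN = re.compile(r'direction=vente\.aspx%3pid%[0-9A-Fa-f]{2}(\d+)%')

def extract_id(url):
    """Return the digits after the two-character hex prefix, or None if absent."""
    match = ID_PATTERN.search(url)
    return match.group(1) if match else None

# Hypothetical URLs for illustration only.
sample = 'http://www.example.it/page.aspx?direction=vente.aspx%3pid%3d123%63abcd'
print(extract_id(sample))                         # 123
print(extract_id('http://www.example.it/other'))  # None
```

If this fits your data, it could be applied with `df['URL_REG'] = df['URL'].map(extract_id)`, or the same pattern string could be passed to `Series.str.extract()` as in the answer above.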

edited Jun 5, 2017 at 10:44

answered Jun 5, 2017 at 10:19

3 Comments

Omar14 Over a year ago

It works, but I forgot to mention that I want only the '123' part instead of 'xx123'.

MaxU - stand with Ukraine Over a year ago

@Omar14, do you have %xx in the URL or are those xx - digits?

Omar14 Over a year ago

Most of the time I have one digit and one letter, e.g. '3d123', where xx = 3d.

Chiheb Nexus · Accepted Answer · 2017-06-05 10:32:18Z

You can use this pattern:

import re

url='http://www.example.it/[email protected]&direction=vente.aspx%3pid%xx123%63abcd'
output = re.findall('3pid%(.*?)%', url)

print(output)

Output:

['xx123']

Then apply the same pattern to your DataFrame.

For example:

import pandas as pd
import re

df = pd.DataFrame(['http://www.example.it/[email protected]&direction=vente.aspx%3pid%xx123%63abcd'], columns = ['URL'])

output = df['URL'].apply(lambda x : re.findall('3pid%(.*?)%', x))

print(output)

# Or, maybe if you want to return the url and the data captured:
# output = df['URL'].apply(lambda x : (x, re.findall('3pid%(.*?)%', x)))
# output[0]
# >>> ('http://www.example.it/[email protected]&direction=vente.aspx%3pid%xx123%63abcd', 
#   ['xx123'])

Output:

0    [xx123]
Name: URL, dtype: object
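Two details of this pattern are worth checking against your own data: the lazy quantifier `.*?`, and the fact that `re.findall` returns a list rather than a string. The sketch below uses a hypothetical URL with an extra `%` near the end (not from the original question) to show how lazy and greedy matching differ, and how `re.search` can give a plain string instead.

```python
import re

# Hypothetical URL with an extra '%' later in the string.
url = 'http://www.example.it/page.aspx?direction=vente.aspx%3pid%xx123%63abcd%20z'

lazy = re.findall(r'3pid%(.*?)%', url)   # stops at the first '%' after the value
greedy = re.findall(r'3pid%(.*)%', url)  # runs on to the last '%' in the string

print(lazy)    # ['xx123']
print(greedy)  # ['xx123%63abcd']

# re.search yields a single match object, so a plain string can be returned.
match = re.search(r'3pid%(.*?)%', url)
value = match.group(1) if match else None
print(value)   # xx123
```

Returning a scalar this way avoids a column of one-element lists when the function is used with `apply` or `map`.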
