I am having trouble extracting a string from a URL using the re library.

Here's an example:

http://www.example.it/[email protected]&direction=vente.aspx%3pid%xx123%63abcd

I have a DataFrame and I want to add a column derived from the value in another column. In this example, df['URL_REG'] should contain '123'.

df['URL_REG'] = df['URL'].map(lambda x : re.findall(r'[REGEX]+', x)[0])

The structure of the URL can change, but the part that I want always comes between 'direction=vente.aspx%3pid%' and the next '%'.

2 Answers

Use the vectorized Series.str.extract() method:

In [50]: df['URL_REG'] = df.URL.str.extract(r'direction=vente.aspx\%3pid\%([^\%]+)\%*',
                                            expand=False)

In [51]: df
Out[51]:
                                                 URL URL_REG
0  http://www.example.it/remoteconnexion.aspx?u=x...   xx123

UPDATE:

I want only the '123' part instead of 'xx123', where 'xx' is a hexadecimal number

In [53]: df['URL_REG'] = df.URL.str.extract(r'direction=vente.aspx\%3pid\%\w{2}(\d+)\%*', 
                                            expand=False)

In [54]: df
Out[54]:
                                                 URL URL_REG
0  http://www.example.it/remoteconnexion.aspx?u=x...     123
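
Note that `\w{2}` in the updated pattern accepts any two word characters, and the unescaped `.` matches any character. The following is only a sketch of a stricter variant, not part of the original answer. It assumes the two-character prefix is always a pair of hexadecimal digits (like the '3d' in the asker's '3d123' example), and the URL in it is a hypothetical one modelled on the question.

```python
import re

# Hypothetical URL modelled on the question, with "3d" in place of "xx".
url = "http://www.example.it/page.aspx?direction=vente.aspx%3pid%3d123%63abcd"

# Escaped dot, two hex digits (an assumption), then the digits to capture.
PATTERN = re.compile(r"direction=vente\.aspx%3pid%[0-9A-Fa-f]{2}(\d+)%")

match = PATTERN.search(url)
url_reg = match.group(1) if match else None
print(url_reg)  # 123
```

The same pattern string can be passed to Series.str.extract(..., expand=False). Because 'x' is not a hexadecimal digit, this variant would not match the literal 'xx123' placeholder from the question.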

3 Comments

It works, but I forgot to mention that I want only the '123' part instead of 'xx123'.
@Omar14, do you have a literal %xx in the URL, or do those xx stand for digits?
Most of the time I have one digit and one letter, e.g. '3d123', where xx = 3d.
You can use this pattern:

import re

url='http://www.example.it/[email protected]&direction=vente.aspx%3pid%xx123%63abcd'
output = re.findall('3pid%(.*?)%', url)

print(output)

Output:

['xx123']

Then apply the same pattern to your DataFrame.

For example:

import pandas as pd
import re

df = pd.DataFrame(['http://www.example.it/[email protected]&direction=vente.aspx%3pid%xx123%63abcd'], columns = ['URL'])

output = df['URL'].apply(lambda x : re.findall('3pid%(.*?)%', x))

print(output)

# Or, maybe if you want to return the url and the data captured:
# output = df['URL'].apply(lambda x : (x, re.findall('3pid%(.*?)%', x)))
# output[0]
# >>> ('http://www.example.it/[email protected]&direction=vente.aspx%3pid%xx123%63abcd', 
#   ['xx123'])

Output:

0    [xx123]
Name: URL, dtype: object
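
As the output shows, re.findall() gives a list for every row, and indexing it with [0] (as in the question) raises IndexError for a URL that has no match. One possible way around both issues, sketched here with a hypothetical helper named extract_pid and made-up URLs, is to use re.search() and return None when nothing matches:

```python
import re

PID_RE = re.compile(r"3pid%(.*?)%")

def extract_pid(url):
    """Return the first captured value, or None when the URL has no match."""
    match = PID_RE.search(url)
    return match.group(1) if match else None

# Hypothetical URLs: one with the expected structure, one without it.
urls = [
    "http://www.example.it/page.aspx?direction=vente.aspx%3pid%xx123%63abcd",
    "http://www.example.it/page.aspx?other=1",
]
print([extract_pid(u) for u in urls])  # ['xx123', None]
```

With a DataFrame, the helper could be applied as df['URL'].map(extract_pid), which yields plain strings instead of one-element lists.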
