1

I am new to regex expressions.

I am trying to filter out values from 2 columns.

The first column look like this:

_source.cookie
__cfduid=d118f225fac35345d9e1d87e533b596ec1574680126; gclid=EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; full_path=https://google.com/free-test/windows/; country_code=OM; clid=06a98eb3-177a-4692-8a15-04cb4c084c1c; ct_t=1574680122; ct_tid=1574680122; _ga=GA1.2.575812751.1574680122; _gid=GA1.2.560773616.1574680122; _gac_UA-138885843-1=1.1574680161.EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; _gat=1; _gcl_aw=GCL.1574680123.EAIaIQobChMIhNSMxZyF5gIVjMjeCh3V2A-pEAAYASABEgJQBPD_BwE; _gcl_au=1.1.1227740955.1574680123; sessionid=yr0pycyfhjh90vauf0z8yw4kxno5rom0; u_id=22b5d5e0-d2b5-4a4a-ad6f-128008b4b466; _gat_UA-138885843-1=1
...
__cfduid=de7d3a7e772a62b171f445ce489bc5f791574680110; gclid=CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE; full_path=https://google.com/au/free-test/; country_code=AU; ct_tid=1574680121; _ga=GA1.2.476582918.1574680125; _gid=GA1.2.1129397609.1574680125; _gat=1; _gcl_au=1.1.356653701.1574680128; _gat_UA-138885843-1=1; clid=3d0b5be5-8b7b-4094-ba47-879252a59a7a; ct_t=1574680159; _gcl_aw=GCL.1574680162.CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE; _gac_UA-138885843-1=1.1574680169.CjwKCAiAlO7uBRANEiwA_vXQ-4dP3_zZJmNXCm-P2acHITBe1XbZZZmQIGKcrL9EaoP4r9CaYEQbPxoC1uQQAvD_BwE
__cfduid=d3b31d4cba74d440bf60e238a62bf46a51574680162; gclid=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; full_path=https://google.com/au/best-test/; country_code=AU; clid=4e65772c-5da2-471a-86dd-240a34fd36ac; ct_t=1574680164; ct_tid=1574680164; _ga=GA1.2.242059245.1574680165; _gid=GA1.2.1757216414.1574680165; _gac_UA-138885843-1=1.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; _gat=1; _gcl_aw=GCL.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE; _gcl_au=1.1.1892979809.1574680165
__cfduid=d054c8a93d4874e31aef9f2966829fefc1574680166; gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; full_path=https://google.com/au/free-test/; country_code=AU; clid=726ebc25-95b9-4507-b29d-998ab54a9eeb; ct_t=1574680164; ct_tid=1574680164; _ga=GA1.2.1271977185.1574680165; _gid=GA1.2.506750010.1574680165; _gac_UA-138885843-1=1.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; _gat=1; _gcl_aw=GCL.1574680165.CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE; _gcl_au=1.1.24394228.1574680165
__cfduid=d27ba2095c6b6ac5fb6108343075969f11574679826; full_path=https://google.com/reviews/testtest/; country_code=VN; ct_tid=1574679826; _ga=GA1.2.2008368313.1574679827; _gid=GA1.2.1231813533.1574679827; _gcl_au=1.1.299737663.1574679827; sessionid=dqwf1zmqdjkv9tdqi1cotr6m2judep2p; u_id=a71d0a87-b93d-4626-8f51-bcc0550dbbee; gclid=EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE; clid=aeb5b4d0-400b-47ee-b916-69a7b03544aa; ct_t=1574680166; _gac_UA-138885843-1=1.1574680167.EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE; _gat=1; _gcl_aw=GCL.1574680167.EAIaIQobChMI-ZOE3ZyF5gIVy2ArCh37VAdGEAEYASAAEgLCaPD_BwE

The second column look like this:

_source.request_url
https://google.com/go/test/?p3
https://google.com/au/test/?gclid=CjwKCAiAlO7uBRANEiwA_vXQ--5YOAD-mFNQFuM0dbd7lHsRBZSfOvhQynhZMhNHkEX-m7gosL23ABoCyS4QAvD_BwE
https://google.com/go/test/
...
https://google.com/api/dto/?click_type=gclid&click_id=CjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&click_src=GET&cid=242059245.1574680165&user_id=&landing_page_uri=https%3A%2F%2Fgoogle.com%2Fau%2Fbest-vpn%2F%3Fgclid%3DCjwKCAiAlO7uBRANEiwA_vXQ-yQeCe4-vuWQiZapqU7H5-YODheBwQf2Ra0c8CZwjf1ZGSqkw1KKXxoCeYMQAvD_BwE&landing_page_referer=&lpu=https%3A%2F%2Fgoogle.com%2Fau%2Fbest-vpn%2F&lpr=https%3A%2F%2Fwww.google.com%2F&trigger=onLoad&gtmon=true&gaon=true&cookieon=true&ct_t=1574680164&ct_tid=1574680164&v=20191029&_=1574680164139

My goal is to extract glid values from both columns so that I would have 2 new columns Glid from cookie and Gclid from URL

What I have so far:

def get_glid_from_source(pattern, data):

    result = re.search(pattern, str(data))
    if result is not None:
        return result.group(1)
    return None
df['Gclid_from_url'] = df.apply(lambda x: get_glid_from_source('[gclid|click_id]=(.+?)&', x['_source.request_url']), axis=1)

df['Gclid_from_cookie'] = df.apply(lambda x: get_glid_from_source('gclid=(.+?);', x['_source.cookie']), axis=1)

I would need to edit the expression so that:

1. Gclid can only start with a letter a-z or A-Z

2. Gclid ends with one of the following - ;, %, & or the end of string

Now after filtering I get values wfrom the second column that have click_id=ZFGe... because the value I need can be either gclid=Value I need or gclid&click_id= Value I need

EDIT

I have a third column which looks like this:

_source.request_url
www.google.com/api/test...
www.google.com/go/test...
www.google.com/fire-start.php/test...
www.google.com/test...
www.google.com/api/test...

I am making new column in the pandas dataframe where I add TRUE or FALSE values depending if these conditions are met in the above column:

If the link has:

.com/fire-start.php or .com/go/ or .com/api/

New column would have value as FALSE if the patter is not found in the string 'TRUE' value is passed.

What I tried:

df['validate'] = df['_source.request_url'].str.extract(r'(www.google)=([a-zA-Z][^.com/fire-start.php|^.com/go/|^.com/api/]*)')

But that does not seem to work.

Thank you for your help, appreciate it.

2
  • df['Gclid_from_url'] = df['_source.request_url'].str.extract(r'(?:gclid|click_id)=([a-zA-Z][^&#]*)') and df['Gclid_from_cookie'] = df['_source.cookie'].str.extract(r'gclid=([a-zA-Z][^&#]*)')? Commented Nov 28, 2019 at 10:25
  • The first one works well, thank you. The second one Gclid from cookie stars good but does not stop at ;. Commented Nov 28, 2019 at 10:30

1 Answer 1

2

You may use

df['Gclid_from_url'] = df['_source.request_url'].str.extract(r'(?:gclid|click_id)=([a-zA-Z][^&#]*)')

See the regex demo

The Gclid_from_cookie can be populated using

df['Gclid_from_cookie'] = df['_source.cookie'].str.extract(r'gclid=([a-zA-Z][^&#;%]*)')

See this regex demo

Note that [gclid|click_id] matches any 1 char defined in the character set, a g, c, l, i, d, |, k or _, not a sequence of chars, hence the non-capturing group (?:...) in my pattern.

The value pattern is [a-zA-Z][^&#]* or [a-zA-Z][^&#;%]* that is quite self-explanatory: an ASCII letter and 0 or more chars other than &, #, ;, %.

As far as the updated part of the question is concerned, you need to understand that a negated character class matches a single char, not a sequence of chars, you can't "say" [^not] to match any text but not, [^not] matches any char but n, o and t.

Add

import re
filters = ['.com/fire-start.php', '.com/go/', '.com/api/'] 
df['validate']=df['_source.request_url'].str.contains("|".join(map(re.escape,filters)))
Sign up to request clarification or add additional context in comments.

7 Comments

Thank you sir. So the first brackets set rules to what can be in the string - [a-zA-Z], the second one -[^&#;%] states what is not accepted and the * sign at the end means to stop after finding any charecters in the second brackets - [^&#;%]?
@JonasPalačionis Right. [^...] is a negated character class, it matches any char but the ones defined in the class. * is a quantifier matching 0 or more occurrences of the quantified subpattern.
Thank you again, I have added an edit to my question where I would need to check if the string has one of 3 patterns and add true or false values, I have followed the logic you've added in your answer but that does not seem to work, do you know where I have made a mistake?
@JonasPalačionis Wrong, it is not the same logic. That is some different question already. Add import re, then - if you have something like filters = ['.com/fire-start.php', '.com/go/', '.com/api/'] with the strings to check against - use df['validate']=df['_source.request_url'].str.contains("|".join(map(re.escape,filters)))
Thank you again, helped me a lot. Do you mind explain what does the pipe symbol indicate before .join method and what does re.escape do?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.