How to extract a part of the url using JavaScript and Regex

Question

I want to extract some data from URLs that have the following format:

http://www.example.com/biglasses/pr?p[]=ets.ideal_for%255B%255D%3Ds&p[]=ets.ideal_for%255B%255D%3Dn&p[]=sort%3Dpopularity&sid=23426x&offer=bigglassesMin30_RipoP.&ref=8be2b7f4-521c-4c45-9021-33d1df588eb9&mycracker=ch_vn_men_sungla_promowidget_banner_0_image

http://www.example.com/cooks/cooking-dress-wine/~no-order/pr?p%5B%5D=sort%3Dfeatured&sid=bks%2C43p&mycracker=ch_vn_clothing_subcategory_Puma&ref=b41c8097-8efe-4acf-8919-0fa81bcb590a

http://www.example.com/biglasses/pr?p[]=ets.ideal_for%255B%255D%3Ds&p[]=ets.ideal_for%255B%255D%3Dn&p[]=sort%3Dpopularity&sid=23426x&ref=8be2b7f4-521c-4c45-9021-33d1df588eb9&mycracker=ch_vn_men_sungla_promowidget_banner_0_image&offer=bigglassesMin30_RipoP.

Basically, I want to get rid of &mycracker and its value, &ref and its value, and the domain name part, i.e. http://www.example.com.

As can be seen, the useful part of the URL is interspersed with these parameters, namely &mycracker and its value and &ref and its value.

This is what I am trying:

var mapObj = {"/^(http:\/\/)?.*?\//":"","(&mycracker.+)":"","(&ref.+)":""};
var re = new RegExp(Object.keys(mapObj).join("|"),"gi");
url = url.replace(re, function(matched){
    return mapObj[matched];
});

The idea is to replace all the matching parts at once with an empty string, but it's not working.

I understand that I need to selectively remove those parts of the URL without making any assumptions about their order of appearance, but how should I go about it?

Thanks

Do you want to keep the values of &ref and &mycracker in the URL, or remove those as well?
– pizzarob, Commented Jan 29, 2014 at 18:08
I want to remove &mycracker and its value, and &ref and its value. Sorry for the lack of clarity; I have edited my question.
– John Doe, Commented Jan 29, 2014 at 18:10
Great. Will &ref and &mycracker always be at the end of the URL and next to each other, or are their positions likely to change?
– pizzarob, Commented Jan 29, 2014 at 18:11
No, their positions are not fixed, and it's possible that a part of the URL (i.e. &offer and its value) can come after both of them. I need that part for further processing.
– John Doe, Commented Jan 29, 2014 at 18:14

samanime · Accepted Answer · 2014-01-31 18:56:42Z

2

The easiest way would be to replace them with an empty string, leaving just the bits you want.

inputStr.replace(/^https?:\/\/[^\/]+\/|&?(mycracker|ref)=[^&]*/g, '')

Here is a JSFiddle: http://jsfiddle.net/4L6BH/1/

The regex is pretty straightforward. There are essentially two parts joined by an alternation: ^https?:\/\/[^\/]+\/ and &?(mycracker|ref)=[^&]*

The first part matches any domain (with any sub-domains). If you are only using one domain, you could narrow it to just that domain (but that would also reduce flexibility). It also accepts both the http and https protocols (hence the s?).

The second part matches the parameters that we don't care about and scraps them. Since they may be at the beginning of the query string (and thus not have an &), the & is optional. Then come the parameter names we want to remove, separated by a |. Finally, we scoop up the value, which is everything up to the next & or the end of the string.

As a last special bit, we add the g flag to make sure it replaces all instances (without it, only the first match, which would be the domain, gets replaced).

We just grab those bits, replace them with an empty string, and voilà.
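
As a quick check of the pattern above, here is a minimal sketch that applies it to a URL shortened from the first example in the question. The helper name stripUrl and the trimmed parameter values are assumptions made for illustration only.

```javascript
// Pattern from the answer: the first branch removes the scheme, host and first slash;
// the second removes any mycracker or ref parameter up to the next & or the end.
const STRIP_PATTERN = /^https?:\/\/[^\/]+\/|&?(mycracker|ref)=[^&]*/g;

// Hypothetical helper name, not part of the original answer.
function stripUrl(inputStr) {
  return inputStr.replace(STRIP_PATTERN, '');
}

// Shortened from the question's first URL; the parameter values are trimmed.
const sampleUrl =
  'http://www.example.com/biglasses/pr?sid=23426x&offer=bigglassesMin30_RipoP.' +
  '&ref=8be2b7f4&mycracker=ch_vn_men';

const strippedUrl = stripUrl(sampleUrl);
console.log(strippedUrl);
// → biglasses/pr?sid=23426x&offer=bigglassesMin30_RipoP.
```

The same call gives an equivalent result when &offer comes after &ref and &mycracker, since each unwanted parameter is matched on its own wherever it appears.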

edited Jan 31, 2014 at 18:56

answered Jan 29, 2014 at 18:26

samanime

4 Comments

John Doe Over a year ago

Good. Would you like to elaborate on the regex? I was also looking for something like this.

John Doe Over a year ago

I think I would like to retain the last / in the domain name. I can use something like inputStr.replace(/^https?:\/\/[^\/]+|&?(mycracker|ref)=[^&]*/g, '')

samanime Over a year ago

Added some explanations.

samanime Over a year ago

For your comment, yes, if you want to retain that slash, just drop that \/ like you showed.

maddoxej · Accepted Answer · 2014-01-29 18:22:43Z

1

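The JavaScript string.replace function passes the text that was matched to the callback as the matched parameter. The code seems to expect the regular expression source that was used as a key in mapObj instead, so the lookup finds nothing. Perhaps it should just be url.replace(re,'').

The first regex shouldn't start or end with a "/". When the pattern is passed to new RegExp as a string, those slashes are treated as literal characters rather than delimiters.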
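
To illustrate both points, here is a minimal sketch that records what the replace callback actually receives and then removes the matches directly. The sample URL is shortened from the question, and using [^&]* in place of the question's .+ is an assumption added here so that a match stops at the next parameter.

```javascript
// Shortened sample; the parameter values are placeholders.
const url = 'http://www.example.com/cooks/pr?sid=bks&ref=b41c8097&offer=x';

// Pattern sources as plain strings, without surrounding slashes.
// Assumption: [^&]* stands in for the question's .+ so later parameters survive.
const sources = ['^(http:\\/\\/)?.*?\\/', '&mycracker=[^&]*', '&ref=[^&]*'];
const re = new RegExp(sources.join('|'), 'gi');

// Record what the callback is given: the matched substrings, not the pattern sources.
const received = [];
url.replace(re, (matched) => {
  received.push(matched);
  return matched;
});
// received now holds 'http://www.example.com/' and '&ref=b41c8097'.

// Neither substring is a key in a map keyed by pattern source,
// so the simplest fix is to replace every match with an empty string.
const cleaned = url.replace(re, '');
console.log(cleaned);
// → cooks/pr?sid=bks&offer=x
```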

answered Jan 29, 2014 at 18:22

maddoxej

2 Comments

John Doe Over a year ago

Which one are you talking about? This one: "/^(http:\/\/)?.*?\//"

maddoxej Over a year ago

Yes, that should be ^(http:\/\/)?.*?\/

user557597 · Accepted Answer · 2014-01-29 19:57:45Z

1

I would go with @samanime's answer, but make a slight change.

Find: /^https?:\/\/[^\/]+|(?:(\?)|&)(?:mycracker|ref)=[^&]*/g
Replace: '$1' (capture group 1; written '\1' in some other regex flavors)

    ^ https?:// [^/]+      
 |       
    (?:     
         ( \? )               # (1)     
      |  &     
    )     
    (?: mycracker | ref )     
    = [^&]*
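
In JavaScript, the variant above could be applied as in the following sketch. The helper name stripKeepingQuestionMark and the sample inputs are hypothetical. The replacement is written '$1', which is how String.prototype.replace refers to capture group 1; when that group did not take part in a match, it is substituted with an empty string.

```javascript
// Variant from this answer: a stripped parameter must be preceded by ? or &,
// and a leading ? is captured in group 1 so it can be put back.
const STRIP_KEEP_QMARK = /^https?:\/\/[^\/]+|(?:(\?)|&)(?:mycracker|ref)=[^&]*/g;

// Hypothetical helper name used only for this illustration.
function stripKeepingQuestionMark(url) {
  return url.replace(STRIP_KEEP_QMARK, '$1');
}

const fromComment = stripKeepingQuestionMark('/p?mycracker=A&mycracker=B&thismycracker=C');
console.log(fromComment);
// → /p?&thismycracker=C

const fromQuestion = stripKeepingQuestionMark(
  'http://www.example.com/biglasses/pr?sid=23426x&ref=8be2b7f4&offer=x'
);
console.log(fromQuestion);
// → /biglasses/pr?sid=23426x&offer=x
```

In the first result the thismycracker parameter is left untouched and the ? is kept, although a stray & remains directly after it.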

Edit:
I don't know the exact parameter format of these URLs, but as a parsing note, stripping out the variables could be done as shown below.
I could be way off here, but if the ? is used as the separator between the path and the parameter list, a couple of extra conditions might apply in order to keep it in place.
You still need to replace with capture group 1 every time.

     #  /^https?:\/\/[^\/]+|(?:(\?)(?:mycracker|ref)=[^&]*&)|(?:\?(?:mycracker|ref)=[^&]*$)|(?:&(?:mycracker|ref)=[^&]*)/g

     # Domain
     ^ https?:// [^/]+ 
  |  
     # (?)var=&
     (?:
          ( \? )               # (1)
          (?: mycracker | ref )
          = [^&]*      
          &                    # &
     )
  |  
     # ?var=(EOS)
     (?:
          \?
          (?: mycracker | ref )
          = [^&]*      
          $                    # EOS
     )
  |  
     # &var=
     (?:
          &     
          (?: mycracker | ref )
          = [^&]*      
     )
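
A minimal JavaScript sketch of the extended pattern follows, again replacing with '$1' so that the ? survives when a stripped parameter sits first in the query. The helper name stripParams and the sample inputs are hypothetical.

```javascript
// Extended pattern from the edit: domain, ?var=...&, ?var=... at end of string, &var=...
const STRIP_EXTENDED =
  /^https?:\/\/[^\/]+|(?:(\?)(?:mycracker|ref)=[^&]*&)|(?:\?(?:mycracker|ref)=[^&]*$)|(?:&(?:mycracker|ref)=[^&]*)/g;

// Hypothetical helper name used only for this illustration.
function stripParams(url) {
  return url.replace(STRIP_EXTENDED, '$1');
}

console.log(stripParams('/p?ref=1&sid=2'));             // → /p?sid=2
console.log(stripParams('/p?ref=1'));                   // → /p
console.log(stripParams('/p?sid=2&ref=1'));             // → /p?sid=2
console.log(stripParams('/p?ref=1&mycracker=2&sid=3')); // → /p?mycracker=2&sid=3
```

The last call shows a caveat: the ?var=...& branch consumes the trailing &, so a second unwanted parameter that immediately follows the first one at the start of the query is not matched in the same pass. Running the replacement a second time on that result yields /p?sid=3.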

edited Jan 29, 2014 at 19:57

answered Jan 29, 2014 at 18:55

user557597

2 Comments

John Doe Over a year ago

A small change to what effect?

user557597 Over a year ago

I don't know the URL parameter format. But if a ? marks the start of the variables, then this change matches it in place of the &, which appears to be the variable separator. It also stops the pattern from matching inside a longer name such as &thismycracker=. Basically, it leaves the ? in when it is in this position. So it properly handles this: '/p?mycracker=A&mycracker=B&thismycracker=C'
