I want to extract some data from URLs that have the following format:

http://www.example.com/biglasses/pr?p[]=ets.ideal_for%255B%255D%3Ds&p[]=ets.ideal_for%255B%255D%3Dn&p[]=sort%3Dpopularity&sid=23426x&offer=bigglassesMin30_RipoP.&ref=8be2b7f4-521c-4c45-9021-33d1df588eb9&mycracker=ch_vn_men_sungla_promowidget_banner_0_image

http://www.example.com/cooks/cooking-dress-wine/~no-order/pr?p%5B%5D=sort%3Dfeatured&sid=bks%2C43p&mycracker=ch_vn_clothing_subcategory_Puma&ref=b41c8097-8efe-4acf-8919-0fa81bcb590a

http://www.example.com/biglasses/pr?p[]=ets.ideal_for%255B%255D%3Ds&p[]=ets.ideal_for%255B%255D%3Dn&p[]=sort%3Dpopularity&sid=23426x&ref=8be2b7f4-521c-4c45-9021-33d1df588eb9&mycracker=ch_vn_men_sungla_promowidget_banner_0_image&offer=bigglassesMin30_RipoP.

Basically, I want to get rid of &mycracker and its value, &ref and its value, and the domain part, i.e. http://www.example.com.

As can be seen, the useful part of the URL is interspersed with the parts I want to drop, namely &mycracker and &ref and their values.

This is what I am trying:

var mapObj = {"/^(http:\/\/)?.*?\//":"","(&mycracker.+)":"","(&ref.+)":""};
var re = new RegExp(Object.keys(mapObj).join("|"),"gi");
url = url.replace(re, function(matched){
    return mapObj[matched];
});

The idea was to replace all the matching parts with an empty string in one pass, but it's not working.

I understand I need to remove those parts selectively, without making any assumptions about the order in which they appear, but how should I go about it?

Thanks

4
  • Do you want to keep the values of &ref and &mycracker in the URL, or remove those as well? Commented Jan 29, 2014 at 18:08
  • I want to remove both &mycracker and &ref together with their values. Sorry for the lack of clarity; I have edited the question. Commented Jan 29, 2014 at 18:10
  • Great. Will &ref and &mycracker always be at the end of the URL and next to each other, or are their positions likely to change? Commented Jan 29, 2014 at 18:11
  • No, their positions are not fixed, and a part of the URL (e.g. &offer and its value) can come after both of them. I need that part for further processing. Commented Jan 29, 2014 at 18:14

3 Answers

2

The easiest way would be to replace them with an empty string, leaving just the bits you want.

inputStr.replace(/^https?:\/\/[^\/]+\/|&?(mycracker|ref)=[^&]*/g, '')

Here is a JSFiddle: http://jsfiddle.net/4L6BH/1/

The regex is pretty straightforward. There are essentially two parts joined by an alternation: ^https?:\/\/[^\/]+\/ and &?(mycracker|ref)=[^&]*

The first part matches any domain (with any sub-domains). If you are only dealing with one domain, you could hard-code that domain instead (but that would also reduce flexibility). It also accepts both the http and https protocols (hence the s?).

The second part matches the parameters that we don't care about and scraps them. Since one of them may be the first parameter (and thus not be preceded by an &), the & is optional. Then come the names of the parameters we want to remove, separated by a |. Finally, we scoop up the value, which is everything up to the next & or the end of the string.

The last special bit: we add the g flag to make sure it replaces all matches (without it, only the first match, which would be the domain, is replaced).

We just grab those bits, replace them with an empty string, and voilà.
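
As a quick check, the pattern above can be run against the second and third sample URLs from the question. This is only a minimal sketch: stripTracking and TRACKING are hypothetical names chosen for illustration, not part of the original answer.

```javascript
// Pattern copied from the answer; stripTracking is a hypothetical helper name.
const TRACKING = /^https?:\/\/[^\/]+\/|&?(mycracker|ref)=[^&]*/g;

function stripTracking(inputStr) {
  // replace() with a global regex scans the whole string from index 0 on every call,
  // so sharing one regex object between calls is safe here.
  return inputStr.replace(TRACKING, '');
}

// Second and third sample URLs from the question.
const urls = [
  'http://www.example.com/cooks/cooking-dress-wine/~no-order/pr?p%5B%5D=sort%3Dfeatured&sid=bks%2C43p&mycracker=ch_vn_clothing_subcategory_Puma&ref=b41c8097-8efe-4acf-8919-0fa81bcb590a',
  'http://www.example.com/biglasses/pr?p[]=ets.ideal_for%255B%255D%3Ds&p[]=ets.ideal_for%255B%255D%3Dn&p[]=sort%3Dpopularity&sid=23426x&ref=8be2b7f4-521c-4c45-9021-33d1df588eb9&mycracker=ch_vn_men_sungla_promowidget_banner_0_image&offer=bigglassesMin30_RipoP.'
];

const cleaned = urls.map((u) => stripTracking(u));
console.log(cleaned[0]);
// cooks/cooking-dress-wine/~no-order/pr?p%5B%5D=sort%3Dfeatured&sid=bks%2C43p
console.log(cleaned[1]);
// biglasses/pr?p[]=ets.ideal_for%255B%255D%3Ds&p[]=ets.ideal_for%255B%255D%3Dn&p[]=sort%3Dpopularity&sid=23426x&offer=bigglassesMin30_RipoP.
```

Note that the pattern is case-sensitive: it matches mycracker as it appears in the sample URLs. If the parameter can also show up as myCracker, adding the i flag would be one way to cover that.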


4 Comments

Good. Would you like to elaborate on the regex? I was also looking for something like this.
I think I would like to retain the last / in the domain name. I can use something like inputStr.replace(/^https?:\/\/[^\/]+|&?(mycracker|ref)=[^&]*/g, '')
Added some explanations.
For your comment, yes, if you want to retain that slash, just drop that \/ like you showed.
1

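JavaScript's string.replace passes the text that was actually matched to the callback in the matched parameter. The code seems to expect it to be the regular expression text that was used as a key in mapObj, so the lookup finds nothing. Since every replacement is an empty string anyway, perhaps it should just be url.replace(re,'')

The first regex also shouldn't start or end with a "/", because the string passed to new RegExp is the pattern source, not a regex literal.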
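
A minimal sketch of both points, under some assumptions: the sample URL is made up for illustration, and the two parameter patterns are narrowed to [^&]* because the asker's original .+ is greedy and would swallow everything up to the end of the string, including &offer.

```javascript
// Keys are regex *source* strings: no surrounding slashes, backslashes doubled for the JS string.
// Assumption: parameter patterns use [^&]* instead of the asker's original .+
const mapObj = {
  '^(http:\\/\\/)?.*?\\/': '',
  '&mycracker=[^&]*': '',
  '&ref=[^&]*': ''
};
const re = new RegExp(Object.keys(mapObj).join('|'), 'gi');

// Hypothetical sample URL, shortened for readability.
const url = 'http://www.example.com/pr?sid=1&ref=abc&offer=x';

// Lookup by matched text: no key equals 'http://www.example.com/' or '&ref=abc',
// so the callback returns undefined, which replace() inserts as the text "undefined".
const broken = url.replace(re, (matched) => mapObj[matched]);

// Every replacement is the empty string anyway, so no lookup is needed.
const fixed = url.replace(re, '');

console.log(broken); // undefinedpr?sid=1undefined&offer=x
console.log(fixed);  // pr?sid=1&offer=x
```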

2 Comments

Which one are you talking about, this one? "/^(http:\/\/)?.*?\//"
Yes, that should be "^(http:\/\/)?.*?\/"
1

I would go with @samanime, but make a slight change.

Find: /^https?:\/\/[^\/]+|(?:(\?)|&)(?:mycracker|ref)=[^&]*/g Replace: capture group 1 (written '$1' in a JavaScript replacement string, \1 in some other regex tools)

    ^ https?:// [^/]+      
 |       
    (?:     
         ( \? )               # (1)     
      |  &     
    )     
    (?: mycracker | ref )     
    = [^&]*      

Edit:
I don't know which parameters these URLs can contain, so this is just a parsing note.
Stripping out the variables could be done as shown below.
I could be way off here, but if the ? is used as the separator between the path and the parameter list,
then a couple of extra conditions might apply to keep the URL well formed.
The replacement is still capture group 1 every time.

     #  /^https?:\/\/[^\/]+|(?:(\?)(?:mycracker|ref)=[^&]*&)|(?:\?(?:mycracker|ref)=[^&]*$)|(?:&(?:mycracker|ref)=[^&]*)/g

     # Domain
     ^ https?:// [^/]+ 
  |  
     # (?)var=&
     (?:
          ( \? )               # (1)
          (?: mycracker | ref )
          = [^&]*      
          &                    # &
     )
  |  
     # ?var=(EOS)
     (?:
          \?
          (?: mycracker | ref )
          = [^&]*      
          $                    # EOS
     )
  |  
     # &var=
     (?:
          &     
          (?: mycracker | ref )
          = [^&]*      
     )
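
For reference, here is a hedged JavaScript sketch of both patterns from this answer, with the replacement written as '$1'. The names shortRe, longRe and strip, and the short sample URLs, are made up for illustration. In JavaScript a capture group that did not take part in the match is substituted as an empty string, which is why the same replacement works for every alternative.

```javascript
// Short form: keeps a leading ? via capture group 1.
const shortRe = /^https?:\/\/[^\/]+|(?:(\?)|&)(?:mycracker|ref)=[^&]*/g;

// Extended form: also consumes the & that follows a leading tracked parameter,
// and drops the ? entirely when the tracked parameter is the only one.
const longRe = /^https?:\/\/[^\/]+|(?:(\?)(?:mycracker|ref)=[^&]*&)|(?:\?(?:mycracker|ref)=[^&]*$)|(?:&(?:mycracker|ref)=[^&]*)/g;

// strip is a hypothetical helper; '$1' is empty whenever group 1 did not participate.
const strip = (url, re) => url.replace(re, '$1');

console.log(strip('http://www.example.com/pr?sid=1&ref=abc&offer=x', shortRe)); // /pr?sid=1&offer=x
console.log(strip('/pr?ref=abc&sid=1', shortRe));         // /pr?&sid=1
console.log(strip('/pr?ref=abc&sid=1', longRe));          // /pr?sid=1
console.log(strip('/pr?ref=abc', longRe));                // /pr
console.log(strip('/pr?sid=1&thismycracker=C', longRe));  // /pr?sid=1&thismycracker=C
```

The second and third lines show the difference between the two forms: the short one leaves a dangling ?& when the removed parameter comes first, while the extended one does not.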

2 Comments

A small change to what effect?
I don't know the exact URL parameter format. But if a ? marks the start of the parameter list, then this change matches it in lieu of the &, which appears to be the parameter separator. It also stops the pattern from matching inside a longer name such as &thismycracker=. Basically, it leaves the ? in place when the removed parameter comes right after it. So it properly handles this: '/p?mycracker=A&mycracker=B&thismycracker=C'
