1

I have a bit of a strange one here. I have a large chunk of text which may or may not contain links to images.

Let's say it does. I have a pattern which extracts the image URL fine; however, once a match is found, it is replaced with an <img> element with the link as the src. The problem is that there may be multiple matches within the text, and this is where it gets tricky: the URL pattern now also matches the URL inside the src attribute, which sends the code into an infinite loop.

So is there a way to match in regex ONLY if the match is not preceded by a pattern like ="|=' ? Then it would match the URL in something like:

some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6

but not

some image <img src="http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6">

I am not sure if it is possible, but if it is, could someone point me in the right direction? A replace by itself will not suffice in this scenario, as the matched URL needs to be used elsewhere too, so it needs to be available as a capture.

The main scenarios I need to account for are:

  • Many links in one block of varied text
  • A single link without any other text
  • A single link with other varied text

== edit ==

Here is the current regex I am using to match urls:

(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))

== edit 2 ==

Just so everyone understands why I cannot use the /g flag, here is an answer which explains the issue. If I could use /g like I originally tried, it would make things a lot simpler.

Javascript regex multiple captures again

4
  • 2
    Have you tried using the /g flag, which should do a single global replace, rather than having to loop through until a match is "not found"? Commented Sep 27, 2013 at 9:37
  • In JavaScript it doesn't seem to work; there is some problem with multiple captures and exec, so you need to loop round until no matches remain. I read something about JS not supporting captures or multiple matches in a single result, although if you can prove the above in a jsfiddle or something I will happily give you the answer, as I could never get it to work. Commented Sep 27, 2013 at 9:40
  • Why is there a downvote on the question? This is a well-defined question given the constraints and the scenario. Commented Sep 27, 2013 at 9:52
  • 1
    Try this jQuery-based jsfiddle... although it does highlight that the query string part of the URL isn't taken into account. If you want vanilla JS, try this jsfiddle. Commented Sep 27, 2013 at 9:57

4 Answers

3

What you are looking for is a negative lookbehind, but JavaScript did not support lookbehind of any kind when this was written (it was added later, in ES2018), so you will either have to use a callback function to check what was matched and make sure it is not preceded by a ' or ", or you can use the following regex:

(?:^|[^"'])(\b(https?|ftp|file):\/\/[-a-zA-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))

which has a single problem: on a successful match, the overall match includes one extra character, the one right before the (\b(https?|ftp|file) pattern in the input. Capture group 1 still holds only the URL, so I think you can deal with this easily.
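As a minimal sketch of dealing with that extra character, the regex above can be driven with exec in a loop, reading the URL from capture group 1 and deriving its offset from the overall match. The g flag and the helper name findImageUrls are assumptions added for illustration, not part of the original pattern.

```javascript
// Hypothetical helper: collects image URLs that are not preceded by a quote.
function findImageUrls(text) {
  // Same pattern as above, plus the g flag so exec() advances through the text.
  var re = /(?:^|[^"'])(\b(https?|ftp|file):\/\/[-a-zA-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/g;
  var results = [];
  var match;
  while ((match = re.exec(text)) !== null) {
    var url = match[1];
    // match[0] may begin with one extra character; skip it to get the URL's offset.
    var start = match.index + (match[0].length - url.length);
    results.push({ url: url, start: start });
  }
  return results;
}
```

Because the loop reads the original text and never modifies it, a URL that already sits inside src="..." is skipped rather than matched again.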

Regex101 Demo


1 Comment

This seems to work and addresses the question's context slightly better; the other answers, while very useful, are less about tackling the pattern at the start and more about changing tack to get the replace to work in one go.
1

Using the /ig flags at the end should work... the g is for global replace and the i is for case-insensitivity, which is necessary as you've only got A-Z instead of a-zA-Z.

Using the following vanilla JS appears to work for me (see jsfiddle)...

var test="some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6";
var re = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig;
document.getElementById("output").innerHTML = test.replace(re,"<img src=\"$1\"/>");

Although, what it does highlight is that the query string part of the URL (the ?v=6) is not being picked up by your regex.
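One possible way to pick up the query string is sketched below. It assumes a query string is a ? followed by characters from the same URL character class; the optional trailing group and the variable names are additions for illustration, not part of the original pattern.

```javascript
// Assumption: an optional "?..." tail built from the same URL characters may follow the extension.
var re = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp)(?:\?[-A-Z0-9+&@#\/%?=~_|!:,.;]*)?)/ig;
var input = "some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 more text";
// $1 now includes the query string as well as the path.
var html = input.replace(re, "<img src=\"$1\"/>");
```

Note that the character class contains punctuation such as . and , so a URL followed directly by a full stop would likely swallow it; that trade-off is inherited from the original class.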

For jQuery, it would be (see jsfiddle)...

$(document).ready(function(){
  var test="some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 some image http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6";
  var re = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig;
  $("#output").html(test.replace(re,"<img src=\"$1\"/>"));
});

Update

Just in case my use of the same image URL throughout the example doesn't convince you, it also works with different URLs... see this jsfiddle update

var test="http://cdn.sstatic.net/stackoverflow/img/sprites.png?v=6 http://cdn.sstatic.net/serverfault/img/sprites.png?v=7";
var re = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig;
document.getElementById("output").innerHTML = test.replace(re,"<img src=\"$1\"/>");

4 Comments

Interesting. Although the replace works, how do you actually access the underlying match so you can make use of the capture data when doing it this way?
That's a good question @Grofit, and I'm sorry but I'm simply not aware of how you'd do that. The replace is based on simple pattern matching... if you need to do explicit processing on each individual match then I believe (but am happy to be proved wrong) that you would have to do individual matches. If I'm right, I think there is a way to call an external function, but I've never done it and cannot give any advice in that direction... sorry!
That is fine buddy. If the question was simply about doing the replace then you would get the answer, given JavaScript's limitations. However, as the match still needs to be used outside of the replace, I have given the answer to the other chap, but upvoted, as I'm sure this would be the more applicable answer for most people doing something similar.
@Grofit, not a problem fella, but it wasn't clear from your OP that you needed the ability to do extra processing on those matches. Good luck with the rest of your project :-)
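The "external function" mentioned in these comments does exist: String.prototype.replace accepts a function as its second argument and calls it once per match, passing the full match followed by the capture groups. A minimal sketch, in which the helper name replaceAndCollect and the returned object shape are assumptions made for illustration:

```javascript
// Hypothetical helper: replaces each image URL with an <img> tag and
// also returns the matched URLs so they can be used elsewhere.
function replaceAndCollect(text) {
  var re = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig;
  var urls = [];
  var html = text.replace(re, function (whole, url) {
    // "whole" is the full match, "url" is capture group 1.
    urls.push(url);
    return '<img src="' + url + '"/>';
  });
  return { html: html, urls: urls };
}
```

Since replace makes a single pass over the original string, the generated src attributes are never rescanned, so the infinite loop described in the question should not arise.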
0

Couldn't you just check for whitespace in front of the URL instead of that word boundary? It seems to work, although you will have to remove the matched whitespace later.

(\s(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))

http://rubular.com/r/9wSc0HNWas

Edit: Damn, too slow :) I'll still leave this here as my regex is shorter ;)

3 Comments

What if the text was just a link, with no whitespace before it? In that case it would not work :(
That's true, I did not know you expected something like this... Would you expect something like: here is some texthttp://.... ?
Nah, that is not too much of a worry, as it's a rare case and too hard to test for. It was mainly the case of a link being posted as the sole content which I wanted to point out, but you are right that it was not specifically mentioned in the question.
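The sole-link case raised in these comments could be covered by accepting the start of the string as an alternative to whitespace. The sketch below is not part of the original answer: the (^|\s) group and the helper name linkify are assumptions, and the URL moves to capture group 2 as a result.

```javascript
// Assumption: "(^|\s)" replaces the leading "\s" so a link at the very start also matches.
var re = /(^|\s)((https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig;

// Hypothetical helper: $1 puts the matched whitespace (or nothing) back, $2 is the URL.
function linkify(text) {
  return text.replace(re, '$1<img src="$2"/>');
}
```

Writing $1 back into the replacement keeps the original spacing, so the matched whitespace does not need to be removed afterwards.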
0

As freefaller said, you might use the /g flag to find all matches in one go, if exec is not a must.

Otherwise, you can add (="|=')? to the beginning of your regex and check whether $1 is undefined. If it is undefined, the match did not start with a ="|=' pattern.
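A minimal sketch of this idea using exec in a loop, where match[1] plays the role of $1. The helper name bareImageUrls is hypothetical, and the g flag is assumed so that exec advances through the text.

```javascript
// Hypothetical helper: returns only the URLs that are not preceded by =" or ='.
function bareImageUrls(text) {
  // Group 1 is the optional =" or =' prefix; group 2 is the URL.
  var re = /(="|=')?(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*(?:png|jpeg|jpg|gif|bmp))/ig;
  var urls = [];
  var match;
  while ((match = re.exec(text)) !== null) {
    // An undefined group 1 means the URL was not preceded by =" or ='.
    if (match[1] === undefined) {
      urls.push(match[2]);
    }
  }
  return urls;
}
```

The optional group does not stop the regex from matching URLs inside attributes; it only labels them, so the filtering happens in the loop rather than in the pattern.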

2 Comments

The reason I cannot use /g is explained in this answer: stackoverflow.com/questions/14707360/…
My answer works even if exec is a must, but you could just use match or replace.
