0

I'm trying to write a regex that matches a URL, e.g. 'http://www.test.com', so that I can wrap it in anchor tags. That part already works with the following:

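// r is not shown in the original post; RegexOptions.IgnoreCase is assumed because the character classes list only A-Z
string regex = @"(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])";
Regex r = new Regex( regex, RegexOptions.IgnoreCase );
msg = r.Replace( msg, "<a target=\"_blank\" href=\"$0\">$0</a>" );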

However, when the input text contains image tags, it incorrectly puts the anchor tags inside the image tag's src attribute, e.g.

<img src="<a>...</a>" />;

So far I've tried the following to avoid that, but it doesn't work:

regex = @"(?!(src=""))(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])"

EDIT:

(example testing input):

<p>
    www.test1.com<br />
    <br />
    http://www.test2.com<br />
    <br />
    https://www.test3.com<br />
    <br />
    &quot;https://www.test4.com<br />
    <br />
    &#39;https://www.test4.com<br />
    <br />
    =&quot;https://www.test4.com</p>
<p>
    &nbsp;</p>
<p>
    <img alt="" src="..." style="width: 500px; height: 375px;" /></p>

(example output):

<p>
    <a target="_blank" href="www.test1.com">www.test1.com</a><br />
    <br />
    <a target="_blank" href="http://www.test2.com">http://www.test2.com</a><br />
    <br />
    <a target="_blank" href="https://www.test3.com">https://www.test3.com</a><br />
    <br />
    &quot;<a target="_blank" href="https://www.test4.com">https://www.test4.com</a><br />
    <br />
    &#39;<a target="_blank" href="https://www.test4.com">https://www.test4.com</a><br />
    <br />
    =&quot;<a target="_blank" href="https://www.test4.com">https://www.test4.com</a></p>
<p>
    &nbsp;</p>
<p>
    <img alt="" src="<a target="_blank" href="...">...</a>" style="width: 500px; height: 375px;" /></p>

(desired output):

<p>
    <a target="_blank" href="www.test1.com">www.test1.com</a><br />
    <br />
    <a target="_blank" href="http://www.test2.com">http://www.test2.com</a><br />
    <br />
    <a target="_blank" href="https://www.test3.com">https://www.test3.com</a><br />
    <br />
    &quot;<a target="_blank" href="https://www.test4.com">https://www.test4.com</a><br />
    <br />
    &#39;<a target="_blank" href="https://www.test4.com">https://www.test4.com</a><br />
    <br />
    =&quot;<a target="_blank" href="https://www.test4.com">https://www.test4.com</a></p>
<p>
    &nbsp;</p>
<p>
    <img alt="" src="..." style="width: 500px; height: 375px;" /></p>
2
  • It's unclear: I understand what you want in general, but not precisely. Can you please show a list of 5 correct inputs and 5 correct outputs, and then give 2 inputs that yield incorrect outputs? Commented May 15, 2012 at 8:54
  • I added the inputs and outputs I'm currently testing with. Commented May 15, 2012 at 9:00

2 Answers

1

In my opinion, processing HTML with regex is the wrong approach.

Putting that aside, just add this rule after a successful regex match:

// Needs System.Linq. A match with more than two '/' characters is treated as an invalid result.
bool isInvalid = regexResult.Count(c => c == '/') > 2;

You can add this rule to your regex pattern if it solves your problem.
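
To make the rule concrete, here is a minimal sketch of how it could be applied during the replace. It assumes the question's original pattern; the `SlashFilter` class, its `Linkify` method, and the `RegexOptions.IgnoreCase` flag are illustrative assumptions, not something shown in the thread. A `MatchEvaluator` leaves any match with more than two '/' characters untouched and wraps the rest:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the name and structure are assumptions for illustration.
public static class SlashFilter
{
    // The question's original URL pattern. IgnoreCase is assumed because
    // the character classes only list A-Z.
    private static readonly Regex UrlRegex = new Regex(
        @"(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)" +
        @"(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*" +
        @"(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])",
        RegexOptions.IgnoreCase);

    public static string Linkify(string html)
    {
        return UrlRegex.Replace(html, m =>
            // More than two '/' characters: leave the match as it is.
            m.Value.Count(c => c == '/') > 2
                ? m.Value
                : "<a target=\"_blank\" href=\"" + m.Value + "\">" + m.Value + "</a>");
    }
}
```

With this sketch, `SlashFilter.Linkify("http://www.test2.com<br />")` wraps the URL, while a match such as `http://example.com/a.png` (three slashes) is left alone. The trade-off is that any ordinary link with a path is skipped too, not only image sources.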

3 Comments

I agree regex is not a good way to deal with HTML, but it's part of the current working solution I have, which I just need to modify a bit. I'm not sure how that .Count() will help?
Since you're looking for URLs like http : / / www.somthing.ext but not http : / /www.somthing.ext/somthing.jpg , it will filter out results that have more than two slashes. It also limits you to root URLs only.
Actually, it does work with URLs that have any number of slashes. I've fixed this issue since I last posted; I'll post my solution below.
0

Here's the regex that solved the issue for me:

String regex = @"(?<!(""|'))((http|https|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])";

I used a negative lookbehind assertion to make sure that the URL isn't immediately preceded by an opening quote.
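
For anyone who wants to try it, here is a minimal sketch that plugs the pattern above into the replace call from the question. The `Linkifier` class, its `Linkify` method, and the `RegexOptions.IgnoreCase` flag are assumptions added for illustration (the character classes only list A-Z, so case-insensitive matching is assumed):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the name and structure are assumptions for illustration.
public static class Linkifier
{
    // The pattern from this answer: (?<!(""|')) rejects a match whose first
    // character is immediately preceded by a double or single quote.
    private static readonly Regex UrlRegex = new Regex(
        @"(?<!(""|'))((http|https|ftp|file):\/\/|www\.|ftp\.)" +
        @"(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*" +
        @"(?:\([-A-Z0-9+&@#\/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#\/%=~_|$])",
        RegexOptions.IgnoreCase);

    public static string Linkify(string html)
    {
        // $0 is the whole match, exactly as in the question's replacement string.
        return UrlRegex.Replace(html, "<a target=\"_blank\" href=\"$0\">$0</a>");
    }
}
```

Calling `Linkifier.Linkify` on `<img alt="" src="http://example.com/a.png" />` leaves the tag alone, while `&quot;https://www.test4.com` is still wrapped because the character before the URL is `;` rather than a literal quote. One caveat: the lookbehind only inspects the single character before the match start, so in a quoted `http://www.example.com/a.png` the engine can still begin a match at `www.`. Inputs like that may need a stricter guard.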
