3

I need to do a search and replace on long text strings. I want to find all instances of broken links that look like this:

<a href="http://any.url.here/%7BlocalLink:1369%7D%7C%7CThank%20you%20for%20registering">broken link</a>

and fix it so that it looks like this:

<a href="/{localLink:1369}" title="Thank you for registering">link</a>

There may be several of these broken links in one text field. My difficulty is working out how to reuse the matched ID (1369 in this case). The ID changes from link to link, as do the URL and the link text.

Thanks,

David

EDIT: To clarify, I am writing C# code to run through hundreds of long text fields and fix the broken links in them. Each text field contains HTML that can hold any number of broken links; the regex needs to find them all and replace each one with the correct version of the link.

3
  • Do you want to match the tag also, or do you just want to apply the regex to the contents of the href attribute? Commented Apr 30, 2009 at 9:23
  • I just want to split the incorrect href attribute in the first example so that it becomes the correct href and title attributes. I don't mind how that happens :) @tanascius - I'm coding this in C#. Commented Apr 30, 2009 at 9:59
  • I've corrected my regex, please try again. Commented Apr 30, 2009 at 14:30

4 Answers

2

Take this with a grain of salt; HTML and regex don't play well together:

(<a\s+[^>]*href=")[^"%]*%7B(localLink:\d+)%7D%7C%7C([^"]*)("[^>]*>[^<]*</a>)

When applied to your input and replaced with

$1/{$2}" title="$3$4

the following is produced:

<a href="/{localLink:1369}" title="Thank%20you%20for%20registering">broken link</a>

This is as close as it gets with a replacement string alone, because the title is still URL-encoded (note the %20). You'll need to use a MatchEvaluator delegate to remove the URL encoding as part of the replacement.
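As a minimal sketch of the MatchEvaluator approach, the lambda below reuses the pattern from this answer and decodes the title group before reassembling the tag. The class name `LinkFixer` and the method `Fix` are hypothetical, and `Uri.UnescapeDataString` is one assumed way to do the decoding; the original answer does not say which decoder to use.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkFixer  // hypothetical name, for illustration only
{
    // Pattern from the answer above: group 1 is the tag up to href=",
    // group 2 is localLink:ID, group 3 is the URL-encoded title,
    // group 4 is the closing quote through </a>.
    private static readonly Regex BrokenLink = new Regex(
        @"(<a\s+[^>]*href="")[^""%]*%7B(localLink:\d+)%7D%7C%7C([^""]*)(""[^>]*>[^<]*</a>)",
        RegexOptions.IgnoreCase);

    public static string Fix(string html)
    {
        // The evaluator is called once per match, so every broken link
        // in the text is rewritten with its own ID and title.
        return BrokenLink.Replace(html, m =>
            m.Groups[1].Value
            + "/{" + m.Groups[2].Value + "}\" title=\""
            + Uri.UnescapeDataString(m.Groups[3].Value)  // assumed decoder
            + m.Groups[4].Value);
    }
}
```

Links that are already correct contain no %7B, so this pattern leaves them untouched. If a decoded title could ever contain a double quote, it would need HTML-encoding before being written into the attribute.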


2 Comments

This is very close, thank you for helping. A few points: 1. The regex also matches correct links, which I don't want. 2. It replaces the broken links, but not quite right; it gives: <a href="url.still.here/%7BlocalLink:1369%7D" title="}||Thank you for registering">link</a>. I need to remove the url.still.here bit and the }|| in the title attribute. 3. The original source is URL-encoded, but I need the replaced text to use {localLink:1369} instead of %7BlocalLink:1369%7D. Can you help? Thanks, David
I've made a few changes to my regex; it should work now.
2

I'm assuming that you already have the element and the attributes parsed. So to process the URL, use something like this:

    // Needs: using System; using System.Text.RegularExpressions; using System.Web;
    string url = "http://any.url.here/%7BlocalLink:1369%7D%7C%7CThank%20you%20for%20registering";
    // Decode first so the pattern can match the literal { } and || characters.
    Match match = Regex.Match(HttpUtility.UrlDecode(url), @"^http://[^/]+/\{(?<local>[^:]+):(?<id>\d+)\}\|\|(?<title>.*)$");
    if (match.Success) {
        Console.WriteLine(match.Groups["local"].Value);  // localLink
        Console.WriteLine(match.Groups["id"].Value);     // 1369
        Console.WriteLine(match.Groups["title"].Value);  // Thank you for registering
    } else {
        Console.WriteLine("Not one of those URLs");
    }
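To go from the captured groups to the repaired attributes, something along these lines might work. It is only a sketch: `LocalLinkParser` and `TryFix` are hypothetical names, and `Uri.UnescapeDataString` is swapped in for `HttpUtility.UrlDecode` so the example does not depend on System.Web (unlike `UrlDecode`, it does not turn `+` into a space).

```csharp
using System;
using System.Text.RegularExpressions;

public static class LocalLinkParser  // hypothetical name, for illustration only
{
    // Returns true and fills in the repaired href and title when the URL
    // has the broken form; returns false for any other URL.
    public static bool TryFix(string url, out string href, out string title)
    {
        href = null;
        title = null;

        // Decode first so the pattern sees the literal { } and || characters.
        var match = Regex.Match(
            Uri.UnescapeDataString(url),
            @"^http://[^/]+/\{(?<local>[^:]+):(?<id>\d+)\}\|\|(?<title>.*)$");
        if (!match.Success) return false;

        href = "/{" + match.Groups["local"].Value + ":" + match.Groups["id"].Value + "}";
        title = match.Groups["title"].Value;
        return true;
    }
}
```

The caller can then write `href` and `title` back onto the already-parsed element, leaving URLs that do not match unchanged.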


2

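To include the whole match in the replacement string, use $&.

There are a number of other substitution markers that can be used in the replacement string, such as $1 for a numbered group and ${name} for a named group; see the .NET documentation on regular-expression substitutions for the full list.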
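A short sketch of how these markers behave with `Regex.Replace`; the class and method names are hypothetical, and the patterns are simplified to just the ID part of the link rather than the full anchor tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubstitutionDemo  // hypothetical name, for illustration only
{
    public static string WrapWholeMatch(string text)
    {
        // $& inserts the entire matched text, here wrapped in braces.
        return Regex.Replace(text, @"localLink:\d+", "{$&}");
    }

    public static string ReuseNamedGroup(string text)
    {
        // ${id} inserts whatever the named group "id" captured,
        // so each link keeps its own ID in the replacement.
        return Regex.Replace(text, @"localLink:(?<id>\d+)", "/{localLink:${id}}");
    }
}
```

Because the marker is resolved separately for each match, the same replacement string works for every link in the text, whatever its ID.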


1

Thanks to everyone for their help. Here is what I used in the end:

const string pattern = @"(<a\s+[^>""]*href="")[^""]+(localLink:\d+)(?:%7[DC])*([^""]+)(""[^>]*>[^<]*</a>)";
// Create a match evaluator to replace the matched links with the correct markup
var myEvaluator = new MatchEvaluator(FixLink);

var strNewText = Regex.Replace(strText, pattern, myEvaluator, RegexOptions.IgnoreCase);

internal static string FixLink(Match m)
{
    var strUrl = m.ToString();
    // Same pattern as above, re-applied to the single matched link.
    const string namedPattern = @"(<a\s+[^>""]*href="")[^""]+(localLink:\d+)(?:%7[DC])*([^""]+)(""[^>]*>[^<]*</a>)";
    var regex = new Regex(namedPattern, RegexOptions.IgnoreCase);

    // $1 = tag up to href=", $2 = localLink:ID,
    // $3 = title text (still URL-encoded), $4 = closing quote through </a>.
    const string strReplace = @"$1/{$2}"" title=""$3$4";

    var strFixed = regex.Replace(strUrl, strReplace);
    HttpContext.Current.Response.Write(String.Format("Replacing '{0}' with '{1}'", strUrl, strFixed));
    return strFixed;
}

1 Comment

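I think you have misunderstood how MatchEvaluator is meant to be used: the Match passed to FixLink already exposes the captured groups, so there is no need to run the regex a second time.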
