
I'm trying to let users mark sections of content in a CMS with additional "tags" that are then translated into formatting, for example bold, when the content is rendered on the page.

Something like {strong:Lorum ipsum dolar} where the text will then be wrapped with <strong>Lorum ipsum dolar</strong>.

I've tried to work out the regex for this, but I'm not good at it. I grabbed some HTML replacement scripts from various sites, but they are not very helpful; at least, I don't know what to change.

Any help would be appreciated.

Note: I'm doing this in C#.

  • What language do you wanna use to implement the regex? php? Commented May 10, 2012 at 13:30
  • This seems like it's going to present a whole world of problems with invalid tags. Commented May 10, 2012 at 13:31
  • $str = q({strong:Lorum ipsum dolar}); $str =~ m/\{(\w+):(.+?)\}/; $str = "<$1>$2</$1>"; awful solution, but works (perl) Commented May 10, 2012 at 13:42
  • @loldop: Unless and until they start nesting those things, as in {strong:Lor{i:e}m ipsum dol{i:o}r}. With Perl's extensions to regex, that would even be possible to do – in “pure” regex, you can only do it up to some predetermined nesting depth. Commented May 10, 2012 at 13:48
  • Rather than reinvent the wheel, why not just let them write HTML and forbid certain tags or use BBCode? There are lots of parsing options already available and some WYSIWYG editors for both HTML and BBCode. Commented May 10, 2012 at 15:18

3 Answers

1

This looks a lot like JSON-to-XML conversion.

{"strong":"Lorum ipsum dolar"} 

would become

<strong>Lorum ipsum dolar</strong>

and

{"strong":{"italic":"Lorum ipsum dolar"}}

would become

<strong>
<italic>Lorum ipsum dolar</italic>
</strong>

I'm not saying this is the answer, but you might want to look into it. The basic idea would be to parse your tags into a hierarchical structure and then render that structure back to HTML or whatever output language you use.
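As a rough illustration of that idea, here is a minimal sketch that parses the question's {tag:content} syntax into a small tree and then renders the tree as HTML. The names MarkupNode, MarkupTree, Parse and ToHtml are made up for this example. It assumes tag names consist of letters and digits only, applies no whitelist of permitted tags, and simply closes any tag left open at the end of the input.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

// Hypothetical node type: an element (Tag set), a text run (Text set),
// or the root container (both null).
public sealed class MarkupNode
{
    public string Tag;   // element name; null for the root and for text nodes
    public string Text;  // literal text; null for the root and for elements
    public readonly List<MarkupNode> Children = new List<MarkupNode>();
}

public static class MarkupTree
{
    public static MarkupNode Parse(string input)
    {
        int pos = 0;
        var root = new MarkupNode();
        ParseInto(root, input, ref pos, false);
        return root;
    }

    // Reads children into 'parent'. Inside a tag, stops at the '}' that
    // closes it; at the top level a stray '}' is kept as ordinary text.
    private static void ParseInto(MarkupNode parent, string s, ref int pos, bool nested)
    {
        var text = new StringBuilder();
        while (pos < s.Length && !(nested && s[pos] == '}'))
        {
            int colon = FindOpenerEnd(s, pos);
            if (colon < 0)
            {
                text.Append(s[pos++]);   // ordinary character
                continue;
            }
            Flush(parent, text);
            var child = new MarkupNode { Tag = s.Substring(pos + 1, colon - pos - 1) };
            pos = colon + 1;
            ParseInto(child, s, ref pos, true);
            if (pos < s.Length) pos++;   // skip the closing '}'
            parent.Children.Add(child);
        }
        Flush(parent, text);
    }

    // Returns the index of the ':' that ends a "{name:" opener at 'pos', or -1.
    private static int FindOpenerEnd(string s, int pos)
    {
        if (s[pos] != '{') return -1;
        int i = pos + 1;
        while (i < s.Length && char.IsLetterOrDigit(s[i])) i++;
        return (i > pos + 1 && i < s.Length && s[i] == ':') ? i : -1;
    }

    // Moves any buffered text into a text node under 'parent'.
    private static void Flush(MarkupNode parent, StringBuilder text)
    {
        if (text.Length == 0) return;
        parent.Children.Add(new MarkupNode { Text = text.ToString() });
        text.Clear();
    }

    // Renders the tree; text is HTML-encoded, tag names are emitted as parsed.
    public static string ToHtml(MarkupNode node)
    {
        if (node.Text != null) return WebUtility.HtmlEncode(node.Text);
        var sb = new StringBuilder();
        if (node.Tag != null) sb.Append('<').Append(node.Tag).Append('>');
        foreach (MarkupNode child in node.Children) sb.Append(ToHtml(child));
        if (node.Tag != null) sb.Append("</").Append(node.Tag).Append('>');
        return sb.ToString();
    }
}
```

Calling MarkupTree.ToHtml(MarkupTree.Parse("{strong:{italic:Lorum ipsum dolar}}")) should produce the nested <strong><italic>...</italic></strong> output shown above. Keeping the tree as an intermediate step makes it straightforward to validate or rewrite tags before rendering.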


1

So this will get you the tags and parts you are looking for; however, the way I turn those results into the final string is pretty ugly. It's really just the regex at the top that matters. Enjoy!

string test = "{strong:lorem ip{i:su{b:m}m}m dolar} {strong:so strong}";
Regex tagParse = new Regex(
    @"\{(?<outerTag>\w*)
        (?>
            (?<DEPTH>\{(?<innerTags>\w*))
            |
            (?<-DEPTH>\})
            |
            :?(?<innerContent>[^\{\}]*)
        )*
        (?(DEPTH)(?!))

        ", RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

MatchCollection matches = tagParse.Matches(test);
foreach (Match m in matches)
{
    StringBuilder sb = new StringBuilder();
    List<string> tags = new List<string>();
    tags.Add(m.Groups["outerTag"].Value);
    foreach (Capture c in m.Groups["innerTags"].Captures)
        tags.Add(c.Value);
    List<string> content = new List<string>();
    foreach (Capture c in m.Groups["innerContent"].Captures)
        content.Add(c.Value);
    if (tags.Count > 1)
    {
        for (int i = 0; i < content.Count; i++)
        {
            if (i >= tags.Count)
                sb.Append("</" + tags[tags.Count - (i - tags.Count + 1)] + ">");
            else
                sb.Append("<" + tags[i] + ">");
            sb.Append(content[i]);
        }
        sb.Append("</" + tags[1] + ">");
    }
    else
    {
        sb.Append("<" + tags[0] + ">");
        sb.Append(content[0]);
    }
    sb.Append("</" + m.Groups["outerTag"].Value + ">");
    Console.WriteLine(sb.ToString());
}  


0

Edit: Updated to work with nested tags and with multiple matches per input string. Restriction: the text inside a tag pair cannot contain a literal "{" or "}".

private string FormatInput(string input)
{
    const string patternNonGreedy = @"\{(?<tag>.+?):(\s*)(?<content>.*?)(\s*)}";
    const string patternGreedy = @"\{(?<tag>.+?):(\s*)(?<content>.*)(\s*)}";

    Match mtc = Regex.Match(input, patternGreedy);
    if (!mtc.Success)
        return input;

    string content = mtc.Groups["content"].Value;
    int braces = 0;
    foreach (char c in content)
    {
        if (c == '{')
            braces++;
        else if (c == '}')
        {
            if (braces > 0)
                braces--;
        }
    }

    if (braces == 0)
        return input.Substring(0, mtc.Index)
            + string.Format("<{0}>{1}</{0}>", mtc.Groups["tag"].Value, FormatInput(content))
            + input.Substring(mtc.Index + mtc.Length);

    mtc = Regex.Match(input, patternNonGreedy);
    Debug.Assert(mtc.Success);

    content = mtc.Groups["content"].Value;
    return input.Substring(0, mtc.Index)
        + string.Format("<{0}>{1}</{0}>", mtc.Groups["tag"].Value, content)
        + FormatInput(input.Substring(mtc.Index + mtc.Length));
}

Test examples:

string output1 = FormatInput("{strong:Lorum ipsum dolar}");
// output1: <strong>Lorum ipsum dolar</strong>

string output2 = FormatInput("{strong:{italic:Lorum ipsum dolar}}");
// output2: <strong><italic>Lorum ipsum dolar</italic></strong>

string output3 = FormatInput("{strong:Lor{i:e}m ipsum dol{i:o}r}");
// output3: <strong>Lor<i>e</i>m ipsum dol<i>o</i>r</strong>
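Inputs that combine nesting with a following sibling, such as {strong:{i:x}} {b:y}, appear to fall through to the non-greedy branch above, which stops at the first "}". One possible alternative is a single left-to-right pass with a stack of open tags. The sketch below is a rough illustration of that technique, not a drop-in replacement: the class name TagFormatter, the method name Format and the whitelist contents are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class TagFormatter
{
    // Assumed whitelist; adjust to whatever tags the CMS should permit.
    private static readonly HashSet<string> Allowed =
        new HashSet<string> { "strong", "em", "i", "b" };

    // Matches a "{name:" opener exactly at the current scan position (\G).
    private static readonly Regex Opener = new Regex(@"\G\{(\w+):");

    private static readonly char[] Braces = { '{', '}' };

    public static string Format(string input)
    {
        var output = new StringBuilder();
        // Open tags awaiting their '}'. A null entry marks a rejected
        // opener whose closing brace must be echoed back as plain text.
        var open = new Stack<string>();
        int pos = 0;
        while (pos < input.Length)
        {
            Match m = Opener.Match(input, pos);
            if (m.Success)
            {
                string tag = m.Groups[1].Value;
                if (Allowed.Contains(tag))
                {
                    open.Push(tag);
                    output.Append('<').Append(tag).Append('>');
                }
                else
                {
                    open.Push(null);
                    output.Append(WebUtility.HtmlEncode(m.Value));
                }
                pos += m.Length;
                continue;
            }
            if (input[pos] == '}' && open.Count > 0)
            {
                string tag = open.Pop();
                if (tag != null) output.Append("</").Append(tag).Append('>');
                else output.Append('}');
                pos++;
                continue;
            }
            // Literal run: this character plus everything up to the next brace.
            int next = input.IndexOfAny(Braces, pos + 1);
            if (next < 0) next = input.Length;
            output.Append(WebUtility.HtmlEncode(input.Substring(pos, next - pos)));
            pos = next;
        }
        // Close anything the author forgot to close.
        while (open.Count > 0)
        {
            string tag = open.Pop();
            if (tag != null) output.Append("</").Append(tag).Append('>');
        }
        return output.ToString();
    }
}
```

With this sketch, TagFormatter.Format("{strong:Lor{i:e}m ipsum dol{i:o}r}") should give the same result as output3 above, and a tag outside the whitelist, such as {script:x}, is passed through as plain text instead of becoming markup.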

2 Comments

This has the same shortcoming that Christopher Creutzig pointed out earlier. It does not handle nested tags.
@FlyingStreudel Thanks for pointing that out. I've updated it to use a recursive method and a non-greedy regex match to solve that.
