
I've got strings like these:

| Released    = {{start-date|June 14, 1972}}
| Released    = {{Start date|1973|03|01|df=y}} 

I'd like to replace every | inside {{ }} with ^, so that the strings become:

| Released    = {{start-date^June 14, 1972}}
| Released    = {{Start date^1973^03^01^df=y}} 

I can't use a plain substring replacement because there are | symbols outside {{ }} that must be left intact. And because I don't know in advance how many parts the string inside {{ }} has, I can't use something like s/{{(.+?)\|(.+?)}}/{{$1^$2}}/.

I suppose I need to use some kind of recursion here?

4
  • I'll work on an answer momentarily, but I just couldn't resist the comment "you are scraping something from Wikipedia, right?" Commented Jul 12, 2011 at 5:54
  • And if that's the case, it will probably behoove you to know that the only 100%-working parser for wikitext is MediaWiki itself (and even that can be buggy sometimes). Commented Jul 12, 2011 at 6:01
  • Yes. I know about dbpedia.org and some other similar resources but can't use them because data is too outdated for my task. Commented Jul 12, 2011 at 6:02
  • I don't need to parse all wikitext features and definitely don't need to render it to HTML… Commented Jul 12, 2011 at 6:03

3 Answers

5

A simple solution:

s/\|(?=[^{}\n]*}})/^/g
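As a quick check, here is a minimal sketch that applies this lookahead to the two sample lines from the question. The sub name `caret_pipes_simple` is hypothetical, and the braces are backslash-escaped only as a precaution for Perl versions that are stricter about literal braces in patterns.

```perl
use strict;
use warnings;

# Hypothetical helper name, used only for this illustration.
sub caret_pipes_simple {
    my ($text) = @_;
    # A pipe is replaced only when "}}" follows it with no brace or
    # newline in between, i.e. it appears to sit inside {{ ... }}.
    $text =~ s/\|(?=[^{}\n]*\}\})/^/g;
    return $text;
}

print caret_pipes_simple('| Released    = {{start-date|June 14, 1972}}'), "\n";
# prints: | Released    = {{start-date^June 14, 1972}}
print caret_pipes_simple('| Released    = {{Start date|1973|03|01|df=y}}'), "\n";
# prints: | Released    = {{Start date^1973^03^01^df=y}}
```

Note that the lookahead only looks forward: a pipe followed by a stray }} with no opening {{ would still be replaced, and a pipe followed by a single brace inside the tag would be skipped.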

Even simpler solution, but probably broken in many cases:

s/(?!^)\|/^/gm

Here is a somewhat more robust regex:

s/(?:\G(?!^)(?:(?>[^|]*?}})(?>.*?{{))*|^(?>.*?{{))(?>[^|]*?(?=}}|\|))\K\|(?=.*?}})/^/gs;

Commented:

s/
(?:
  \G(?!^)                       # inside of a {{}} tag
  (?: (?>[^|]*?}}) (?>.*?{{) )* # read till we find a | in another tag if none in current
  |
  ^(?>.*?{{)                    # outside of tag, parse till in
)
(?> [^|]*? (?=}}|\|) )          # eat till a | or end of tag
\K                              # don't include stuff to the left of \K in the match
\|                              # the |
(?=.*?}})                       # just to make sure the tag is closed
/^/gsx;

Input:

|}}
| Re|eased    = {{start-date|June 14^, {|1972}|x}}
| Released    = {{Start date}|1973|03|01}|df=y|}}
| || {{|}} {{ |

Output:

|}}
| Re|eased    = {{start-date^June 14^, {^1972}^x}}
| Released    = {{Start date}^1973^03^01}^df=y^}}
| || {{^}} {{ |

Example: http://ideone.com/fbY2W


2 Comments

Thanks a lot! Didn't know about lookaheads :( Slightly modified to my needs — works fine!
+1 for a nice obfuscated code example, 12 lines of comments and I still cannot understand it. I think one line per character would do it :D
3

This may not be the most concise way to do it, but it's the first working method I came up with.

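my $new;
# The capturing group makes split return the {{...}} pieces as well.
for ( split /({{.*?}})/ ) {
    s/\|/^/g if /^{{/;   # only touch pieces that start with {{
    $new .= $_;
}
$_ = $new;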
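For reuse, the same loop can be wrapped in a sub. This is only a sketch under a few assumptions: the name `caret_pipes_split` is made up, the braces are escaped as a precaution, tags are assumed to sit on a single line, and the check is slightly stricter than the loop above (a piece must both start with {{ and end with }}) so that an unterminated {{ is left alone. The test string is the one from the comments below.

```perl
use strict;
use warnings;

# Hypothetical helper name; wraps the split-based loop in a sub.
sub caret_pipes_split {
    my ($text) = @_;
    my $out = '';
    # The capturing group makes split return the {{...}} delimiters too,
    # so the pieces alternate between plain text and complete tags.
    for my $piece ( split /(\{\{.*?\}\})/, $text ) {
        # Only translate pieces that are a complete {{ ... }} tag.
        $piece =~ tr/|/^/ if $piece =~ /\A\{\{.*\}\}\z/;
        $out .= $piece;
    }
    return $out;
}

print caret_pipes_split('A {{b|c}} c|d {{e|f|}} g h'), "\n";
# prints: A {{b^c}} c|d {{e^f^}} g h
```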

4 Comments

This does not work because you have a greedy .* in your split regex. When I apply it to "A {{b|c}} c|d {{e|f|}} g h" I get "A {{b^c}} c^d {{e^f^}} g h" which incorrectly turns the pipe between c and d into a caret.
Replace the split with split /({{.*?}})/, that should do it.
@Ray Toal: Good catch. Updated my answer.
@Flimzy I knew you would. +1 now. Nice answer. I was sort of surprised it works without escaping the left brace; technically that is a metacharacter.
2
s{({{.*?}})}
 {my $x = $1;
  $x =~ tr/|/^/;
  $x
 }ge;
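A self-contained variant of the same idea, as a sketch: each {{...}} match is handed to a small helper through the /e flag. Both sub names are hypothetical, and the slash delimiters with escaped braces are just one way to write it; like the substitution above, it assumes tags are not nested and do not span lines.

```perl
use strict;
use warnings;

# Hypothetical helper: turn every pipe in one tag into a caret.
sub pipes_to_carets {
    my ($tag) = @_;      # work on a copy, since $1 is read-only
    $tag =~ tr/|/^/;
    return $tag;
}

# Hypothetical wrapper around the /e substitution.
sub caret_pipes_eval {
    my ($text) = @_;
    # /e evaluates the replacement as code for each {{...}} match.
    $text =~ s/(\{\{.*?\}\})/pipes_to_carets($1)/ge;
    return $text;
}

print caret_pipes_eval('| Released    = {{start-date|June 14, 1972}}'), "\n";
# prints: | Released    = {{start-date^June 14, 1972}}
```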
