I'm trying to match a regex and replace the match in a file. My regex is as follows (which matches fine):

$regex1 = [regex] "index.php\?title\=[a-zA-Z0-9_]*"

a redacted excerpt of the source file I'm trying to run the replace in:

<content:encoded>
    <![CDATA[<a href="http://[redacted]/index.php?title=User_Manual">
    <a href="http://[redacted]/index.php?title=User_Manual">The software</a>, running on the 
    <a href="http://[redacted]/index.php?title=Mobile_Device">POS Device</a>, enables 
    <a href="http://[redacted]/index.php?title=Logging_In">log in</a>, 
    <a href="http://[redacted]/index.php?title=Selecting_Journey">select a journey</a>

and the PowerShell replacement I want to apply to each match:

.Replace("index.php?title=","").Replace("_","-").ToLower()

I've extracted all the matches, cast the $allmatches array to a new array (so it would be writable), and then updated the values in the new array. I cannot work out how to write this back to the file, and don't seem to be able to find any posts or documentation to help with this. My code to date:

$regex1 = [regex] "index.php\?title\=[a-zA-Z0-9_]*"

$contentOf=Get-Content $contentfile
$allmatches=$regex1.Matches($contentOf)
$totalcount=$allmatches.Count

$newArray = $allmatches | select *

for($i=0;$i -lt $totalCount;$i++) {
    $newvalue=(($allmatches[$i].Value).Replace("index.php?title=","").Replace("_","-").ToLower())
    $newArray[$i].Value = $newvalue
}

At this point I have an array $newArray with all the regex matches and replacements, but no idea how to write this back to my file/variable e.g $newarray[0]:

Groups   : {0}
Success  : True
Name     : 0
Captures : {0}
Index    : 4931
Length   : 40
Value    : user-manual

Of course I may be going about this completely the wrong way. As for why I've chosen PowerShell: it is simply where I've spent most of my scripting time these days. I'm sure it would be achievable in shell; it would just take me longer to get there.

2 Answers

This is actually a good place to use capturing groups in your regex and .Net Substitutions in Regular Expressions. The modified regular expression is:

$regex = [regex] 'index\.php\?title\=(\p{L}*)_(\p{L}*)'
  • \p{L} matches any letter (as defined by Unicode, not just A-Z).
  • (\p{L}*) is a numbered capture group that contains only letters.
  • The replacement pattern string uses $1 and $2 to refer to each capturing group: '$1-$2'. Note the single quotes '' on the replacement string, which prevent PowerShell from expanding $1 and $2 as variables.
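
The quoting point in the last bullet is easy to check in isolation. The following is a minimal sketch, assuming strict mode is off and no PowerShell variables named 1 or 2 exist in the session; $text, $pattern, $single and $double are names made up for the illustration:

```powershell
# Hypothetical sample title, not taken from the real file
$text = 'User_Manual'
$pattern = '(\p{L}*)_(\p{L}*)'

# Single quotes: $1 and $2 reach the regex engine as group references
$single = $text -replace $pattern, '$1-$2'

# Double quotes: PowerShell expands $1 and $2 itself first (both undefined here),
# so the regex engine only ever sees the replacement string "-"
$double = $text -replace $pattern, "$1-$2"

$single   # User-Manual
$double   # -
```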

Simple substitution

If we only cared about the capture groups as-is we could just use this code:

$testContent = @'
<content:encoded>
    <![CDATA[<a href="http://[redacted]/index.php?title=User_Manual">
    <a href="http://[redacted]/index.php?title=User_Manual">The software</a>, running on the
    <a href="http://[redacted]/index.php?title=Mobile_Device">POS Device</a>, enables
    <a href="http://[redacted]/index.php?title=Logging_In">log in</a>,
    <a href="http://[redacted]/index.php?title=Selecting_Journey">select a journey</a>
'@
$regex = [regex] 'index\.php\?title\=(\p{L}*)_(\p{L}*)'
$modifiedContent = [regex]::Replace($testContent, $regex, '$1-$2')

Which results in:

<content:encoded>
    <![CDATA[<a href="http://[redacted]/User-Manual">
    <a href="http://[redacted]/User-Manual">The software</a>, running on the
    <a href="http://[redacted]/Mobile-Device">POS Device</a>, enables
    <a href="http://[redacted]/Logging-In">log in</a>,
    <a href="http://[redacted]/Selecting-Journey">select a journey</a>

The issue with this approach is that it does not allow us to change the groups to lowercase. .NET substitution patterns have no syntax for case conversion. Fortunately, .NET provides a mechanism that lets us take care of more complex situations like this one.

Using a MatchEvaluator delegate

A MatchEvaluator is a delegate accepted by some overloads of the Regex.Replace method, for situations where normal substitutions fall short. It is called once per match and returns the replacement text. In PowerShell it can be a simple script block that takes a single Match parameter:

$testContent = @'
<content:encoded>
    <![CDATA[<a href="http://[redacted]/index.php?title=User_Manual">
    <a href="http://[redacted]/index.php?title=User_Manual">The software</a>, running on the
    <a href="http://[redacted]/index.php?title=Mobile_Device">POS Device</a>, enables
    <a href="http://[redacted]/index.php?title=Logging_In">log in</a>,
    <a href="http://[redacted]/index.php?title=Selecting_Journey">select a journey</a>
'@
$regex = [regex] 'index\.php\?title\=(\p{L}*)_(\p{L}*)'
$MatchEvaluator = {
    param($match)
    # Lowercase each captured word, then join them with a hyphen
    $group1 = $match.Groups[1].Value.ToLower()
    $group2 = $match.Groups[2].Value.ToLower()
    return "$group1-$group2"
}
[regex]::Replace($testContent, $regex, $MatchEvaluator)

Which gives the desired result:

<content:encoded>
    <![CDATA[<a href="http://[redacted]/user-manual">
    <a href="http://[redacted]/user-manual">The software</a>, running on the
    <a href="http://[redacted]/mobile-device">POS Device</a>, enables
    <a href="http://[redacted]/logging-in">log in</a>,
    <a href="http://[redacted]/selecting-journey">select a journey</a>

Replacing the contents of a file

The final code would look like this:

# Load the file as a single string
$content = Get-Content $contentfile -Raw

# Regex to replace, with capturing groups
$regex = [regex] 'index\.php\?title\=(\p{L}*)_(\p{L}*)'

# Delegate to transform capture groups into lowercase
$MatchEvaluator = {
    param($match)
    $group1 = $match.Groups[1].Value.toLower()
    $group2 = $match.Groups[2].Value.toLower()
    return "$group1-$group2"
}

# Replace all matches of the regular expression with delegate
$modifiedContent = [regex]::Replace($Content, $regex, $MatchEvaluator)

# Overwrite existing file
$modifiedContent | Out-File $contentfile
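
The pattern above only matches titles made of exactly two runs of letters joined by a single underscore. If titles can also contain digits or several underscores (as the character class in the question, [a-zA-Z0-9_]*, suggests), one possible variation is to capture the whole title in a single group and do the conversion inside the delegate. This is a sketch under that assumption, not a drop-in replacement; $sample, $pattern, $evaluator and the example.invalid URL are made up for the illustration:

```powershell
# Hypothetical sample input; not taken from the real file
$sample = '<a href="http://example.invalid/index.php?title=Selecting_A_Journey_2">x</a>'

# One capture group holding the whole title (letters, digits, underscores)
$pattern = 'index\.php\?title=([a-zA-Z0-9_]*)'

# Explicit cast so the script block is unambiguously treated as a MatchEvaluator
[System.Text.RegularExpressions.MatchEvaluator] $evaluator = {
    param($match)
    # Underscores to hyphens, then lowercase, applied to the captured title only
    $match.Groups[1].Value.Replace('_', '-').ToLower()
}

$result = [regex]::Replace($sample, $pattern, $evaluator)
$result   # <a href="http://example.invalid/selecting-a-journey-2">x</a>
```

Because the conversion runs only on the captured group, underscores and capital letters elsewhere in the file are left alone.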

2 Comments

Thank you for a very thorough response. The use of capture groups here enabled better control over the replacement needed, and as such I applied this solution with minimal edits for my use case.
BEST ANSWER for someone who has spent 2 hours googling this stuff.

I've extracted all the matches, cast the $allmatches array to a new array (so it would be writable), and then updated the values in the new array.

You don't need to do this; the problem is much simpler to solve. All you need to do is use Get-Content on the original file and iterate over each line. You can also use the -replace operator instead of the [Regex] class to handle the replacement:

(Get-Content $contentFile) | ForEach-Object {
  ( $_ -replace 'index\.php\?title=' ) -replace '_', '-'
} | Set-Content $contentFile

The parentheses around Get-Content make PowerShell read the whole file before the pipeline starts, so Set-Content can write back to the same file without hitting a file that is still open for reading. For each line, we replace index.php?title= with an empty string (you can omit the second argument to -replace as shorthand for this), then replace every _ with - on that line. The script block emits the result instead of assigning it to $_, so the changed lines flow down the pipeline to Set-Content, which writes them back to the original file. Note that this replaces every underscore on a line, not only those inside the matched title, and it does not lowercase anything.
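
If the change has to be limited to the matched title, one option is to pass a script block as the replacement operand of -replace. This is a minimal sketch, assuming PowerShell 7 (script-block replacement is not available in Windows PowerShell 5.1); $line, $fixed and the example.invalid URL are made up for the illustration:

```powershell
# Hypothetical input line; the underscore in the link text must stay untouched
$line = '<a href="http://example.invalid/index.php?title=Mobile_Device">POS_Device</a>'

# Inside the script block, $_ is the current Match object
$fixed = $line -replace 'index\.php\?title=(\w*)', {
    $_.Groups[1].Value.Replace('_', '-').ToLower()
}

$fixed   # <a href="http://example.invalid/mobile-device">POS_Device</a>
```

The same expression can be used inside the ForEach-Object script block in place of the two chained -replace calls.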


As an aside, when you use the -match operator (we didn't use it above) to match a single string against a regular expression, you can inspect the automatic $Matches variable to learn how the expression matched. It describes the first match only, so it is similar to, but less complete than, what [Regex]::Matches returns.
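
A small sketch of that aside, using a made-up input line and a named capture group (title) that is not part of the original pattern:

```powershell
# Hypothetical single line of input
$line = '<a href="http://example.invalid/index.php?title=Logging_In">log in</a>'

# For a single string, -match returns $true or $false and fills $Matches on success
if ($line -match 'index\.php\?title=(?<title>\w+)') {
    $whole = $Matches[0]         # the full matched text
    $title = $Matches['title']   # the named capture group
    "$whole -> $title"           # index.php?title=Logging_In -> Logging_In
}
```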

2 Comments

Thank you for your response. Your post made me re-evaluate how I was tackling this, as clearly I had led myself down a rabbit hole with over complicating what is indeed a far simpler solution. I amended your code suggestion to the following: Get-Content $contentFile | Foreach-Object { if($_ -match 'index.php\?title=') { $_ = (( $_ -replace 'index.php\?title=' ) -replace '_', '-').ToLower() | Out-File $newFile -Append } else { $_ | Out-File $newFile -Append } } and this is now working as desired.
I later realised a pitfall in this method due to the unconditional replacement of all _'s with -'s in your original and then all lower case being applied on a line where the match was made in my edit. At this stage, I thought the simpler method didn't allow for the finer control needed in the pattern matching.
