I am new to PowerShell and have not found a Stack Overflow question or a documentation reference that gets me all the way to a working solution. If one already exists and I overlooked it, I would be grateful for a pointer.

In a text file is a string like this:

<span><span><span><span><span></span></span></span></span></span>

The number of <span> and </span> varies from file to file. For example, in some files it is like this:

<span></span>

Yet in others it is like this:

<span><span></span></span>

And so on. There are likely never going to be more than 24 of each in a string.

I want to eliminate all strings like this in the text file, yet preserve the </span> in strings like this:

<span style="font-weight:bold;">text</span>

There may be many variations on that kind of string in the text file, for example <span style="font-size: 10px; font-weight: 400;">text</span>, and I don't know beforehand which variations will be included in the text file.

This partially works...

$original_file = 'in.txt'
$destination_file = 'out.txt'

(Get-Content $original_file) | Foreach-Object {
    $_ -replace '<span>', '' `
       -replace '</span>', ''
} | Set-Content $destination_file

...but obviously results in something like <span style="font-weight:bold;">text.

In the PowerShell script above I can use

    $_ -replace '<span></span>', '' `

But of course it only catches the <span></span> in the middle of the string because, as it is written now, it does not loop.

I know it is silly to do something like this

$original_file = 'in.txt'
$destination_file = 'out.txt'

(Get-Content $original_file) | Foreach-Object {
    $_ -replace '<span></span>', '' `
       -replace '<span></span>', '' `
       -replace '<span></span>', '' `
       -replace '<span></span>', '' `
       -replace '<span></span>', '' 
} | Set-Content $destination_file

Because the nested string collapses inward on each pass, with every replacement exposing a new innermost <span></span> that can then be removed, the best solution I can think of is to loop over the file until no instances of <span></span> remain.

I feel like adding logic along these lines is necessary:

    foreach ($i in 1..24) {
        Write-Host $i
    }

But I have not been able to incorporate it into the script successfully.

If this is the wrong approach entirely I would be grateful to know.

The reason for PowerShell is that my team prefers it for scripts included in an Azure DevOps release pipeline.

Thanks for any ideas or help.

5 Answers

1

If you just want to remove any number of empty spans use a Regular Expression with a group and a quantifier:

$original_file = 'in.txt'
$destination_file = 'out.txt'

(Get-Content $original_file) -replace "(<span>)+(</span>)+" | 
 Set-Content $destination_file
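
As a quick sanity check, the same pattern can be tried on in-memory strings before pointing it at a file. This is only a sketch: the sample lines are taken from the question, and the variable names $samples and $cleaned are made up for illustration.

```powershell
# Sample lines modelled on the question; $samples is a hypothetical name.
$samples = @(
    '<span><span><span><span><span></span></span></span></span></span>',
    '<span></span>',
    'keep <span style="font-weight:bold;">text</span> this'
)

# With no replacement string, -replace substitutes an empty string.
# Applied to an array, it processes every element and returns an array.
$cleaned = $samples -replace '(<span>)+(</span>)+'

# The first two elements become empty strings; the third is unchanged.
$cleaned
```

Note that the pattern does not require the number of opening and closing tags to match, so an input such as <span style="x"><span></span></span> would come out as <span style="x">. If that case can occur in your files, one of the loop-based approaches may be safer.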

1

Try the following. I've added some comments to clarify things.

# always use absolute paths if possible
$original_file = 'c:\tmp\in.txt'
$destination_file = 'c:\tmp\out.txt'

$patternToBeRemoved = '<span></span>'

# store the file contents in a variable
$fileContent = Get-Content -Path $original_file

# save the result of these operations in a new variable and iterate through each line
$newContent = foreach($string in $fileContent) {
    # while the pattern you don't want is found it will be removed
    while($string.Contains($patternToBeRemoved)) {
        $string = $string.Replace($patternToBeRemoved, '')
    }
    # when it's no longer found the new string is returned
    $string
}

# save the new content in the destination file
Set-Content -Path $destination_file -Value $newContent
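
If the same clean-up is needed in more than one place, the while loop above could be wrapped in a small function. The following is a minimal sketch under that assumption; Remove-EmptySpan is a hypothetical name, not a built-in cmdlet, and the test strings are made up from the question's examples.

```powershell
# Remove-EmptySpan is a hypothetical helper, not part of PowerShell.
function Remove-EmptySpan {
    param([string]$Text)

    $pair = '<span></span>'
    # Each pass removes the innermost empty pairs, which exposes the
    # next layer of nesting for the following pass.
    while ($Text.Contains($pair)) {
        $Text = $Text.Replace($pair, '')
    }
    $Text
}

$resultNested = Remove-EmptySpan -Text '<span><span><span></span></span></span>'
$resultStyled = Remove-EmptySpan -Text '<span style="font-weight:bold;">text</span>'

# $resultNested is an empty string; $resultStyled is returned unchanged.
$resultStyled
```

Because .Contains() and .Replace() do literal, case-sensitive matching, no regex escaping is needed here, but a tag written as <SPAN></SPAN> would not be caught.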

1 Comment

Thank you for the explanatory comments @guenther
0
$original_file = 'in.txt'
$destination_file = 'out.txt'

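# Collect the cleaned lines first; calling Set-Content inside the loop
# would overwrite the file on every iteration and keep only the last line.
$newContent = ForEach ($Line in (Get-Content $original_file)) {
    Do {
        $Line = $Line -replace '<span></span>',''
    } While ($Line -match '<span></span>')
    $Line
}
Set-Content -Path $destination_file -Value $newContent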

0

You can use a regular expression together with the -replace operator to strip all <span>optional content</span> pairs from a string. That means all pairs where the opening tag does not specify any attributes.

$content = '<span></span><span><span><span style="font-weight:bold;">Foo</span></span></span>'
$regex = '<span>(.*?)</span>'    
while ($content -match $regex)
{
    $content = $content -replace $regex,'$1'
}
Write-Output $content

The result will be:

<span style="font-weight:bold;">Foo</span>

The while loop takes care of your nested occurrences of the <span></span> pair.
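
To make the "optional content" part concrete, here is a small sketch of what happens when an attribute-less span actually contains text. The sample string and the variable name $line are made up for illustration.

```powershell
$regex = '<span>(.*?)</span>'

# One attribute-less span with content, one span with a style attribute.
$line = '<span>plain</span> and <span style="font-size: 10px;">styled</span>'

while ($line -match $regex)
{
    # '$1' keeps whatever was between the tags and drops the tags themselves.
    $line = $line -replace $regex, '$1'
}

# plain and <span style="font-size: 10px;">styled</span>
Write-Output $line
```

So the text inside a bare <span> is kept while its tags are removed. If bare spans with content must stay untouched, matching only the literal <span></span> may be the better fit.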

0
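$content = '<span></span><span><span><span style="font-weight:bold;">Foo</span></span></span>'
# Match a <span> that carries attributes (whitespace after "span")
# and has no nested tag before its closing </span>
$regex = '<span\s+[^<]+</span>'
# -match fills the automatic $Matches variable; the Boolean result is discarded
$null = $content -match $regex

# Output the first match only: <span style="font-weight:bold;">Foo</span>
$Matches[0]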

1 Comment

Welcome to Stack Overflow. While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value.
