I'm trying to delete the lines containing the "unwanted" class from an HTML file using a PowerShell script:

<a class="unwanted" href="http://www.mywebsite.com/rest/of/url1" target="_blank">my_file_name1</a><br>
<a class="mylink" href="http://www.mywebsite.com/rest/of/url2" target="_blank">my_file_name2</a><br>
<a class="unwanted" href="http://www.mywebsite.com/rest/of/url3" target="_blank">my_file_name3</a><br>

Currently I'm replacing strings using this script:

$s = "old string"
$r = "new string"

Get-ChildItem "C:\Users\User\Desktop\Folder" -Recurse -Filter *.html | % {
  (Get-Content $_.FullName) `
    | % { $_ -replace [regex]::Escape($s), $r } `
    | Set-Content $_.FullName
}

3 Answers

2

Since your question is also tagged for cmd.exe/batch scripting, I want to contribute a related answer.

cmd.exe/batch scripting does not understand the HTML file format. However, if your HTML files look like the sample data you provided (each <a> tag and its corresponding </a> tag are on a single line, with nothing else on that line except <br>), the following command line could work for you. It assumes the HTML file to process is called classes.html and the modified data is to be written to classes_new.html:

> "classes_new.html" findstr /V /I /L /C:"class=\"unwanted\"" "classes.html"

This only works if the string class="unwanted" occurs only in the <a> tags that need to be removed.


To process multiple files, the following batch script could be used, based on the above command line:

@echo off
setlocal EnableExtensions DisableDelayedExpansion

set "ARGS=%*"
setlocal EnableDelayedExpansion
for %%H in (!ARGS!) do (
    endlocal
    call :SUB "%%~H"
    setlocal
)
endlocal

endlocal
exit /B

:SUB file
if /I not "%~x1"==".html" if /I not "%~x1"==".htm" exit /B 1
findstr /V /I /L /C:"class=\"unwanted\"" "%~f1" | (> "%~f1" find /V "")
exit /B

The actual removal of lines is done in the sub-routine :SUB, which exits without changing anything if the file name extension is something other than .html or .htm. The main script loops through all the given command line arguments and calls :SUB for every single file. Note that this script does not create new files for the modified HTML contents; it overwrites the given HTML files.


1

Removing lines is even easier than replacing them. When outputting to Set-Content, simply omit the lines that you want removed. You can do this with Where-Object in place of your ForEach-Object (%).

Adapting your example:

$s = "unwanted regex"

Get-ChildItem "C:\Users\User\Desktop\Folder" -Recurse -Filter *.html | % {
  (Get-Content $_.FullName) `
    | where { $_ -notmatch $s } `
    | Set-Content $_.FullName
}
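
As a quick check of the filter above, the following sketch runs the same -notmatch test against the three sample lines from the question instead of real files. The pattern class="unwanted" and the variable names $lines and $kept are assumptions chosen for illustration; they are not part of the original script.

```powershell
# Sample lines copied from the question (in-memory input instead of a file)
$lines = @(
  '<a class="unwanted" href="http://www.mywebsite.com/rest/of/url1" target="_blank">my_file_name1</a><br>'
  '<a class="mylink" href="http://www.mywebsite.com/rest/of/url2" target="_blank">my_file_name2</a><br>'
  '<a class="unwanted" href="http://www.mywebsite.com/rest/of/url3" target="_blank">my_file_name3</a><br>'
)

# Assumed pattern: the literal attribute text, escaped so it is safe to use as a regex
$s = [regex]::Escape('class="unwanted"')

# Keep only the lines that do NOT match; @() forces an array even for a single result
$kept = @($lines | Where-Object { $_ -notmatch $s })

# Print the surviving lines (only the "mylink" line remains)
$kept
```

Trying the pattern on a few sample lines like this before running it over a whole folder is a cheap safeguard, since the real script overwrites each file in place.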

If you want literal matching instead of regex, replace the where clause with:

where { -not $_.Contains($s) } `

Note this uses the .NET method [String]::Contains(), not the PowerShell operator -contains. The latter tests whether a collection contains a given element and does not do substring matching.
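
To make the difference concrete, here is a small sketch comparing the two. The sample line and the variable names are made up for illustration.

```powershell
# Hypothetical sample line and search text
$line = '<a class="unwanted" href="#">x</a>'
$s    = 'class="unwanted"'

# .NET method: literal substring test
$viaMethod = $line.Contains($s)                  # True

# PowerShell operator: asks whether a collection has an element equal to $s.
# A single string acts as a one-element collection, so this is False.
$viaOperator = $line -contains $s                # False

# Unlike -match / -notmatch, String.Contains() is case-sensitive
$upperCase = $line.Contains('CLASS="UNWANTED"')  # False
```

The case-sensitivity point matters if the HTML files mix attribute casing; the -notmatch version is case-insensitive by default.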

2 Comments

That's exactly what i was looking for, +1
I've used .*unwanted as regex
-1

Try using multiline strings (here-strings) for your $s and $r. I tested with the HTML examples you posted as well and that worked fine.

$s = @"
old string
"@
$r = @"
new string
"@

Get-ChildItem "C:\Users\User\Desktop\Folder" -Recurse -Filter *.html | % {
  (Get-Content $_.FullName) `
    | % { $_ -replace $s, $r } `
    | Set-Content $_.FullName
}
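
One thing to keep in mind with the script above: -replace treats its first operand as a regular expression, which is why the script in the question wraps $s in [regex]::Escape(). A minimal sketch of the difference, using a made-up search string rather than anything from the question:

```powershell
# Hypothetical text containing a literal dot and a look-alike
$text = 'file.name and fileXname'
$s    = 'file.name'

# Unescaped: "." is a regex wildcard, so both occurrences are replaced
$unescaped = $text -replace $s, 'X'                   # 'X and X'

# Escaped: "." matches only a literal dot, so only the first is replaced
$escaped = $text -replace [regex]::Escape($s), 'X'    # 'X and fileXname'
```

If the search string contains characters such as . ? ( ) or [ ], keeping the [regex]::Escape() call is probably the safer choice.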
