I have some HTML source code containing customer data that needs to be cleaned of HTML tags before it is deployed with a line-joining string split.

I want to be able to target specific types of information. For example, a customer may have a list of categories on their page, and each category sits inside an easily distinguishable tag:

<span _ngcontent-jal-c67="" class="category-name">Cryptocurrency</span>

Would it be possible to remove everything that is not nested inside a similar HTML tag?

Let's say, for example, I want everything that occurs inside <span *>*</span>, so that every non-<span></span> tag and its contents would be removed. The contents of all the <span ***>***</span> elements would stay, without the tag. Is that something I could do in PowerShell? Let's avoid paste.exe and Cygwin-type tools. I'm looking for a standard, native Windows approach (cmd or PowerShell).

Again, I want to remove all tags.

The only contents that I keep should be those found inside a specific tag, like <span _ngcontent-jal-c68="" class="category-name">Shopping</span>: everything that fits the <span *>*</span> profile.

Leave only the contents, no tag.

from: <span _ngcontent-jal-c32="" class="category-name">Home and Graden</span>

to: Home and Graden

I'm really looking for a way to do this in PowerShell without needing to install anything or make any significant changes to the OS (Windows 10).

2 Answers

Instead of using delicate regular expressions, you might decode the markup with the [System.Net.WebUtility]::HtmlDecode method and cast the result to [Xml], which works as long as the markup is well-formed XML:

$Html = '<span _ngcontent-jal-c67="" class="category-name">Cryptocurrency</span>'
([Xml][System.Net.WebUtility]::HtmlDecode($Html)).GetElementsByTagName('span').'#text'

Result:

Cryptocurrency
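
For a page with several categories, the same idea might be extended roughly as in the sketch below. It assumes the fragment becomes well-formed XML once it is wrapped in a single root element; the root wrapper and the sample div are made up for illustration. Real-world HTML often is not well-formed, in which case the [Xml] cast throws. HtmlDecode is left out here because the sample contains no entities, and decoding &amp; before the cast would leave a bare & that XML rejects.

```powershell
# Hypothetical sample input: two category spans plus an unrelated element.
$Html = @'
<div class="header">Ignore me</div>
<span _ngcontent-jal-c67="" class="category-name">Cryptocurrency</span>
<span _ngcontent-jal-c68="" class="category-name">Shopping</span>
'@

# Wrap the fragment in one root element so the [Xml] cast accepts it.
$Xml = [Xml]"<root>$Html</root>"

# XPath: every span whose class attribute is exactly 'category-name'.
$Categories = $Xml.SelectNodes("//span[@class='category-name']") |
    ForEach-Object { $_.InnerText }

# Emit one category per line.
$Categories
```

Changing the XPath expression (for example to //td/div/span) is one way to target a more deeply nested element without spelling out the full path.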

2 Comments

I tried this, but I couldn't figure out how to pass multiple parameters. What if my path is /html/body/app-root/ng-component/div[2]/ng-component/ng-component/div[1]/section[2]/ng-component/ng-component/app-loader-box/div/div/div[1]/div[2]/div/app-url-list-table/div/div/app-url-list-form/tr[2]/td[1]/div/span? Commercial-grade websites usually have way too much data to use only one element. I have to be able to pass multiple tags into the function if I want to get anywhere. @iRon
Please, add an example of a more complex html element (and the expected results) to the question or create a new question.

Please try to investigate the problem before asking on Stack Overflow. Did you know there is a -replace operator in PowerShell which allows you to use RegEx? Did you identify that RegEx might help you with your problem?

Anyway, here is one approach you could take:

$html = '<span _ngcontent-jal-c32="" class="category-name">Home and Graden</span>'
if ($html -match '(<span.*>)(?<Category>.+)(</span>)') { 
    $Matches.Category 
}

Home and Graden

The -match operator tests a string against a RegEx. The RegEx (<span.*>)(?<Category>.+)(</span>) creates three groups, one of which is named Category. The category sits between the span tags. For your input, you have to be sure that every category sits inside a span tag. If -match returns true, the automatic variable $Matches is filled. Since we named the second group Category, we can easily access it as a property with $Matches.Category.
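
Note that -match only reports the first match, and the greedy .* and .+ can run across several tags when more than one span sits in the same string. A minimal sketch of one way to collect every category is shown below. The sample input and the tightened pattern are illustrative assumptions, not part of the approach above, and it only holds if the span contents contain no nested tags.

```powershell
# Hypothetical sample input: an unrelated tag followed by two category spans.
$html = '<div>x</div><span class="category-name">Shopping</span><span class="category-name">Home and Graden</span>'

# [^>]* stays inside the opening tag; [^<]+ stops at the next tag,
# so each match covers exactly one span element.
$pattern = '<span\b[^>]*>(?<Category>[^<]+)</span>'

# [regex]::Matches returns every match, not just the first one.
$categories = [regex]::Matches($html, $pattern) |
    ForEach-Object { $_.Groups['Category'].Value }

# Join the results into one string, one category per line.
$categories -join [Environment]::NewLine
```

Everything outside the span elements, such as the div in this sample, is simply never matched and therefore dropped.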

Alternatively, and for more complex HTML files even preferably, you can parse HTML with PowerShell; see "Powershell Tip: Parsing HTML from a local File or a String".
