1

I am completely new to writing PowerShell scripts. So far I have been using plain batch for my purpose, as this is a requirement of my company. Inside this batch file I am using nested for loops to compare two .txt files. In detail, I want to do the following:

  • File 1 contains lots of strings. Each string is on a separate line, preceded by a number and a semicolon, like so: 658;RMS
  • File 2 is some long text.

The aim is to count the number of occurrences of each string from File 1 in File 2, e.g. RMS is counted 300 times.

As my previous code has some huge drawbacks concerning runtime (File 1 has approx. 400 lines and File 2 has 500,000), I read that Select-String from PowerShell is much more efficient. However, after reading some tutorials it is still not clear to me how to proceed here, apart from the fact that I have to run the PowerShell code inside my .bat. My biggest problem is that I am not sure how and where to place my 'variables', i.e. the two input files 1 and 2.

So far I have tested the Select-String method like this:

powershell -command "& {Select-String -Path *.txt -Pattern "RMS"}"

My assumption would be to make use of piping, so something like this:

powershell -command "& {<<path to file one, should read line by line>> | Select-String -Path File2.txt -Pattern "value of file 1"}"

However, I cannot get this to work. Is PowerShell expecting some kind of PSObject before the first pipe?

3 Answers

3

For optimal performance, I would approach this task like so.

  • Read the file with the terms as a CSV (it is a CSV, with a ; delimiter)
  • Read the other file into a string
  • For each term, count how often it can be found in the target string (using .IndexOf())

For example

$data = Import-Csv "file1.txt" -Delimiter ";" -Header ID,Term 
$target = Get-Content "file2.txt" -Raw
$counts = @{}

foreach ($term in $data.Term) {
    $index = -1
    $count = 0
    do {
        $index = $target.IndexOf($term, $index + 1)
        if ($index -gt -1) { $count++ } else { break; }
    } while ($true);
    $counts[$term] = $count
}

$counts 

Notes

  • Import-Csv will, by default, use the first line of the input file as the header. The -Header parameter supplies the column names for a file that has no header line, as is the case here. If your file already has a header, you can remove the -Header parameter.
  • Get-Content will read the input file into an array of lines by default. But for this approach, having the entire file as one big string is the right thing - that's what -Raw does.
  • @{} creates an empty hashtable
  • $data.Term will access one column of the CSV
  • .IndexOf() is case sensitive. By default, PowerShell is case-insensitive, but native .NET methods like this one will not change their behavior. This might or might not be what you need - use .ToLower() on the $target and the $term if you don't care about case.
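
As an alternative to the .ToLower() suggestion in the last note, .NET's String.IndexOf also has an overload that takes a StringComparison value. The following is only a minimal sketch of the same counting loop using that overload; the sample text and term list are made up for illustration and stand in for file2.txt and the Term column of file1.txt.

```powershell
# Hypothetical sample data (assumption, not the real input files)
$target = "RMS rms Rms XRMS"
$terms  = 'RMS', 'ABC'

$counts = @{}
foreach ($term in $terms) {
    $index = -1
    $count = 0
    # OrdinalIgnoreCase lets IndexOf ignore case without building lowercased copies of the text
    while (($index = $target.IndexOf($term, $index + 1, [StringComparison]::OrdinalIgnoreCase)) -gt -1) {
        $count++
    }
    $counts[$term] = $count
}
$counts
```

Like the original loop, this counts substrings, so the "RMS" inside "XRMS" is counted as well.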

9 Comments

I tested your approach as well and it's quite fast :). Is there some easy modification so that it only saves those terms with a count higher than zero to $counts? Also, I have to modify the search expression with regular expressions so that it only counts exact matches. As I am not familiar with PowerShell, where would be the right place to add this in your code?
"is there some easy modification so it will only save those terms with a count higher than zero to $counts" - Yes, there is, and I am sure you will find it. It's not difficult. :) -- "I have to modify the search expression with regular expressions so it only counts exact matches." - Huh? The above code does only count exact matches. Regular expressions are for situations where you don't want exact matches.
Oh okay, I forgot to mention that, sorry. In my file2 there are several lines. For example, I want to count the occurrences of 'RM4'. The following lines can exist: 123456789 RM4 987654321 -> should be counted as 1. However, the occurrence in this line should not be counted: 12345 RM4.DLL 9876. So my aim was to enclose the search term in whitespace so that it is not followed by anything else :)
Great, I will try my best. Thank you for all your help; indeed, that should not be so difficult.
Oh okay, the first thing was easy. I understood your code; a very intelligent approach. I replaced $counts[$term] = $count with if ($count -gt 0) {$counts[$term] = $count}.
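
The two adjustments discussed in these comments (store only non-zero counts, and count a term only when it is delimited by whitespace) could be sketched roughly as follows. Note that this swaps the answer's .IndexOf() loop for [regex]::Matches(), which is a different technique from the one in the answer. The sample text is made up from the lines quoted in the comments.

```powershell
# Hypothetical sample data modelled on the lines quoted in the comments
$target = "123456789 RM4 987654321`n12345 RM4.DLL 9876"
$terms  = 'RM4', 'RMS'

$counts = @{}
foreach ($term in $terms) {
    # (?<!\S) and (?!\S): the term must not touch a non-whitespace character on either side
    $pattern = '(?<!\S)' + [regex]::Escape($term) + '(?!\S)'
    $count = [regex]::Matches($target, $pattern).Count
    # Only keep terms that were actually found
    if ($count -gt 0) { $counts[$term] = $count }
}
$counts
```

[regex]::Matches() is case sensitive unless an option such as [System.Text.RegularExpressions.RegexOptions]::IgnoreCase is passed as a third argument.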
2

Select-String is useful, but it isn't magic :)

Performance impact in mind, I would approach it like this:

  • For each line in File2:
    • Test for occurrences of all terms in File1

This way, you only need to read and evaluate File2 once:

# prepare hashtable to keep track of count
$count = @{}

# read terms to search for from file1
$termsToFind = Get-Content .\file1 |ForEach-Object {
  $_ -split ';' |Select -Last 1
}

# loop over lines in file2, count the words we're searching for
Get-Content .\test\file2 |ForEach-Object {
  foreach($term in $termsToFind){
    # Using `Regex.Matches()` will help us find multiple occurrences of the same term
    $count[$term] += [regex]::Matches($_,"\b$([regex]::Escape($term))\b").Count
  }
}

Now $count will be a hashtable where the key is the term from file1, and the value is the count of each word.

Output to the same format as file1 with:

$count.GetEnumerator() |ForEach-Object { $_.Value,$_.Key -join ';' } |Set-Content output.txt

11 Comments

nice, but you forgot, that the content of file1 is CSV-like 658;RMS where he only needs the second column.
@T-Me thanks for spotting it, completely forgot that part :)
Many thanks @MathiasR.Jessen. I tested the first part, where you read in file1, and this works perfectly fine. However, trying to implement the second part results in some errors. Does this occur because I am trying to fit the whole code in one line, as I cannot run external PowerShell scripts? My code looks like this:
powershell -command "& {$count = @{}; $termsToFind = Get-Content 'ModulID.txt' |ForEach-Object {$_ -split ';' |Select -Last 1}; Get-Content 'TlsTrace.prn' |ForEach-Object {foreach($term in $termsToFind){$count[$term] += [regex]::Matches($_,"\b$([regex]::Escape($term))\b").Count}}}" I changed the filenames to the real one, the other things are identical
I'd strongly suggest either putting the code in a .ps1 script file and then running powershell -file C:\path\to\script.ps1, or turning it into an encoded command
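
For the encoded-command route, a minimal sketch of how the Base64 string for powershell.exe's -EncodedCommand parameter can be produced is shown below. The one-line script is only a placeholder for the real counting code; -EncodedCommand expects the script text encoded as UTF-16LE before the Base64 step.

```powershell
# Hypothetical placeholder for the real counting script
$script = 'Write-Output "hello from the encoded command"'

# -EncodedCommand expects Base64 of the UTF-16LE bytes of the script text
$bytes   = [System.Text.Encoding]::Unicode.GetBytes($script)
$encoded = [Convert]::ToBase64String($bytes)

# Line that could be pasted into the .bat file
"powershell -NoProfile -EncodedCommand $encoded"
```

Because the script travels as a single Base64 token, the nested double quotes that break the one-line -command attempt no longer need escaping.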
1

If you check the docs, you'll see that you can't pipe the -Pattern value to Select-String. You can use parentheses to make the output of one command become the -Pattern argument:

powershell select-string -pattern (get-content file1) -path file2    

-Pattern can also be an array. And since -Pattern is positional parameter 0 and -Path is positional parameter 1, the parameter names can be omitted:

powershell select-string (get-content file1) file2  
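
Since the lines of file1 have the form 658;RMS, the raw lines would be searched for as they are, number included. A rough sketch of how the Select-String idea could be extended to use only the part after the semicolon and to produce a count per term is shown below. The sample lines are made up, and combining all terms into one alternation pattern is an assumption of this sketch, not something the answer above does.

```powershell
# Hypothetical sample data standing in for file1 and file2
$file1Lines = '658;RMS', '659;RM4'
$file2Lines = 'RMS started, RMS stopped', 'loading RM4.DLL', 'nothing here'

# Keep only the part after the semicolon and escape it for use in a regex
$terms   = $file1Lines | ForEach-Object { [regex]::Escape(($_ -split ';')[-1]) }
$pattern = '\b(' + ($terms -join '|') + ')\b'

# -AllMatches reports every hit on a line, not just the first one
$result = $file2Lines | Select-String -Pattern $pattern -AllMatches |
    ForEach-Object { $_.Matches } |
    Group-Object Value |
    Select-Object Name, Count
$result
```

With real files, Select-String -Path file2 would take the place of the piped sample lines. Note that \b treats the dot as a word boundary, so RM4 in RM4.DLL is counted here; Select-String and Group-Object are both case-insensitive by default.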
