How to search for a UTF-8 string in Word files using PowerShell [duplicate]

Question

I created a PowerShell script with assistance from GitHub Copilot. It works well with ASCII characters, but when I search for a word containing non-ASCII characters, it doesn’t return any results. For example, when I set the $searchWord variable to "YANI" the script performs as expected; however, when I change it to "KOLİ" it fails to find a match. How can I ensure that the script searches using UTF-8 encoding when working with Word files?

# Define the directory to search and the word to search for
$OutputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8
$directoryPath = "D:\BAKIM_ARIZA_TAKIP_FORMU\2024\AGUSTOS_AYI"
$searchWord = "KOLİ"

# Load the Word application
$word = New-Object -ComObject Word.Application
$word.Visible = $false

# Get all .docx files in the directory
$docxFiles = Get-ChildItem -Path $directoryPath -Filter *.doc

foreach ($file in $docxFiles) {
    # Open the document
    $document = $word.Documents.Open($file.FullName)
    
    # Search for the word
    $found = $false
    foreach ($range in $document.StoryRanges) {
        if ($range.Text -match [System.Text.Encoding]::UTF8.GetString([System.Text.Encoding]::UTF8.GetBytes($searchWord))) {
            $found = $true
            break
        }
    }
    
    # Output the file name if the word is found
    if ($found) {
        Write-Output "Found '$searchWord' in file: $($file.FullName)"
    }
    
    # Close the document
    $document.Close()
}

# Quit the Word application
$word.Quit()

You could try opening the document as UTF-8: $unused = [Type]::Missing; $word.Documents.Open($file.FullName, $unused, $unused, $unused, $unused, $unused, $unused, $unused, $unused, $unused, 65001) (65001 is the UTF-8 value of MsoEncoding). Alternatively, you can let Word find the text itself using its Find.Execute() method. PS: Don't forget to release the COM objects with ReleaseComObject after quitting Word.
– Theo, Commented Feb 19 at 10:14
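The Find.Execute() suggestion in the comment above could look roughly like the following. This is only a sketch: Test-WordDocumentContains is a hypothetical helper name, it assumes Word is installed and that $WordApp is an already created Word.Application COM object, and it searches only the main text story (Content), not headers, footers, or text boxes.

```powershell
function Test-WordDocumentContains {
    param(
        [Parameter(Mandatory)] $WordApp,         # an existing Word.Application COM object
        [Parameter(Mandatory)] [string] $Path,   # full path of the document to inspect
        [Parameter(Mandatory)] [string] $Text    # text to look for
    )

    # Open(FileName, ConfirmConversions, ReadOnly): open read-only, no conversion prompt.
    $document = $WordApp.Documents.Open($Path, $false, $true)
    try {
        # Find.Execute returns $true when the text occurs in the range.
        return [bool]$document.Content.Find.Execute($Text)
    }
    finally {
        # Close the document and release its COM reference even if the search fails.
        $document.Close()
        [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($document)
    }
}
```

Inside the original loop this would be called as Test-WordDocumentContains -WordApp $word -Path $file.FullName -Text $searchWord. After $word.Quit(), the application object can be released the same way with [System.Runtime.InteropServices.Marshal]::ReleaseComObject($word).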
Why are you using GetBytes on a string instead of bytes? $searchWord is already a string, not a byte array. I think the culture is set wrong; its code page determines how the characters 0x80 to 0xFF are interpreted. See learn.microsoft.com/en-us/powershell/module/international/… Remove the GetBytes call and go back to plain string comparisons.
– jdweng, Commented Feb 19 at 13:32
Since the problem occurs with string literals in your source code, the likeliest explanation is that your script file is misinterpreted by the Windows PowerShell engine, which happens if the script is saved as UTF-8 without a BOM (this is no longer a problem in PowerShell (Core) 7). Try saving your script as UTF-8 with BOM. See the linked duplicate for details.
– mklement0, Commented Feb 20 at 1:40
As an aside: [System.Text.Encoding]::UTF8.GetString([System.Text.Encoding]::UTF8.GetBytes($searchWord)) is an unnecessary no-op. Character encodings only matter with respect to serialized representations of strings, such as when reading from and writing to a file. A .NET string (System.String) instance, i.e. an in-memory string representation, is inherently capable of representing all Unicode characters (and is internally composed of UTF-16 code units).
– mklement0, Commented Feb 20 at 1:48

burnie · Accepted Answer · 2025-02-19 18:12:36Z

You have to re-save your PowerShell script as UTF-8 with BOM; otherwise the Windows PowerShell engine will misinterpret any non-ASCII characters (such as İ) in the script.
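Most editors can do this from their "Save with encoding" dialog. If you would rather do it from PowerShell, here is a minimal sketch. Set-Utf8Bom is a hypothetical helper name, and it assumes the file is currently valid UTF-8 (with or without a BOM); a file saved in an ANSI code page would need a different source encoding.

```powershell
function Set-Utf8Bom {
    param([Parameter(Mandatory)] [string] $Path)

    # .NET file APIs need a full path; they do not follow PowerShell's current location.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    # Decode as UTF-8. An existing BOM is detected and kept out of the text,
    # so running this twice does not produce a double BOM.
    $text = [System.IO.File]::ReadAllText($fullPath, (New-Object System.Text.UTF8Encoding $false))

    # Passing $true makes the encoding emit the BOM (bytes EF BB BF) when writing.
    [System.IO.File]::WriteAllText($fullPath, $text, (New-Object System.Text.UTF8Encoding $true))
}
```

Usage would be something like Set-Utf8Bom -Path .\YourScript.ps1, where the file name is a placeholder for your own script.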

If you need to use non-Ascii characters in your scripts, save them as UTF-8 with BOM. Without the BOM, Windows PowerShell misinterprets your script as being encoded in the legacy "ANSI" codepage. Conversely, files that do have the UTF-8 BOM can be problematic on Unix-like platforms. Many Unix tools such as cat, sed, awk, and some editors such as gedit don't know how to treat the BOM.

Source reference: https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_character_encoding
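To see why the match fails, it may help to reproduce the misreading by hand. The sketch below takes the UTF-8 bytes that a BOM-less script file would contain for the search word and decodes them the way Windows PowerShell would. Windows-1252 is assumed here as a stand-in for the ANSI code page; the real one depends on the system locale (a Turkish system typically uses 1254, which maps these particular bytes the same way).

```powershell
# Build the word from a code point so this snippet is pure ASCII and cannot
# itself be misread, whatever encoding it is saved in.
$searchWord = 'KOL' + [char]0x0130    # U+0130 = LATIN CAPITAL LETTER I WITH DOT ABOVE

# The bytes a UTF-8 script file holds for that literal: 4B 4F 4C C4 B0.
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($searchWord)

# Assumption: code page 1252 stands in for the system's ANSI code page.
$ansi = [System.Text.Encoding]::GetEncoding(1252)
$misread = $ansi.GetString($utf8Bytes)

# The single character U+0130 has turned into two characters.
'Intended: {0} chars, misread: {1} chars' -f $searchWord.Length, $misread.Length
# -> Intended: 4 chars, misread: 5 chars

($misread.ToCharArray() | ForEach-Object { 'U+{0:X4}' -f [int]$_ }) -join ' '
# -> U+004B U+004F U+004C U+00C4 U+00B0
```

The script therefore ends up searching for a five-character string that never occurs in the document, which is why nothing is found.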

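By the way, there is no need to explicitly set [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8 and no need to encode the string to bytes. You can simply use $range.Text -match $searchWord instead. Keep in mind that -match treats $searchWord as a regular expression, which is fine for a plain word like this one.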
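A small sketch of the simplified comparison, runnable without Word: $sampleText is a hypothetical stand-in for $range.Text, and the search word is again built from a code point so the snippet does not depend on how it is saved.

```powershell
$searchWord = 'KOL' + [char]0x0130                  # the search word from the question
$sampleText = 'ARIZA: ' + $searchWord + ' BANDI'    # hypothetical stand-in for $range.Text

# The encode/decode round trip hands back an identical string, so it can be dropped.
$roundTripped = [System.Text.Encoding]::UTF8.GetString(
    [System.Text.Encoding]::UTF8.GetBytes($searchWord))
$roundTripped -ceq $searchWord                      # True

# -match interprets its right-hand side as a regex; escaping makes the search literal.
$sampleText -match [regex]::Escape($searchWord)     # True
```

[regex]::Escape only matters when the search text may contain characters such as . or (, but it is a cheap safeguard if the word comes from user input.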
