I created a PowerShell script with assistance from GitHub Copilot. It works well with ASCII characters, but when I try to search for UTF-8 characters, it doesn’t return any results. For example, when I set the $searchWord variable to "YANI" the script performs as expected; however, when I change it to "KOLİ" it fails to find a match. How can I ensure that the script searches using UTF-8 encoding when working with Word files?
# Define the directory to search and the word to search for
$OutputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8
$directoryPath = "D:\BAKIM_ARIZA_TAKIP_FORMU\2024\AGUSTOS_AYI"
$searchWord = "KOLİ"
# Load the Word application
$word = New-Object -ComObject Word.Application
$word.Visible = $false
# Get all .docx files in the directory
$docxFiles = Get-ChildItem -Path $directoryPath -Filter *.doc
foreach ($file in $docxFiles) {
    # Open the document
    $document = $word.Documents.Open($file.FullName)
    # Search for the word
    $found = $false
    foreach ($range in $document.StoryRanges) {
        if ($range.Text -match [System.Text.Encoding]::UTF8.GetString([System.Text.Encoding]::UTF8.GetBytes($searchWord))) {
            $found = $true
            break
        }
    }
    # Output the file name if the word is found
    if ($found) {
        Write-Output "Found '$searchWord' in file: $($file.FullName)"
    }
    # Close the document
    $document.Close()
}
# Quit the Word application
$word.Quit()
You can try passing an explicit encoding when opening the document: $unused = [Type]::Missing; $word.Documents.Open($file.FullName, $unused, $unused, $unused, $unused, $unused, $unused, $unused, $unused, $unused, 65001) (see MsoEncoding; the eleventh argument of Open is Encoding, and 65001 is the UTF-8 value). Or you can try letting Word find the text using its Find.Execute() method.

PS: Don't forget to release the COM object with ReleaseComObject after quitting Word.

[System.Text.Encoding]::UTF8.GetString([System.Text.Encoding]::UTF8.GetBytes($searchWord)) is an unnecessary no-op. Character encodings only matter with respect to serialized representations of strings, such as when reading from and writing to a file. A .NET string (System.String) instance, i.e. an in-memory string representation, is inherently capable of representing all Unicode characters (and is internally composed of UTF-16 code units).
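The points above can be sketched roughly as follows. This is only an illustration, not a tested fix for the original script: the function name Find-WordInDirectory and its parameters are made up for the example, it assumes Word is installed, and it searches only the main document body (Document.Content) rather than every story range. The search word is built from a code point so the example does not depend on how the script file itself is saved.

```powershell
# Build "KOLİ" from a code point (U+0130 is the dotted capital I).
$searchWord = "KOL" + [char]0x0130

# The UTF-8 round trip yields an identical string, which is why it is a no-op.
$utf8 = [System.Text.Encoding]::UTF8
$roundTripped = $utf8.GetString($utf8.GetBytes($searchWord))
$isSame = [string]::Equals($searchWord, $roundTripped, [System.StringComparison]::Ordinal)

# Hypothetical helper: lets Word do the search via Find.Execute().
function Find-WordInDirectory {
    param(
        [Parameter(Mandatory)] [string] $Directory,
        [Parameter(Mandatory)] [string] $Text
    )

    $word = New-Object -ComObject Word.Application
    $word.Visible = $false
    try {
        foreach ($file in Get-ChildItem -Path $Directory -Filter *.doc) {
            # Open(FileName, ConfirmConversions, ReadOnly): open read-only.
            $document = $word.Documents.Open($file.FullName, $false, $true)
            try {
                # Find.Execute returns $true when the text occurs in the range.
                if ($document.Content.Find.Execute($Text)) {
                    Write-Output "Found '$Text' in file: $($file.FullName)"
                }
            }
            finally {
                # 0 = wdDoNotSaveChanges
                $document.Close(0)
            }
        }
    }
    finally {
        # Quit Word, then release the COM reference held by this script.
        $word.Quit()
        [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($word)
    }
}
```

It could then be called as Find-WordInDirectory -Directory $directoryPath -Text $searchWord. Because the string is passed to Word as-is, no encoding conversion is involved on the PowerShell side.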