
I am converting strings that contain weird symbols I don't want into Latin-1 (or at least, what Microsoft made of it) and back into a string. I use PowerShell, but this is only about the .NET methods:

    $bytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes($String)
    $String = [System.Text.Encoding]::GetEncoding(1252).GetString($bytes)

This mostly works, except that the weird symbols don't get removed; they are replaced with question marks instead. For example:

"Helloäöü?→"

becomes

"Helloäöü?????"

What I want is to only convert valid bytes, without creating question marks, so the output will be:

"Helloäöü?"

Is that possible? I searched a bit already, but couldn't find anything. ChatGPT lies to me and says there would be a "GetValidBytes" method, but there isn't...

2 Comments
  • What is the original $String value? At least one example would be helpful :) Commented Dec 12, 2022 at 17:25
  • Something like "Helloäöü?→" -replace '[^\x09-\x0D\x20-\xFF]' perhaps? Commented Dec 12, 2022 at 20:26

1 Answer


One option is to use a regex-based -replace operation based on named Unicode blocks:

    "Helloäöü€?→" -creplace '[^\p{IsBasicLatin}\p{IsLatin-1Supplement}–—€‚‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•˜™š›œžŸ]'

Given that your input already is a .NET string (and therefore composed of UTF-16 code units), there's no strict need for conversion to and from bytes:

  • \p{IsBasicLatin} and \p{IsLatin-1Supplement} match characters that fall into the ISO-8859-1 Unicode subrange, which is mostly the same as Windows-1252, but is missing a few characters.

  • The explicitly enumerated characters (€...) are those Windows-1252 characters not present in ISO-8859-1 (which therefore have different code points in Unicode than in Windows-1252, namely outside the 8-bit range).

    • – and — (en dash and em dash) are placed first, so that they aren't mistaken for describing a range of characters (the .NET regex engine apparently allows their interchangeable use with -, the regular ASCII-range hyphen).
    • ‚ (single low-9 quotation mark) is doubled in order to escape it, because PowerShell allows its interchangeable use with ' (single quote); see this answer, which summarizes all such interchangeable uses allowed in PowerShell.
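The point about differing code points can be checked directly. The following is a minimal sketch, not part of the original answer: it decodes each byte in the 0x80-0x9F range with code page 1252 and lists the ones whose Unicode code point falls outside the 8-bit range. The variable names ($enc, $extras) are made up for illustration, and the script is ASCII-only so that file encoding does not matter.

```powershell
# Sketch: list the Windows-1252 bytes whose Unicode code point lies outside
# the 8-bit range. Variable names are illustrative only.
$enc = [System.Text.Encoding]::GetEncoding(1252)

$extras = foreach ($b in 0x80..0x9F) {
  # Decode the single byte into a one-character string.
  $ch = $enc.GetString([byte[]] $b)
  $codePoint = [int] [char] $ch
  # Bytes that are undefined in Windows-1252 decode to same-valued control
  # characters (<= 0xFF), so they are skipped here.
  if ($codePoint -gt 0xFF) {
    '0x{0:X2} -> U+{1:X4}' -f $b, $codePoint
  }
}

$extras
```

The first line of output should be the euro sign's mapping, 0x80 -> U+20AC, which is why € has to be listed explicitly in the regex rather than being covered by the two named blocks.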

By replacing all non-matching (^) characters with the (implied) empty string, all non-Windows-1252 characters are effectively removed.

A general caveat:

  • Due to the use of literal non-ASCII-range characters in the command, be sure that PowerShell interprets your script file's character encoding correctly, which notably means using UTF-8 files with BOM for the benefit of Windows PowerShell - see this answer.
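If controlling the script file's encoding is not practical, one possible workaround is to keep the script ASCII-only and build the character class from \uXXXX escapes. This is a sketch under assumptions, not the original answer's code: it derives the extra characters from code page 1252 itself, uses the range \u0000-\u00FF in place of the two named blocks (which cover the same code points), and the names $enc, $escapes, $pattern and $sample are illustrative.

```powershell
# Sketch: an ASCII-only equivalent of the regex approach.
$enc = [System.Text.Encoding]::GetEncoding(1252)

# Decode bytes 0x80-0x9F and turn every resulting character into a \uXXXX
# regex escape, so no literal non-ASCII character appears in the script.
$escapes = -join (
  $enc.GetString([byte[]] (0x80..0x9F)).ToCharArray() |
    ForEach-Object { '\u{0:X4}' -f [int] $_ }
)

# \u0000-\u00FF spans the same code points as BasicLatin + Latin-1Supplement.
$pattern = '[^\u0000-\u00FF' + $escapes + ']'

# Sample input: 'Hello', a-umlaut, euro sign, rightwards arrow.
$sample = 'Hello' + [char] 0xE4 + [char] 0x20AC + [char] 0x2192
$sample -creplace $pattern   # the arrow is removed; the euro sign is kept
```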

However, your to-and-from-bytes encoding approach can be used with a slight adaptation, which works with any target encoding (without needing to enumerate individual characters, as above):

Using a System.Text.EncoderReplacementFallback instance initialized with the empty string effectively removes all characters that cannot be represented in the target encoding.

    $string = "Helloäöü€?→"

    $encoding = [System.Text.Encoding]::GetEncoding(
      1252,
      # Replace non-Windows-1252 chars. with '' (empty string), i.e. *remove* them.
      [System.Text.EncoderReplacementFallback]::new(''),
      [System.Text.DecoderFallback]::ExceptionFallback # not relevant here
    )

    $string = $encoding.GetString($encoding.GetBytes($string))
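Since this technique is independent of the target encoding, it could be wrapped in a small helper. The sketch below is an illustration only: the function name Remove-UnencodableChar and its parameters are hypothetical, and the sample string is built from [char] casts to keep the script ASCII-only.

```powershell
# Hypothetical helper (name and parameters are made up for illustration):
# removes every character that the given code page cannot represent.
function Remove-UnencodableChar {
  param(
    [Parameter(Mandatory)] [AllowEmptyString()] [string] $Text,
    [int] $CodePage = 1252
  )
  $encoding = [System.Text.Encoding]::GetEncoding(
    $CodePage,
    # An empty replacement string drops unencodable characters.
    [System.Text.EncoderReplacementFallback]::new(''),
    [System.Text.DecoderFallback]::ExceptionFallback
  )
  $encoding.GetString($encoding.GetBytes($Text))
}

# Sample input: 'Hello', a-umlaut, euro sign, rightwards arrow.
$sample = 'Hello' + [char] 0xE4 + [char] 0x20AC + [char] 0x2192

Remove-UnencodableChar -Text $sample                  # keeps everything but the arrow
Remove-UnencodableChar -Text $sample -CodePage 20127  # US-ASCII: only 'Hello' remains
```

Passing the code page as a parameter is just one possible design; the encoding's name or an Encoding instance would work as well.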

2 Comments

This is a much more elegant solution. Do you have a specific online resource to recommend that shows the differences between all those encodings? Also, thanks for upvoting, I can finally comment now :D
@MySurmise, re up-voting privilege :) Please see my update re covering all Windows-1252 characters. As for resources: Wikipedia is a good source; I've added a link to Wikipedia's Windows-1252 article to the answer.
