Replacing specific Unicode characters in strings read from Excel

Question

I am attempting to replace some undesirable characters in a string retrieved from an Excel spreadsheet. The reason being that our Oracle database is using the WE8ISO8859P1 character set, which does not define several characters that Excel "helpfully" inserts for you in text (curly quotes, em and en dashes, etc.) Since I have no control over the database or how the Excel spreadsheets are created I need to replace the characters with something else.

I retrieve the cell contents into a string thus:

string s = xlRange.get_Range("A1", Missing.Value).Value2.ToString().Trim();

Viewing the string in Visual Studio's Text Visualiser shows the text to be complete and correctly retrieved. Next I try and replace one of the undesirable characters (in this case the right-hand curly quote symbol):

s = Regex.Replace(s, "\u0094", "\u0022");

But it does nothing (Text Visualiser shows it still to be there). To try and verify that the character I want to replace is actually in there, I tried:

bool a = s.Contains("\u0094");

but it returns false. However:

bool b = s.Contains("”");

returns true.

My (somewhat lacking) understanding of strings in .NET is that they're encoded in UTF-16, whereas Excel would probably be using ANSI. So does that mean I need to change the encoding of the text as it comes out of Excel? Or am I doing something else wrong here? Any advice would be greatly appreciated. I have read and re-read all articles I can find about Unicode and encoding but am still none the wiser.

Community · Accepted Answer · 2017-05-23 11:56:58Z

4

Yes strings in .Net are UTF-16.

You're doing it right; perhaps your hex-math is incorrect. The character you tested for isn't "\u0094" (Not sure that's what you meant). The following worked for me:

((int)"”"[0]).ToString("X") returns "201D"

"”" == "\u201D" returns true

"\u0094" == "" (right hand side is the empty string) returns false

A lot of UTF-16 characters will seem as an empty string by the text visualizer but they can either be an undisplayable character or part of a surrogate (i.e. Some characters may need to be typed "\UXXXXXXXX" while others you can do with (four digits) "\uXXXX".). My knowledge of this domain is very limited.

References - Jon Skeet's articles on:

edited May 23, 2017 at 11:56

CommunityBot

11 silver badge

answered Oct 3, 2011 at 7:11

Jonno

1,9762 gold badges21 silver badges21 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Sid Holland Over a year ago

Yes, you're quite right, my hex values were way off. Turns out I was using code points for the WIN1252 character set instead of UTF-16. One day I'll understand all of this (upon which it will be replaced by something vastly more complicated in incomprehensible). Thanks for the answer.

Mihai Nita · Accepted Answer · 2011-10-07 08:48:41Z

2

You can use NVARCHAR and NTEXT instead of VARCHAR and TEXT for the columns that need to accomodate those characters. That wayyou don't have to convert the whole database, and you are future proof, because the columns will be Unicode.

answered Oct 7, 2011 at 8:48

Mihai Nita

5,83730 silver badges27 bronze badges

1 Comment

Sid Holland Over a year ago

Yes, that would be ideal, but sadly I have no control over the database.

Collectives™ on Stack Overflow

Replacing specific Unicode characters in strings read from Excel

2 Answers 2

1 Comment

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related