3

I'm facing an encoding issue for which I can't find the correct solution.

I have a C# TCP server, running as a Windows service, which receives and responds with XML. The problem appears when the output contains special characters, such as Spanish accented characters (á, é, í and others).

The server response is encoded as UTF-8, and the Java client reads it as UTF-8. But when I print the output, the character is completely different.

This problem only happens in the Java client (the C# TCP client works as expected).

The following snippet of the server code shows the encoding issue.

C# Server:

    // Encode the test character as UTF-8 and write the raw bytes to the client.
    byte[] destBytes = System.Text.Encoding.UTF8.GetBytes("á");
    try
    {
        clientStream.Write(destBytes, 0, destBytes.Length);
        clientStream.Flush();
    }
    catch (Exception ex)
    {
        LogErrorMessage("Error en SendResponseToClient: Detalle::", ex);
    }

Java Client:

    socket.connect(new InetSocketAddress(param.getServerIp(), param.getPort()), 20000);
    InputStream sockInp = socket.getInputStream();
    // Decode the incoming bytes explicitly as UTF-8.
    InputStreamReader streamReader = new InputStreamReader(sockInp, Charset.forName("UTF-8"));
    sockReader = new BufferedReader(streamReader);
    String tmp = null;
    // readLine() returns only at a line terminator or at end of stream.
    while ((tmp = sockReader.readLine()) != null) {
        System.out.println(tmp);
    }

For this simple test, the output shown is:

ß

I did some testing, printing the byte[] in each language. In C#, á is output as: 195, 161

In Java, the byte[] that was read prints as: -61, -95

Does this have to do with the byte type being signed in Java and unsigned in C#?
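For reference, the two printouts can be compared directly. The following is a minimal sketch (the class name `SignedBytesDemo` and the helper `toUnsigned` are made up for illustration, not part of the original code) that masks Java's signed bytes to the unsigned values C# shows and then decodes the same array as UTF-8. It suggests the bit patterns are identical and only the printed representation of the byte type differs.

```java
import java.nio.charset.StandardCharsets;

public class SignedBytesDemo {
    // Hypothetical helper: map Java's signed bytes (-128..127)
    // to the unsigned values (0..255) that C# prints.
    static int[] toUnsigned(byte[] bytes) {
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF; // keep only the low 8 bits
        }
        return out;
    }

    public static void main(String[] args) {
        // The values printed by the Java client in the question.
        byte[] received = { (byte) -61, (byte) -95 };

        int[] unsigned = toUnsigned(received);
        System.out.println(unsigned[0] + ", " + unsigned[1]); // 195, 161

        // Decoding the same bytes as UTF-8 yields U+00E1 (a with acute accent).
        String decoded = new String(received, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("\u00E1")); // true
    }
}
```

If this holds for the real data, the decoding step on the Java side is working, and the difference between 195, 161 and -61, -95 is only how each language prints a byte.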

Any feedback is greatly appreciated.

  • Not an answer, but a data point anyway: Python does decode the C# version as you intended: print ''.join(chr(x) for x in [195, 161]).decode('utf-8') -> á. The Java one is apparently not valid UTF-8 if I try to preserve that order. Commented Aug 28, 2011 at 0:59
  • Thanks, I'm still experimenting (no luck so far). Commented Aug 28, 2011 at 1:08
  • I made a mistake in the above example (I have already edited it). In Java the byte[] prints as: -61, -95, which is a valid UTF-8 character. The problem seems to lie in the OS (Windows) itself. I don't know what weird setting it has that makes it print the wrong character. Commented Aug 28, 2011 at 14:48

2 Answers

1

To me this seems like an endianness problem. You can check that by reversing the bytes in Java before printing the string.

This would usually be solved by including a BOM; see http://de.wikipedia.org/wiki/Byte_Order_Mark

6 Comments

I'm under the same impression, after reading about endianness in C# and Java.
If it's UTF-8, then a BOM is not needed and will not change anything. UTF-8 encoding always has the same representation on little-endian and big-endian machines (unicode.org/faq/utf_bom.html#bom5).
I think the problem may be in the OS where the server is running. A simple Java program that should print á prints the weird character there as well, while on another OS (Linux) it correctly prints the expected character. So I have ruled out the socket and the end-to-end encoding.
If the OS has some weird settings, this could happen :-(
Any suggestion on where I should look in the OS settings? Regional Settings?
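One way to narrow this down is sketched below. It is not a confirmed diagnosis, and the class name `ConsoleEncodingDemo` and helper `printedBytes` are made up for illustration. The sketch assumes that the Windows JVM's default charset encodes á as the single byte 0xE1 (ISO-8859-1 is used here as a stand-in because every JVM supports it) and that the console displays bytes with an OEM code page such as 850, where 0xE1 is ß. Under those assumptions, the ß in the question would come from `System.out` and the console disagreeing about the charset, not from the socket.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class ConsoleEncodingDemo {
    // Hypothetical helper: the bytes a PrintStream using the named
    // charset emits for the given text.
    static byte[] printedBytes(String text, String charsetName) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
            return sink.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "\u00E1"; // U+00E1, a with acute accent

        // The charset System.out typically uses when none is specified.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Assumed stand-in for the Windows default charset.
        byte[] latin1 = printedBytes(text, "ISO-8859-1");
        System.out.println("Single-byte encoding: " + (latin1[0] & 0xFF));

        // IBM850 is an optional charset, so check before using it.
        if (Charset.isSupported("IBM850")) {
            String shown = new String(latin1, Charset.forName("IBM850"));
            System.out.println("Console would show U+00DF: " + shown.equals("\u00DF"));
        }

        // Possible workaround: print through a stream with an explicit charset.
        // The console must be set to display that charset for this to help.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(text);
    }
}
```

Comparing the reported default charset with the console's active code page on the Windows machine would show whether this mismatch applies there.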
0

Are you sure that's not a Unicode character you are attempting to encode to bytes as UTF-8 data?

I found that the question below has a useful way of testing whether the data in that string is correct UTF-8 before you send it.

How to test an application for correct encoding (e.g. UTF-8)

1 Comment

I don't quite understand your statement. In my example above, I'm getting the UTF-8 byte[] of just á to test the encoding.