My Java program (or rather, a part of it) sends a request to a web service and receives RDF strings that include ancient Greek words in Unicode. I wrote the program in NetBeans, and until recently there were no problems at run time, either inside NetBeans or as a standalone jar under Linux and Windows XP. Now, all of a sudden, the Greek words in the RDF come back garbled like this:

á¼€

At first, I thought this was a Windows XP problem, but when checking under Windows 7 the problem persisted. I found out that I was running OpenJDK under Linux, and was since able to reproduce the issue using Oracle Java. This is the relevant code (of course, I may have tunnel vision, so please tell me if you need more):

try {
        HttpClient client = new DefaultHttpClient();
        HttpGet get;
        get = new HttpGet(URL+URLEncoder.encode(form, "UTF-8"));

        HttpResponse response = client.execute(get);
        if (200 == response.getStatusLine().getStatusCode()) {
            HttpEntity respEnt = response.getEntity();
            BufferedReader reader = new BufferedReader(new InputStreamReader(respEnt.getContent()));
            StringBuilder sb = new StringBuilder();
            char[] cbuffer = new char[256];
            int read;

            while ((read = reader.read(cbuffer)) != -1) {
                sb.append(cbuffer,0,read);
            }
            //System.out.println(sb.toString());
            rdf = new String(sb.toString().getBytes("UTF-8"),"UTF-8");

        } else {
            System.err.println("HTTP request failed.");
        }         

    } catch (IOException e) {
        System.err.println("Problem with the HTTP request.");
    }

The web service is the Perseus morphology service; it can be found here: http://services.perseids.org/bsp/morphologyservice/analysis/word?lang=grc&engine=morpheusgrc&word=. Try "word=μῆνιν", for example. I really don't know how or when the RDF is generated.

I would be very grateful for further insights!

  • When did you notice this occurring? I thought XP hasn't gotten any updates recently, save for that one IE security update. Does it happen with other versions of Windows? Commented May 12, 2014 at 8:50
  • I first noticed it on April 4th, so just before the end of the XP lifecycle. I have been trying to solve this on my own for the last month or so. Unfortunately, I do not have other versions of Windows at hand! Commented May 12, 2014 at 9:04
  • Ah, I see. Have you tried messing with the encodings and seeing if you can get something other than gibberish? And have you verified that the bytes you're getting back from the server are the same on all the machines you tried? Also, if you're willing to trust an internet stranger I could try your code on Windows 7 sometime later today too. Commented May 12, 2014 at 9:11
  • I do get other things than gibberish, the rest of the rdf string is just fine. How would I go about verifying the bytes? Do you think they change depending on from where they are called? Thanks for offering to try the code, however, I'll be able to check out a Windows 7 machine tomorrow with a colleague! Commented May 12, 2014 at 9:46
  • Oh, so it's just the Greek that's messed up? You might be able to verify bytes by using a BufferedInputStream and the [read()](docs.oracle.com/javase/7/docs/api/java/io/…, int, int)) to read into a byte array, which you can then print to see exactly what you're getting. I'm not really sure whether what you get back will change depending on the machine you get a response from, but this way you can at least be sure that if the message you receive is the same, then it's not the server doing something wonky. I wouldn't call it a necessary step though. Commented May 12, 2014 at 9:57
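
The byte check suggested in the last comment could look roughly like the sketch below. It is a minimal illustration, not the asker's code: the class `ByteDump` and its methods are hypothetical names, and a `ByteArrayInputStream` stands in for `respEnt.getContent()` so the example runs without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name and methods are illustrative only.
class ByteDump {

    // Reads the stream to its end and returns the raw bytes, with no charset decoding.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        int read;
        while ((read = in.read(buffer, 0, buffer.length)) != -1) {
            out.write(buffer, 0, read);
        }
        return out.toByteArray();
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for respEnt.getContent(): the UTF-8 bytes of U+1F00.
        InputStream in = new ByteArrayInputStream("\u1F00".getBytes(StandardCharsets.UTF_8));
        System.out.println(toHex(readAll(in)));
    }
}
```

If the hex dump of the real response is identical on every machine, the server is sending the same bytes everywhere and the difference has to come from how the client decodes them.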

1 Answer

Make sure the encoding of your strings is consistent from client to server and back again. In your case, of course, the server's response (the RDF strings) is most important: it is encoded server-side and decoded in your client code.

One thing concerning the client code you posted: you are using the one-argument constructor of InputStreamReader in this line:

BufferedReader reader = new BufferedReader(new InputStreamReader(respEnt.getContent()));

It will read from the input stream using the VM's (and system's) default charset, so the outcome will depend on the machine/VM you are running your client application on. Try explicitly setting the charset using this constructor:

new InputStreamReader(respEnt.getContent(), "UTF-8")

See also API-doc.
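
To illustrate why the charset matters, here is a minimal, self-contained sketch. The class `CharsetDemo` and its `decode` method are hypothetical names, and the example assumes the JVM supports the `windows-1252` charset (most desktop JVMs do). It decodes the UTF-8 bytes of U+1F00 (ἀ) once as UTF-8 and once as windows-1252, and compares the second result with the garbled sequence from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the asker's program.
class CharsetDemo {

    // Decodes the bytes through an InputStreamReader with an explicitly named charset.
    static String decode(byte[] bytes, String charsetName) throws IOException {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charsetName);
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[256];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            sb.append(buffer, 0, read);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // U+1F00 (Greek small letter alpha with psili) as the server would send it in UTF-8.
        byte[] utf8 = "\u1F00".getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset restores the original character.
        System.out.println(decode(utf8, "UTF-8").equals("\u1F00"));

        // Decoding with windows-1252 (assumed available) turns each byte into its own character.
        System.out.println(decode(utf8, "windows-1252").equals("\u00E1\u00BC\u20AC"));
    }
}
```

Both comparisons should print true. The second string, U+00E1 U+00BC U+20AC, is exactly the "á¼€" shown in the question, which is consistent with UTF-8 bytes being decoded with a Windows default charset.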

Search your code for more uses of the one-argument constructors of both InputStreamReader and OutputStreamWriter, which also use the default encoding.

If you have no control over the server code (the web service implementation), you can try to find out the charset of the response like this:

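Header contentType = response.getFirstHeader("Content-Type");
// The full header value, e.g. "text/html; charset=UTF-8"; the charset is its "charset" parameter
String contentTypeValue = contentType.getValue();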

(This is from the Apache HttpClient API you seem to be using.) See also this question on SO.
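
The Content-Type header value usually carries the media type as well as the charset, so the charset still has to be extracted from it. The following is a simplified sketch using only the standard library. The class `ContentTypeCharset` and the method `charsetOf` are hypothetical names, and the parsing deliberately ignores less common header syntax such as whitespace around the equals sign.

```java
import java.util.Locale;

// Hypothetical helper; a simplified parser, not a complete Content-Type implementation.
class ContentTypeCharset {

    // Returns the charset parameter of a Content-Type value, or the fallback if none is found.
    static String charsetOf(String contentTypeValue, String fallback) {
        if (contentTypeValue == null) {
            return fallback;
        }
        String[] parts = contentTypeValue.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            if (part.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = part.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/rdf+xml; charset=UTF-8", "ISO-8859-1"));
        System.out.println(charsetOf("text/html", "UTF-8"));
    }
}
```

The result could then be passed as the second argument to the InputStreamReader constructor. Note that getFirstHeader returns null when the response has no such header, so the header object should be checked before calling getValue() on it.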

6 Comments

I hadn't thought about that... Does the default charset differ between OpenJDK and Oracle's implementation though?
@user3580294 the locale can be set by the user on each computer, this locale will determine the default charset used by the system.
@RossBille I'm aware of that, but considering the issues that OP is having, I'm not totally sure that that is the problem. Seems like the program used to work on Windows until something broke it some time ago, and from what I know Windows doesn't randomly change its default charset. Not to mention OP's comment where the program's output changed from run to run, but that could be an anomaly...
@user3580294 changing the JDK impl is potentially affecting many things on different "levels" I guess. At first glance both impls seem to pick up the charset and default encoding from the OS and can be influenced by system properties e.g. "-Dfile.encoding".
@ithofm I suppose... Guess we'll have to see whether it solves OP's problem. There are still a few things that I find strange, like how different runs of the program seem to give different results, that make me want to think that encoding might not be the issue (or not the only issue)...
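
To compare machines as discussed in these comments, a small check like the one below can be run on each of them. It is only a sketch with a hypothetical class name, `DefaultCharsetCheck`. It prints the default charset the JVM actually uses and the value of the file.encoding system property mentioned above.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class for comparing JVM defaults across machines.
class DefaultCharsetCheck {

    // Describes the charset that one-argument reader/writer constructors will fall back to.
    static String describe() {
        return "default charset: " + Charset.defaultCharset().name()
                + ", file.encoding: " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the output differs between the Linux and Windows machines, that would explain why the same jar decodes the same response differently. Running the program with -Dfile.encoding=UTF-8 is a possible workaround, although passing the charset explicitly in the code is more robust.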