
I'm playing with bencoding and I would like to keep bencoded strings as Java strings, but they contain binary data, so blindly converting them to string will corrupt the data. What I am trying to accomplish is to have a conversion function that will keep the ASCII bytes as ASCII and encode non-ASCII chars in a reversible way.

I have found some examples of what I am trying to accomplish in Python, but I don't know enough Python to dig through them. This decoder does exactly what I would like to do: ASCII parts of the torrent stay as ASCII, but SHA-1 hashes are printed as "\xd8r\xe7". Though my Python knowledge is very limited, the author doesn't seem to be doing anything special to the string; is this handled by the Python interpreter? Can I accomplish the same in Java?

I have played with some encodings such as Base64 or using Integer.toHexString, but I get unreadable ASCII strings in the end.

I have also found a Scheme example that prints everything but the SHA-1 hashes.

2 Answers


Bencoded strings are byte strings. You can decode a byte string to Unicode code points in Java with new String(byte[] bytes, Charset charset). Decoding with certain encodings such as ISO-8859-1 always succeeds, since every byte maps directly to a code point. With many of these encodings (including ISO-8859-1) the process is also reversible: encoding the resulting string with the same charset gives back the original bytes.
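
As a minimal sketch of that round trip (the class name Latin1RoundTrip and the helper names decode and encode are made up for illustration), the following decodes all 256 possible byte values with ISO-8859-1, encodes the result again, and compares it with the input:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    // Decode so that each byte 0x00..0xFF becomes the char U+0000..U+00FF.
    public static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Reverse of decode; only lossless for strings whose chars are all <= U+00FF.
    public static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Build an array holding every possible byte value once.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        String s = decode(all);
        // Compare the re-encoded bytes with the original input.
        System.out.println(Arrays.equals(all, encode(s)));
    }
}
```

The string produced this way is safe to carry around, but it is only meaningful as text where the original bytes were ASCII; the binary parts show up as control characters and accented letters.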


5 Comments

Yes, that's what I'm doing right now, but bencoded strings contain binary data, not just text, at least in torrents. Building a regular string will corrupt the SHA-1 hashes.
Er, it shouldn't... as long as the codepoints cover the entire 0-255 byte range, nothing should change in the process.
It's a common mistake to believe that ISO-8859-1 does a 1:1-mapping for bytes in the range 0-255. ISO-8859-1 is undefined in the range 128-159, so trying to convert a byte in that range to a character will result in '?' as a best-fit representation of an unknown character.
@jarnbjo, ISO 8859-1 (the ISO standard) is the encoding that leaves some code points undefined; ISO-8859-1 (the IANA charset) defines all 256. The Wikipedia article has more details.
@Hamza Yerlikaya, no it won't. If you encode the String again with ISO-8859-1, the resulting bytes are the same. Or in code: Arrays.equals(bytes, new String(bytes, Charset.forName("ISO-8859-1")).getBytes("ISO-8859-1")) == true for any byte[] bytes.

If Wikipedia is accurate on Bencode, the format seems straightforward enough. Parse the byte data directly:

// 'in' must be an InputStream that supports mark/reset, e.g. a BufferedInputStream
while (true) {
  in.mark(1);
  int n = in.read();
  if (n < 0) {
    // end of input
    break;
  }
  in.reset();
  // char literals can be compared with the byte directly,
  // because ASCII values equal the corresponding UTF-16 values
  if (n == 'd') {
    // parse dictionary
  } else if (n == 'i') {
    // parse int
  } else if (n >= '0' && n <= '9') {
    // parse binary string
  } else if (n == 'l') {
    // parse list
  } else {
    throw new IOException("Invalid input");
  }
}
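
As a rough sketch of what the "parse binary string" branch could look like (the class name BencodeStrings and method name readByteString are hypothetical, not part of any library), this reads the decimal length, the ':' separator, and then exactly that many raw bytes without ever converting them to a String. It assumes the input is trusted enough that the declared length can be allocated directly:

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class BencodeStrings {
    // Reads one bencoded byte string "<length>:<bytes>" and returns the raw bytes.
    public static byte[] readByteString(InputStream in) throws IOException {
        int length = 0;
        int digits = 0;
        int c;
        // Accumulate the decimal length up to the ':' separator.
        while ((c = in.read()) != ':') {
            if (c < '0' || c > '9') {
                // Also covers end of input, where read() returns -1.
                throw new IOException("Invalid string length");
            }
            // The *Exact methods throw ArithmeticException on int overflow.
            length = Math.addExact(Math.multiplyExact(length, 10), c - '0');
            digits++;
        }
        if (digits == 0) {
            throw new IOException("Missing string length");
        }
        // read() may return fewer bytes than requested, so loop until full.
        byte[] data = new byte[length];
        int off = 0;
        while (off < length) {
            int n = in.read(data, off, length - off);
            if (n < 0) {
                throw new EOFException("Truncated string");
            }
            off += n;
        }
        return data;
    }
}
```

Calling it on a stream positioned at 4:spam would return the four bytes of "spam", leaving the stream positioned at the next bencoded value.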

Store the binary strings in a type that only converts them to ASCII when you do it explicitly, as in this toString call:

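import java.nio.charset.StandardCharsets;

public class ByteString {
  private final byte[] data;

  // defensive copies keep the wrapped bytes immutable
  public ByteString(byte[] data) { this.data = data.clone(); }
  public byte[] getData() { return data.clone(); }

  // for display only: non-ASCII bytes decode to U+FFFD, so this is not reversible
  @Override public String toString() {
    return new String(data, StandardCharsets.US_ASCII);
  }
}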
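
If a readable and reversible text form is wanted, in the spirit of the "\xd8r\xe7" output mentioned in the question, one possible approach is to escape the bytes by hand. The sketch below is an illustration only: the class name ByteEscapes and both method names are invented, and the format is not identical to Python's (for example, a newline becomes \x0a rather than \n).

```java
import java.io.ByteArrayOutputStream;

public class ByteEscapes {
    // Printable ASCII (0x20..0x7E) is kept, except backslash, which is doubled;
    // every other byte becomes \xNN with two lowercase hex digits.
    public static String escape(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length);
        for (byte b : data) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v == '\\') {
                sb.append("\\\\");
            } else if (v >= 0x20 && v <= 0x7E) {
                sb.append((char) v);
            } else {
                sb.append("\\x")
                  .append(Character.forDigit(v >> 4, 16))
                  .append(Character.forDigit(v & 0xF, 16));
            }
        }
        return sb.toString();
    }

    // Inverse of escape; assumes the input was produced by escape().
    public static byte[] unescape(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.write(c);
            } else if (s.charAt(i + 1) == '\\') {
                out.write('\\');
                i += 1; // skip the second backslash
            } else {
                // "\xNN": parse the two hex digits after the 'x'
                out.write(Integer.parseInt(s.substring(i + 2, i + 4), 16));
                i += 3; // skip 'x' and both digits
            }
        }
        return out.toByteArray();
    }
}
```

A ByteString.toString could delegate to escape(data) instead of decoding as US-ASCII, which keeps ASCII text readable while still allowing the original bytes to be recovered with unescape.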
