bencoding binary data in Java strings

Question

I'm playing with bencoding and would like to keep bencoded strings as Java strings, but they contain binary data, so blindly converting them to String corrupts the data. What I want is a conversion function that keeps the ASCII bytes as ASCII and encodes non-ASCII bytes in a reversible way.

I have found some examples of what I am trying to accomplish in Python, but I don't know enough Python to dig through them. This decoder does exactly what I would like to do: ASCII parts of the torrent stay as ASCII, but SHA-1 hashes are printed as "\xd8r\xe7". My Python knowledge is very limited, but the author doesn't seem to be doing anything special to the string. Is this handled by the Python interpreter? Can I accomplish the same in Java?

I have tried encodings such as Base64 and Integer.toHexString, but both leave me with unreadable strings, even for the ASCII parts.

I have also found a Scheme example that prints everything but the SHA-1 hashes.

Tuure Laurinolli · Accepted Answer · 2009-11-02 22:31:40Z

2

Bencoded strings are byte strings. You can attempt to decode a byte string to Unicode code points in Java with new String(byte[] bytes, Charset charset). Decoding with certain encodings such as ISO-8859-1 will always succeed, since every byte maps directly to a code point. With many of these encodings (including ISO-8859-1) the process is also reversible: encoding the resulting String with the same charset gives back the original bytes.
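A minimal sketch of that round trip, assuming Java 7 or later for StandardCharsets. The class and method names here are hypothetical and exist only for illustration:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class Latin1RoundTrip {
    // Decodes each byte as the code point with the same value (0-255).
    static String toLatin1String(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Inverse of toLatin1String: each char below 256 becomes the byte with that value.
    // Chars above 255 cannot be encoded and would be replaced, so only pass
    // strings that came from toLatin1String.
    static byte[] fromLatin1String(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

The resulting String has one char per input byte, so a 20-byte SHA-1 hash becomes a 20-char String. Many of those chars are control or accented characters, so the String is safe to store but not pleasant to print.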


5 Comments

Hamza Yerlikaya Over a year ago

Yes, that's what I'm doing right now, but bencoded strings contain binary data, not just text, at least in torrents. Building a regular String will corrupt the SHA-1 hashes.

Amber Over a year ago

Er, it shouldn't... as long as the codepoints cover the entire 0-255 byte range, nothing should change in the process.

jarnbjo Over a year ago

It's a common mistake to believe that ISO-8859-1 does a 1:1-mapping for bytes in the range 0-255. ISO-8859-1 is undefined in the range 128-159, so trying to convert a byte in that range to a character will result in '?' as a best-fit representation of an unknown character.

Tuure Laurinolli Over a year ago

@jarnbjo, ISO 8859-1 (the ISO standard) is the one that leaves some code points undefined; ISO-8859-1 (the IANA charset name, with the extra hyphen) defines all 256. The Wikipedia article has more details.

Tuure Laurinolli Over a year ago

@Hamza Yerlikaya, no it won't. If you encode the String again with ISO-8859-1, the resulting bytes are the same. Or in code: Arrays.equals(bytes, new String(bytes, Charset.forName("ISO-8859-1")).getBytes("ISO-8859-1")) == true for any byte[] bytes.

McDowell · Accepted Answer · 2009-11-02 23:11:21Z

If Wikipedia is accurate on Bencode, the format seems straightforward enough. Parse the byte data directly:

// "in" must support mark/reset, e.g. a BufferedInputStream
while (true) {
  // peek at the next byte without consuming it
  in.mark(1);
  int n = in.read();
  if (n < 0) {
    // end of input
    break;
  }
  in.reset();
  // char literals such as 'd' have the same numeric value as the ASCII byte
  if (n == 'd') {
    // parse dictionary
  } else if (n == 'i') {
    // parse int
  } else if (n >= '0' && n <= '9') {
    // parse binary string (starts with its decimal length)
  } else if (n == 'l') {
    // parse list
  } else {
    throw new IOException("Invalid input");
  }
}
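As an illustration of the "parse binary string" branch, here is a rough sketch that reads one bencoded string, which has the form of a decimal length, a colon, and then that many raw bytes (for example 4:spam). The class and method names are hypothetical, and the sketch trusts the declared length, which real code reading untrusted input should cap:

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of any library.
final class BencodeStrings {
    // Reads "<decimal length>:<raw bytes>" and returns the raw bytes untouched.
    static byte[] readByteString(InputStream in) throws IOException {
        int length = 0;
        int digits = 0;
        int c;
        while ((c = in.read()) != ':') {
            // Anything other than a digit is an error; this also covers end of input (-1).
            if (c < '0' || c > '9') {
                throw new IOException("Invalid string length");
            }
            if (length > (Integer.MAX_VALUE - 9) / 10) {
                throw new IOException("String length too large");
            }
            length = length * 10 + (c - '0');
            digits++;
        }
        if (digits == 0) {
            throw new IOException("Missing string length");
        }
        // Assumption: the declared length is trusted.
        byte[] data = new byte[length];
        int off = 0;
        while (off < length) {
            int r = in.read(data, off, length - off);
            if (r < 0) {
                throw new EOFException("Truncated string");
            }
            off += r;
        }
        return data;
    }
}
```

Because the payload is copied as bytes and never passed through a charset, a SHA-1 hash inside a torrent comes out exactly as it went in.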

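Store the binary strings in a type that converts them to text only when you ask for it explicitly, as in this toString call. Bytes outside the ASCII range are replaced during decoding, so the resulting String is for display only; getData() remains the source of truth: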

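import java.nio.charset.Charset;

public class ByteString {
  private final byte[] data;

  // defensive copies keep the wrapped bytes immutable
  public ByteString(byte[] data) { this.data = data.clone(); }
  public byte[] getData() { return data.clone(); }

  // lossy, display-only view: non-ASCII bytes are replaced
  @Override public String toString() {
    return new String(data, Charset.forName("US-ASCII"));
  }
}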
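If a readable, reversible rendering like the "\xd8r\xe7" output mentioned in the question is wanted, a type such as ByteString could expose an escaped form instead of a lossy one. The following is only a sketch with hypothetical names. It resembles Python's output but is not identical to it: the backslash is written as \x5c, and tab and newline as \x09 and \x0a.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of any library.
final class ByteEscapes {
    // Keeps printable ASCII (except backslash) as is; every other byte becomes \xNN.
    static String escape(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length);
        for (byte b : data) {
            int v = b & 0xff; // treat the byte as unsigned
            if (v >= 0x20 && v < 0x7f && v != '\\') {
                sb.append((char) v);
            } else {
                sb.append(String.format("\\x%02x", v));
            }
        }
        return sb.toString();
    }

    // Inverse of escape(). Assumes the input was produced by escape(),
    // so every backslash starts a four-character \xNN sequence.
    static byte[] unescape(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                out.write(Integer.parseInt(s.substring(i + 2, i + 4), 16));
                i += 3; // skip the rest of the \xNN sequence
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }
}
```

With this sketch, escape applied to the bytes 0xd8, 'r', 0xe7 yields the text \xd8r\xe7, and unescape turns that text back into the same three bytes.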
