I have been learning Rust sporadically for a while now and decided to write some toy projects. While browsing https://github.com/codecrafters-io/build-your-own-x I came across some BitTorrent client examples. Since they contain many moving parts, such as parsing, networking, and multithreading, a client seemed like a good idea for a small toy project. I started with a decoder/parser for the bencoding format, since it is probably the simplest and most intuitive part.
I initially implemented it using nom, since I was familiar with it. Here is that implementation if you're curious: https://github.com/centaurwho/domenec/blob/prototype/domenec/src/bencode_nom.rs. Later I replaced it, since it was usually too slow and I wanted to play with raw bytes. After reimplementing it using the augmented BNF from https://hackage.haskell.org/package/bencoding-0.4.3.0/docs/Data-BEncode.html as a reference, I now have a working implementation.
The current parser has some basic error handling and many unit tests to confirm it works as intended. Since the official documentation of the format is not very good, I made assumptions about some details during implementation and later cross-checked them against this Python library's unit tests: https://github.com/fuzeman/bencode.py/tree/master/tests
Here's the code:
// Assumed import: it was omitted from the post. `LinkedHashMap` is taken here
// to be the type from the linked-hash-map crate.
use linked_hash_map::LinkedHashMap;

#[derive(Debug, Clone, Eq, PartialEq)]
pub enum DecodingError {
    Err,
    MissingIdentifier(char),
    KeyWithoutValue(String),
    StringWithoutLength,
    NotANumber,
    EndOfFile,
    NegativeZero,
}

type Result<T> = std::result::Result<T, DecodingError>;

#[derive(Debug, Eq, PartialEq)]
pub enum BEncodingType {
    Integer(i64),
    String(String),
    List(Vec<BEncodingType>),
    Dictionary(LinkedHashMap<String, BEncodingType>),
}
pub struct BDecoder<'a> {
    bytes: &'a [u8],
    cursor: usize,
}

impl BDecoder<'_> {
    fn new(bytes: &[u8]) -> BDecoder {
        BDecoder { bytes, cursor: 0 }
    }

    fn decode(&mut self) -> Result<BEncodingType> {
        self.parse_type()
    }

    fn parse_str(&mut self) -> Result<String> {
        let len = self.read_num().or(Err(DecodingError::StringWithoutLength))?;
        self.expect_char(b':')?;
        let start = self.cursor;
        let end = start + len as usize;
        if end > self.bytes.len() {
            self.cursor = self.bytes.len();
            return Err(DecodingError::EndOfFile);
        }
        self.cursor = end;
        Ok(String::from_utf8_lossy(&self.bytes[start..end]).to_string())
    }

    fn parse_int(&mut self) -> Result<i64> {
        self.expect_char(b'i')?;
        let i = self.read_num()?;
        self.expect_char(b'e')?;
        Ok(i)
    }

    fn parse_list(&mut self) -> Result<Vec<BEncodingType>> {
        self.expect_char(b'l')?;
        let mut list = Vec::new();
        while self.peek().filter(|&c| c != b'e').is_some() {
            list.push(self.parse_type()?);
        }
        self.expect_char(b'e')?;
        Ok(list)
    }

    fn parse_dict(&mut self) -> Result<LinkedHashMap<String, BEncodingType>> {
        self.expect_char(b'd')?;
        let mut dict = LinkedHashMap::new();
        while self.peek().filter(|&c| c != b'e').is_some() {
            let key = self.parse_str()?;
            let value = self.parse_type()
                .map_err(|_| DecodingError::KeyWithoutValue(key.clone()))?;
            dict.insert(key, value);
        }
        self.expect_char(b'e')?;
        Ok(dict)
    }

    fn parse_type(&mut self) -> Result<BEncodingType> {
        match self.peek() {
            None => Err(DecodingError::Err),
            Some(b'i') => self.parse_int().map(BEncodingType::Integer),
            Some(b'l') => self.parse_list().map(BEncodingType::List),
            Some(b'd') => self.parse_dict().map(BEncodingType::Dictionary),
            Some(_) => self.parse_str().map(BEncodingType::String)
        }
    }

    fn read_num(&mut self) -> Result<i64> {
        let mut neg_const = 1;
        if self.peek() == Some(b'-') {
            neg_const = -1;
            self.cursor += 1;
        }
        // FIXME: Consider a cleaner early return here, not happy with the catchall
        match self.peek() {
            None => Err(DecodingError::EndOfFile),
            Some(chr) if !chr.is_ascii_digit() => Err(DecodingError::NotANumber),
            Some(chr) if neg_const == -1 && chr == b'0' => Err(DecodingError::NegativeZero),
            _ => Ok(())
        }?;
        let mut acc = 0;
        while let Some(v) = self.peek() {
            if v.is_ascii_digit() {
                acc = acc * 10 + (v - b'0') as i64;
                self.cursor += 1;
            } else {
                break;
            }
        };
        Ok(acc * neg_const)
    }

    fn expect_char(&mut self, expected: u8) -> Result<u8> {
        match self.peek() {
            None => Err(DecodingError::EndOfFile),
            Some(chr) if chr == expected => self.advance(),
            _ => Err(DecodingError::MissingIdentifier(expected as char)),
        }
    }

    fn peek(&mut self) -> Option<u8> {
        self.bytes.get(self.cursor).cloned()
    }

    fn advance(&mut self) -> Result<u8> {
        let v = self.bytes.get(self.cursor).cloned();
        self.cursor += 1;
        v.ok_or(DecodingError::EndOfFile)
    }
}

pub fn decode(inp: &[u8]) -> Result<BEncodingType> {
    let mut parser = BDecoder::new(inp);
    parser.decode()
}
The code is pretty self-explanatory, but if you have any questions I will gladly answer them.
I know this is long and may not attract many reviews, but I didn't want to leave out any part, for completeness' sake. There are some points where I am not sure whether what I did is good practice or idiomatic Rust. Some things that irk me:
- I am not a fan of classes with one public method in other languages, so I considered not having a struct at all and passing the bytes from function to function.
- I considered using a bytes iterator instead of having `bytes` and `cursor` fields, but I am not sure it would improve the code at all.
- I am not sure the error handling is clean enough. In particular, is the number of custom error kinds in `DecodingError` necessary? I know that in most languages custom errors are usually not recommended, or having only one or two is considered enough.
- Also on error handling: in the function `read_num` I am using a `match` expression combined with `?` to return early from the function. I don't like creating `Ok(())` in the catch-all arm and throwing it away on the next line. I feel like there should be a better way.
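To make the `read_num` point concrete, one alternative I can think of is to let the guard arms `return` directly, so that no `Ok(())` has to be created. This is only a minimal sketch, written as a free function over a byte slice and a cursor so that it compiles on its own. The trimmed `NumError` enum, its `Overflow` variant, and the checked arithmetic are assumptions made for illustration and are not part of the code above.

```rust
/// Trimmed-down stand-in for `DecodingError` (hypothetical, for this sketch only).
#[derive(Debug, PartialEq, Eq)]
pub enum NumError {
    EndOfFile,
    NotANumber,
    NegativeZero,
    Overflow, // assumption: the code above does not detect overflow
}

/// Reads an optionally negative decimal number starting at `*cursor`
/// and leaves `*cursor` on the first byte after the digits.
pub fn read_num(bytes: &[u8], cursor: &mut usize) -> Result<i64, NumError> {
    let negative = bytes.get(*cursor) == Some(&b'-');
    if negative {
        *cursor += 1;
    }

    // Each guard arm returns early, so there is no `Ok(())` to throw away.
    match bytes.get(*cursor).copied() {
        None => return Err(NumError::EndOfFile),
        Some(c) if !c.is_ascii_digit() => return Err(NumError::NotANumber),
        Some(b'0') if negative => return Err(NumError::NegativeZero),
        Some(_) => {}
    }

    let mut acc: i64 = 0;
    while let Some(c) = bytes.get(*cursor).copied() {
        if !c.is_ascii_digit() {
            break;
        }
        let digit = i64::from(c - b'0');
        // Negative numbers are accumulated downwards so that i64::MIN fits.
        acc = acc
            .checked_mul(10)
            .and_then(|a| {
                if negative {
                    a.checked_sub(digit)
                } else {
                    a.checked_add(digit)
                }
            })
            .ok_or(NumError::Overflow)?;
        *cursor += 1;
    }
    Ok(acc)
}
```

With this shape, `read_num(b"42e", &mut cursor)` would yield `Ok(42)` and leave the cursor on the `e`, while `b"-0e"` is rejected in the guard before any digit is consumed. Whether the explicit `return`s read better than the `match ... ?` form is exactly the kind of thing I would like feedback on.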
These are just some questions I had while writing it, and I have probably missed others. I would really appreciate a review.
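For reference, the iterator idea from the list above could look roughly like this. It is a rough sketch under my own assumptions rather than code from the project: `ByteReader`, `expect`, and `take_exact` are hypothetical names, and a `Peekable` over the slice iterator replaces the `bytes` and `cursor` pair.

```rust
use std::iter::Peekable;
use std::slice::Iter;

/// Hypothetical iterator-based reader: the `Peekable` iterator stands in
/// for the `bytes` + `cursor` fields.
pub struct ByteReader<'a> {
    iter: Peekable<Iter<'a, u8>>,
}

impl<'a> ByteReader<'a> {
    pub fn new(bytes: &'a [u8]) -> Self {
        ByteReader { iter: bytes.iter().peekable() }
    }

    /// Looks at the next byte without consuming it.
    pub fn peek(&mut self) -> Option<u8> {
        self.iter.peek().map(|b| **b)
    }

    /// Consumes the next byte only if it equals `expected`;
    /// returns whether a byte was consumed.
    pub fn expect(&mut self, expected: u8) -> bool {
        self.iter.next_if(|&&b| b == expected).is_some()
    }

    /// Consumes up to `n` bytes and returns them only if exactly `n` were available.
    /// Note that this copies the bytes, unlike slicing with a cursor.
    pub fn take_exact(&mut self, n: usize) -> Option<Vec<u8>> {
        let out: Vec<u8> = self.iter.by_ref().take(n).copied().collect();
        if out.len() == n {
            Some(out)
        } else {
            None
        }
    }
}
```

One trade-off I noticed while sketching this: with a cursor, the string case can borrow `&bytes[start..end]` directly, whereas the iterator version has to collect the bytes one by one. That is part of why I am unsure the iterator would be an improvement.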