6

I have an array of raw bytes which I need to tokenize into a list of byte arrays in Java. The following method declaration shows what I mean.

public static List<byte[]> splitMessage(byte[] rawByte, String tokenDelimiter)

Example runs.

Example Run 1:

Raw byte

byte[] rawBytes = new byte[]{72,118,121,49,85,118,97,113,111,124,44,124,49,48,43,57,48,36,63,49,66,70,22,18,124,44,124,23,27,25,54,24,24,34,44,57,69,66,49,47,66,16,39,35,32,36,30,50,63,124,44,124,16,18,24,64,4,94,124,44,124,19,31,42,55,66,46,34,62,34,37};

tokenDelimiter is |,| (i.e. the bytes 124,44,124)

So the List returned is as:

Token 1: 72,118,121,49,85,118,97,113,111
Token 2: 49,48,43,57,48,36,63,49,66,70,22,18
Token 3: 23,27,25,54,24,24,34,44,57,69,66,49,47,66,16,39,35,32,36,30,50,63
Token 4: 16,18,24,64,4,94
Token 5: 19,31,42,55,66,46,34,62,34,37

Example Run 2:

byte[] rawBytes = new byte[]{72,118,121,49,85,118,97,113,111,124,44,124,49,48,43,57,48,36,63,49,66,70,22,18,124,44,124,124,44,124,23,27,25,54,24,24,34,44,57,69,66,49,47,66,16,39,35,32,36,30,50,63,124,44,124,16,18,24,64,4,94,124,44,124,19,31,42,55,66,46,34,62,34,37,124,44,124,124,44,124};

tokenDelimiter is |,| (i.e. the bytes 124,44,124)

Token 1: 72,118,121,49,85,118,97,113,111
Token 2: 49,48,43,57,48,36,63,49,66,70,22,18
Token 3: <Empty>
Token 4: 23,27,25,54,24,24,34,44,57,69,66,49,47,66,16,39,35,32,36,30,50,63
Token 5: 16,18,24,64,4,94
Token 6: 19,31,42,55,66,46,34,62,34,37
Token 7: <Empty>
Token 8: <Empty>

I am able to achieve the first example run with the following code snippet, but I am stuck on the empty tokens in the second one.

public static List<byte[]> splitMessageSept19(byte[] rawByte, String tokenDelimiter) throws UnsupportedEncodingException
{
    List<byte[]> tokens = new ArrayList<byte[]>();

    final byte[] byteArray = tokenDelimiter.getBytes("UTF-8");
    final byte byteDelimitorFirstByte  = byteArray[0];

    int bytenum =0 ;
    int lastIndex = 0;
    int storIterator =0;
    for ( int iterator = 0 ; iterator <= rawByte.length ; iterator++ )
    {
        if (iterator == rawByte.length || rawByte[iterator] == byteDelimitorFirstByte)
        {
            storIterator = iterator;
            if ( iterator != rawByte.length )
            {
                for ( int i=0 ; i < byteArray.length ; i++ )
                {
                    if ( rawByte[iterator] == byteArray[i] )
                    {
                        iterator++ ;
                        continue;
                    }
                    else
                    {
                        break;
                    }
                }
            }
            byte[] byteArrayExtracted = new byte[storIterator - lastIndex];
            System.arraycopy(rawByte, lastIndex, byteArrayExtracted, 0, 
                             storIterator - lastIndex);
            lastIndex = iterator ;
            tokens.add(byteArrayExtracted);
            byteArrayExtracted = null;
        }
    }
    for ( byte[] bytetoken : tokens )
    {
        System.out.println("Token received is: " + new String(bytetoken, "UTF-8"));
    }
    return tokens;
}

Has anyone faced a similar problem of tokenizing arrays? Please suggest if there is some other way to tokenize arrays.

Please note: I don't want to convert the byte stream to a String, tokenize it in String form and convert the result back to bytes. That may have its own encoding problems.

1
  • Why not just skip the empty tokens in your code? Commented Sep 19, 2012 at 10:36

2 Answers

3

If you decode and encode with ISO-8859-1, the bytes are preserved exactly as they were originally.

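import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

private static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");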

public static List<byte[]> splitMessageSept19(byte[] rawByte, String tokenDelimiter) {
    Pattern pattern = Pattern.compile(tokenDelimiter, Pattern.LITERAL);
    String[] parts = pattern.split(new String(rawByte, ISO_8859_1), -1);
    List<byte[]> ret = new ArrayList<byte[]>();
    for (String part : parts) 
        ret.add(part.getBytes(ISO_8859_1));
    return ret;
}

public static void main(String... args) {
    StringBuilder sb = new StringBuilder();
    for(int i=0;i<256;i++)
        sb.append((char) i);
    byte[] bytes = sb.toString().getBytes(ISO_8859_1);
    List<byte[]> list = splitMessageSept19(bytes, ",");
    for (byte[] b : list) 
        System.out.println(Arrays.toString(b));
}

prints

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]
[45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, -128, -127, -126, -125, -124, -123, -122, -121, -120, -119, -118, -117, -116, -115, -114, -113, -112, -111, -110, -109, -108, -107, -106, -105, -104, -103, -102, -101, -100, -99, -98, -97, -96, -95, -94, -93, -92, -91, -90, -89, -88, -87, -86, -85, -84, -83, -82, -81, -80, -79, -78, -77, -76, -75, -74, -73, -72, -71, -70, -69, -68, -67, -66, -65, -64, -63, -62, -61, -60, -59, -58, -57, -56, -55, -54, -53, -52, -51, -50, -49, -48, -47, -46, -45, -44, -43, -42, -41, -40, -39, -38, -37, -36, -35, -34, -33, -32, -31, -30, -29, -28, -27, -26, -25, -24, -23, -22, -21, -20, -19, -18, -17, -16, -15, -14, -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1]

Calling

byte[] rawBytes = new byte[]{72,118,121,49,85,118,97,113,111,124,44,124,49,48,43,57,48,36,63,49,66,70,22,18,124,44,124,124,44,124,23,27,25,54,24,24,34,44,57,69,66,49,47,66,16,39,35,32,36,30,50,63,124,44,124,16,18,24,64,4,94,124,44,124,19,31,42,55,66,46,34,62,34,37,124,44,124,124,44,124};
List<byte[]> list = splitMessageSept19(rawBytes, "|,|");

produces

[72, 118, 121, 49, 85, 118, 97, 113, 111]
[49, 48, 43, 57, 48, 36, 63, 49, 66, 70, 22, 18]
[]
[23, 27, 25, 54, 24, 24, 34, 44, 57, 69, 66, 49, 47, 66, 16, 39, 35, 32, 36, 30, 50, 63]
[16, 18, 24, 64, 4, 94]
[19, 31, 42, 55, 66, 46, 34, 62, 34, 37]
[]
[]
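
To see why the charset matters here, the following is a minimal sketch that round-trips all 256 byte values through a String. The class name Latin1RoundTrip and the helper roundTrip are made up for illustration, and StandardCharsets assumes Java 7 or later. It only checks that ISO-8859-1 returns the original bytes while UTF-8 does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {

    // Hypothetical helper: decode the bytes with the given charset,
    // then encode the resulting String back with the same charset.
    static byte[] roundTrip(byte[] input, Charset charset) {
        return new String(input, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // Every possible byte value, 0x00 through 0xFF.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }

        // ISO-8859-1 maps each byte to exactly one char and back.
        System.out.println(Arrays.equals(all, roundTrip(all, StandardCharsets.ISO_8859_1)));

        // UTF-8 treats lone bytes >= 0x80 as malformed and replaces them,
        // so the original bytes are not recovered.
        System.out.println(Arrays.equals(all, roundTrip(all, StandardCharsets.UTF_8)));
    }
}
```

The first line printed is true and the second is false, which is why the split above is safe with ISO-8859-1 but would not be with UTF-8.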

2 Comments

I was looking for an algorithm that doesn't convert the byte stream to a String, tokenize it in String form and convert the result back to bytes, as I feared there would be encoding issues with that. I tried your solution and it doesn't cause encoding issues. Thanks for the quick reply.
The trick is to use ISO-8859-1 encoding which just maps n => n provided 0 <= n <= 255.
0

You should take a look at the KMP (Knuth-Morris-Pratt) algorithm on Wikipedia, and at other string search algorithms as well.

As a quick fix to your code try this:

public static List<byte[]> splitMessageSept19(byte[] rawByte, String tokenDelimiter) throws UnsupportedEncodingException
{
    List<byte[]> tokens = new ArrayList<byte[]>();

    final byte[] byteArray = tokenDelimiter.getBytes("UTF-8");
    int lastIndex = 0;

    for (int iterator = 0; iterator < rawByte.length - byteArray.length + 1; )
    {
        boolean patternFound = true;
        for (int i = 0; i < byteArray.length; i++)
        {
            if (rawByte[iterator + i] != byteArray[i])
            {
                patternFound = false;
                break;
            }
        }
        if (patternFound)
        {
            byte[] byteArrayExtracted = new byte[iterator - lastIndex];
            System.arraycopy(rawByte, lastIndex, byteArrayExtracted, 0, iterator - lastIndex);
            iterator += byteArray.length;
            lastIndex = iterator;
            tokens.add(byteArrayExtracted);
        }
        else
            iterator++;

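    }
    // The loop only emits tokens that are followed by a delimiter, so add
    // whatever follows the last delimiter (possibly an empty token).
    byte[] lastToken = new byte[rawByte.length - lastIndex];
    System.arraycopy(rawByte, lastIndex, lastToken, 0, lastToken.length);
    tokens.add(lastToken);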
    for (byte[] bytetoken : tokens)
    {
        System.out.println("Token received is: " + new String(bytetoken, "UTF-8"));
    }
    return tokens;
}

I haven't compiled this code, so it might be broken, but I hope you get the idea.

This is a naive algorithm, which is pretty slow, especially if the delimiter is long: in the worst case it compares every byte of the delimiter at every position of the input. If you want something better, take a look at some other string search algorithms.
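
For reference, here is a rough sketch of what a KMP-based splitter could look like. It is not taken from the question or from any library; the class name KmpByteSplitter and the methods buildFailureTable and split are hypothetical, and it takes the delimiter as a byte[] instead of a String to avoid any charset handling. It matches delimiters left to right without overlap and keeps empty tokens, including trailing ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KmpByteSplitter {

    // fail[i] is the length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    static int[] buildFailureTable(byte[] pattern) {
        int[] fail = new int[pattern.length];
        int k = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (k > 0 && pattern[i] != pattern[k]) {
                k = fail[k - 1];
            }
            if (pattern[i] == pattern[k]) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    public static List<byte[]> split(byte[] data, byte[] delimiter) {
        if (delimiter.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        int[] fail = buildFailureTable(delimiter);
        List<byte[]> tokens = new ArrayList<byte[]>();
        int tokenStart = 0; // index where the current token begins
        int matched = 0;    // number of delimiter bytes matched so far

        for (int i = 0; i < data.length; i++) {
            // On a mismatch, fall back in the delimiter instead of
            // rewinding the position in the input.
            while (matched > 0 && data[i] != delimiter[matched]) {
                matched = fail[matched - 1];
            }
            if (data[i] == delimiter[matched]) {
                matched++;
            }
            if (matched == delimiter.length) {
                int delimiterStart = i - delimiter.length + 1;
                tokens.add(Arrays.copyOfRange(data, tokenStart, delimiterStart));
                tokenStart = i + 1;
                matched = 0; // restart so that delimiters never overlap
            }
        }
        // Whatever follows the last delimiter (possibly empty) is a token too.
        tokens.add(Arrays.copyOfRange(data, tokenStart, data.length));
        return tokens;
    }
}
```

Calling KmpByteSplitter.split(rawBytes, new byte[]{124, 44, 124}) on the second example from the question should give the eight tokens listed there. For a three-byte delimiter the gain over the naive loop is negligible; the failure table only pays off for long delimiters with repeated prefixes.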
