
I'm reading binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using pyspark.

import array

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def mapper(features):
    a = array.array('f')
    a.frombytes(features)
    return a.tolist()

def byte_mapper(bytes):
    return str(bytes)

decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])

When only the product_id is selected from the RDD using

decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])

The output for product_id is

["b'1582480311'", "b'\\x00\\x00\\x00\\x00\\x88c-?\\xeb\\xe2'", "b'7@\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\xec/\\x0b?\\x00\\x00\\x00\\x00K\\xea'", "b'\\x00\\x00c\\x7f\\xd9?\\x00\\x00\\x00\\x00'", "b'L\\xa6\\n>\\x00\\x00\\x00\\x00\\xfe\\xd4'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\xe5\\xd0\\xa2='", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'"]

The file is hosted on S3. In each record, the first 10 bytes are the product_id and the next 4096 bytes are the image_features. I'm able to extract all 4096 image features, but I'm facing an issue when reading the first 10 bytes and converting them into a readable format.
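As a side note, the float-decoding step in mapper can be checked outside Spark with a few hand-packed values (a minimal sketch; the numbers below are made up for illustration):

```python
import array
import struct

# Pack four known float32 values to bytes, then recover them the same
# way mapper() does, with array.array('f').frombytes.
packed = struct.pack('4f', 1.0, 2.0, 3.0, 4.0)
a = array.array('f')
a.frombytes(packed)
print(a.tolist())  # [1.0, 2.0, 3.0, 4.0]
```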

  • "but facing issue when reading the first 10 bytes", can you add the issue/error you get? Commented Dec 13, 2019 at 11:09
  • @blackbishop added more info on that, I'm not able to decode the byte values for product_id but it is working fine for all the 4096 float array Commented Dec 13, 2019 at 18:47

1 Answer


EDIT:

Finally, the problem comes from the recordLength. It's not 4096 + 10 but 4096*4 + 10. Changing it to:

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 16394)

Should work. Actually, you can find this in the sample code provided on the site you downloaded the binary file from:

for i in range(4096):
    feature.append(struct.unpack('f', f.read(4))) # <-- so 4096 * 4
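That loop reads 4 bytes per feature, so the full record length can be double-checked with struct.calcsize (a quick sketch, not from the original post):

```python
import struct

# 10 product_id bytes plus 4096 features at 4 bytes each
# ('f' is a 4-byte float in the struct format language)
record_length = 10 + 4096 * struct.calcsize('f')
print(record_length)  # 16394
```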

Old answer:

I think the issue comes from your byte_mapper function. That's not the correct way to convert bytes to a string. You should use decode:

bytes = b'1582480311'
print(str(bytes))
# output: "b'1582480311'"

print(bytes.decode("utf-8"))
# output: '1582480311'

If you're getting the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 4: invalid start byte

That means the product_id string contains non-UTF-8 bytes. If you don't know the input encoding, it's difficult to convert it into a string.

However, you may want to ignore those characters by passing the ignore error handler to the decode function:

bytes.decode("utf-8", "ignore") 
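For illustration, this sketch compares the error handlers on made-up bytes containing an invalid 0x88 byte (these are not real product_ids):

```python
clean = b'1582480311'
noisy = b'15824\x88311'  # 0x88 is not valid UTF-8 on its own

print(clean.decode('utf-8'))            # '1582480311'
print(noisy.decode('utf-8', 'ignore'))  # invalid byte is dropped
print(noisy.decode('utf-8', 'replace')) # invalid byte becomes U+FFFD
```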

8 Comments

I'm able to decode the first record, but it isn't working for the rest of the records.
@tourist How do you know it's not working for the rest? Do you get errors? Also, could you add some records to the question so that we can reproduce it, please?
I'm getting the following error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 4: invalid start byte
Hmm, so this means the input encoding isn't UTF-8. Please see my edit.
@blackbishop, I'm not getting the error, but the result is not decoded properly. I'm getting something like this: ['1582480311', '\x00\x00\x00\x00\x88c-?ëâ', '7@\x00\x00\x00\x00\x00\x00\x00\x00', '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'ì/\x0b?\x00\x00\x00\x00Kê', '\x00\x00c\x7fÙ?\x00\x00\x00\x00', 'L¦\n>\x00\x00\x00\x00þÔ', '\x00\x00\x00\x00\x00\x00åТ=', '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'] This is with both ascii and utf-8 using the ignore option.
