
I'm reading binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using pyspark.

import array

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def mapper(features):
    a = array.array('f')
    a.frombytes(features)
    return a.tolist()

def byte_mapper(bytes):
    return str(bytes)

decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])

When only the product_id is selected from the RDD using

decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])

The output for product_id is

["b'1582480311'", "b'\\x00\\x00\\x00\\x00\\x88c-?\\xeb\\xe2'", "b'7@\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\xec/\\x0b?\\x00\\x00\\x00\\x00K\\xea'", "b'\\x00\\x00c\\x7f\\xd9?\\x00\\x00\\x00\\x00'", "b'L\\xa6\\n>\\x00\\x00\\x00\\x00\\xfe\\xd4'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\xe5\\xd0\\xa2='", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'"]

The file is hosted on S3. In each record, the first 10 bytes are the product_id and the next 4096 bytes are the image_features. I'm able to extract all 4096 image features, but I'm facing an issue when reading the first 10 bytes and converting them into a readable format.
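As a side note, the float-decoding step in mapper can be checked outside Spark with a few hand-packed values (a minimal sketch; the numbers below are made up for illustration):

```python
import array
import struct

# Pack four known float32 values to bytes, then recover them the same
# way mapper() does, with array.array('f').frombytes.
packed = struct.pack('4f', 1.0, 2.0, 3.0, 4.0)
a = array.array('f')
a.frombytes(packed)
print(a.tolist())  # [1.0, 2.0, 3.0, 4.0]
```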

  • "but facing issue when reading the first 10 bytes", can you add the issue/error you get? Commented Dec 13, 2019 at 11:09
  • @blackbishop added more info on that, I'm not able to decode the byte values for product_id but it is working fine for all the 4096 float array Commented Dec 13, 2019 at 18:47

1 Answer


EDIT:

Finally, the problem comes from the recordLength. It's not 4096 + 10 but 4096*4 + 10. Changing it to:

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 16394)

Should work. Actually, you can find this in the sample code provided on the site you downloaded the binary file from:

for i in range(4096):
    feature.append(struct.unpack('f', f.read(4))) # <-- so 4096 * 4
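That loop reads 4 bytes per feature, so the full record length can be double-checked with struct.calcsize (a quick sketch, not from the original post):

```python
import struct

# 10 product_id bytes plus 4096 features at 4 bytes each
# ('f' is a 4-byte float in the struct format language)
record_length = 10 + 4096 * struct.calcsize('f')
print(record_length)  # 16394
```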

Old answer:

I think the issue comes from your byte_mapper function. That's not the correct way to convert bytes to a string. You should use decode:

bytes = b'1582480311'
print(str(bytes))
# output: "b'1582480311'"

print(bytes.decode("utf-8"))
# output: '1582480311'

If you're getting the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 4: invalid start byte

That means the product_id string contains non-UTF-8 bytes. If you don't know the input encoding, it's difficult to convert it into a string.

However, you may want to ignore those characters by passing the ignore error handler to the decode function:

bytes.decode("utf-8", "ignore") 
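For illustration, this sketch compares the error handlers on made-up bytes containing an invalid 0x88 byte (these are not real product_ids):

```python
clean = b'1582480311'
noisy = b'15824\x88311'  # 0x88 is not valid UTF-8 on its own

print(clean.decode('utf-8'))            # '1582480311'
print(noisy.decode('utf-8', 'ignore'))  # invalid byte is dropped
print(noisy.decode('utf-8', 'replace')) # invalid byte becomes U+FFFD
```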

8 Comments

I'm able to decode the first record, but it isn't working for the rest of the records.
@tourist How do you know it's not working for the rest? Do you get errors? Also, could you add some records to the question so that we can reproduce it, please?
I'm getting the following error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 4: invalid start byte
Hmm, so this means the input encoding isn't UTF-8. Please see my edit.
@blackbishop, I'm not getting the error, but the result is not decoded properly. I'm getting something like this: ['1582480311', '\x00\x00\x00\x00\x88c-?ëâ', '7@\x00\x00\x00\x00\x00\x00\x00\x00', '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'ì/\x0b?\x00\x00\x00\x00Kê', '\x00\x00c\x7fÙ?\x00\x00\x00\x00', 'L¦\n>\x00\x00\x00\x00þÔ', '\x00\x00\x00\x00\x00\x00åТ=', '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'] This is with both ascii and utf-8 using the ignore option.
