I have to embed over 300,000 product descriptions for a multi-classification project. I split the descriptions into chunks of 34,337 descriptions each to stay under the Batch embeddings size limit.
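For illustration, the chunking step could look roughly like the sketch below. `chunked` is a hypothetical helper (not part of the OpenAI SDK), the descriptions are placeholder strings, and 34,337 is simply the chunk size quoted above, not an official limit.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `chunk_size` elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


# Placeholder data standing in for the real product descriptions.
descriptions = [f"product {i}" for i in range(300_000)]
chunks = list(chunked(descriptions, 34_337))
print(len(chunks), len(chunks[-1]))
```

With exactly 300,000 items this gives 9 chunks: eight full ones and a final chunk of 25,304 descriptions.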

A sample of my jsonl file for batch processing:

{"custom_id": "request-0", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": "Base L\u00edquida Maybelline Superstay 24 Horas Full Coverage Cor 220 Natural Beige 30ml", "encoding_format": "float"}}
{"custom_id": "request-1", "method": "POST", "url": "/v1/embeddings", "body": {"model": "text-embedding-ada-002", "input": "Sand\u00e1lia Havaianas Top Animals Cinza/Gelo 39/40", "encoding_format": "float"}}

My jsonl file has 34,337 lines.
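A file in this shape can be produced with something like the following sketch. `write_batch_file` is a hypothetical helper name, and the temporary path and single sample description are only there to make the example self-contained; the request fields mirror the sample lines above.

```python
import json
import os
import tempfile
from typing import Iterable


def write_batch_file(descriptions: Iterable[str], path: str,
                     model: str = "text-embedding-ada-002",
                     start_index: int = 0) -> int:
    """Write one embeddings request per line; return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for offset, text in enumerate(descriptions):
            request = {
                "custom_id": f"request-{start_index + offset}",
                "method": "POST",
                "url": "/v1/embeddings",
                "body": {"model": model, "input": text, "encoding_format": "float"},
            }
            # json.dumps escapes non-ASCII characters by default (e.g. \u00e1),
            # which matches the sample lines shown above.
            f.write(json.dumps(request) + "\n")
            count += 1
    return count


# Demo with a single description written to a temporary file.
sample = ["Sandália Havaianas Top Animals Cinza/Gelo 39/40"]
path = os.path.join(tempfile.mkdtemp(), "batch_emb_file_demo.jsonl")
n_written = write_batch_file(sample, path)
with open(path, encoding="utf-8") as f:
    first = json.loads(f.readline())
```

The `start_index` parameter is an assumption on my part: it keeps `custom_id` values unique across chunks if each chunk goes into its own file.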

I've successfully uploaded the file:

File 'batch_emb_file_1.jsonl' uploaded succesfully:
 FileObject(id='redacted for work compliance', bytes=6663946, created_at=1720128016, filename='batch_emb_file_1.jsonl', object='file', purpose='batch', status='processed', status_details=None)

and ran the embedding job:

Batch job created successfully:
 Batch(id='redacted for work compliance', completion_window='24h', created_at=1720129886, endpoint='/v1/embeddings', input_file_id='redacted for work compliance', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1720216286, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'Batch job for embedding large quantity of product descriptions', 'initiated_by': 'Marcio', 'project': 'Product Classification', 'date': '2024-07-04 21:51', 'comments': 'This is the 1 batch job of embeddings'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))

The work was completed:

client.batches.retrieve(batch_job_1.id).status
'completed'

Calling client.batches.retrieve('redacted for work compliance') returns:

Batch(id='redacted for work compliance', completion_window='24h', created_at=1720129886, endpoint='/v1/embeddings', input_file_id='redacted for work compliance', object='batch', status='completed', cancelled_at=None, cancelling_at=None, completed_at=1720135956, error_file_id=None, errors=None, expired_at=None, expires_at=1720216286, failed_at=None, finalizing_at=1720133521, in_progress_at=1720129903, metadata={'description': 'Batch job for embedding large quantity of product descriptions', 'initiated_by': 'Marcio', 'project': 'Product Classification', 'date': '2024-07-04 21:51', 'comments': 'This is the 1 batch job of embeddings'}, output_file_id='redacted for work compliance', request_counts=BatchRequestCounts(completed=34337, failed=0, total=34337))

But when I try to get the content using the output_file_id string, the call

client.files.content(value of output_file_id)

returns only a response object:

<openai._legacy_response.HttpxBinaryResponseContent at 0x79ae81ec7d90>

I have also tried client.files.content(value of output_file_id).content, but this kills my kernel.

What am I doing wrong? Also, I believe I am underutilizing Batch embeddings: the 90,000 limit conflicts with the Batch Queue Limit of the 'text-embedding-ada-002' model, which is 3,000,000.

Could someone help?

1 Answer

Retrieving the embedding data from a batch output file is a bit tricky. This tutorial breaks it down step by step: link

After getting the output_file_id, you need to:

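import json
import pandas as pd

# .text decodes the whole output file into one string (one JSON object per line)
output_file = client.files.content(output_file_id).text

embedding_results = []
for line in output_file.splitlines():
    if not line.strip():
        continue  # skip blank lines, e.g. after the trailing newline
    data = json.loads(line)
    custom_id = data.get('custom_id')
    embedding = data['response']['body']['data'][0]['embedding']
    embedding_results.append([custom_id, embedding])

embedding_results = pd.DataFrame(embedding_results, columns=['custom_id', 'embedding'])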

In my case, this retrieves the embedding data from the batch job's output file.
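If loading the whole output at once still kills the kernel, memory may be the cause (my assumption; I have not confirmed it for the asker's setup). A lower-memory variant is sketched below: save the output to disk first, then parse it one line at a time. `download_batch_output` and `iter_embeddings` are hypothetical helper names; `write_to_file` is assumed to be available on the object returned by `client.files.content(...)` in your installed openai-python version, and the demo file is made-up data in the same shape the answer's code expects.

```python
import json
import os
import tempfile
from array import array
from typing import Iterator, Tuple


def download_batch_output(client, output_file_id: str, path: str) -> None:
    """Save the batch output to disk instead of keeping it in a Python string.

    Assumes `client` is an openai.OpenAI instance and that the returned
    response object exposes write_to_file().
    """
    client.files.content(output_file_id).write_to_file(path)


def iter_embeddings(path: str) -> Iterator[Tuple[str, array]]:
    """Yield (custom_id, embedding) pairs, reading one JSONL line at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines such as a trailing newline
            record = json.loads(line)
            body = (record.get("response") or {}).get("body") or {}
            data = body.get("data")
            if not data:
                continue  # no embedding in this record, e.g. a failed request
            # array('f') stores 32-bit floats, which is much more compact
            # than a list of Python float objects.
            yield record["custom_id"], array("f", data[0]["embedding"])


# Demo on made-up records: one success and one failure.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_output.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    ok = {"custom_id": "request-0",
          "response": {"body": {"data": [{"embedding": [0.5, -0.25]}]}}}
    bad = {"custom_id": "request-1", "response": None,
           "error": {"message": "demo failure"}}
    f.write(json.dumps(ok) + "\n")
    f.write(json.dumps(bad) + "\n")

results = list(iter_embeddings(demo_path))
```

In real use you would call `download_batch_output(client, output_file_id, "batch_output_1.jsonl")` once and then iterate over `iter_embeddings("batch_output_1.jsonl")`. Note that `array("f", ...)` rounds values to 32-bit precision, so use `array("d", ...)` if you need the full precision returned by the API.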
