Why are my dimensions different when using OpenAI embeddings in Python?

Question

I have a single Python function that I am using to embed JSON objects of different lengths. The issue I am having is that, somehow, the dimensions are different when comparing the vectors, and I have no idea why. First, here is my embedding function:

def get_embeddings(json_object: json) -> list:
    json_splitter = RecursiveJsonSplitter(max_chunk_size=2000)
    json_docs = json_splitter.split_json(json_object, True)
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=3072)
    total_embeddings = []
    for json_doc in json_docs:
        vector_results = embeddings.embed_query(json.dumps(json_doc))
        if vector_results is not None:
            for vector in vector_results:
                total_embeddings.append(vector)
    return total_embeddings

I then save those embeddings in a JSON object with a call such as

json_object["embeddings"] = get_embeddings(input_json)

I wrote a similarity method using numpy that is as follows:

def get_similarity_score(vector_set1, vector_set2) -> float:
    # Convert the vector sets to numpy arrays
    vector_set1 = np.array(vector_set1)
    vector_set2 = np.array(vector_set2)

    # Calculate the cosine similarity between the two vector sets
    dot_product = np.dot(vector_set1, vector_set2)
    norm1 = np.linalg.norm(vector_set1)
    norm2 = np.linalg.norm(vector_set2)
    similarity_score = dot_product / (norm1 * norm2)

    # Map the similarity score to a range of 0 to 10
    similarity_score = (similarity_score + 1) * 5

    # Round the similarity score to two decimal places
    similarity_score = round(similarity_score, 2)

    return similarity_score

I call that method through a call such as

this_score = get_similarity_score(json_object1["embeddings"], json_object2["embeddings"])

This is giving me the error:

ValueError: shapes (30720,) and (21504,) not aligned: 30720 (dim 0) != 21504 (dim 0)
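
Both lengths in this error are exact multiples of 3072, the dimensions value passed to OpenAIEmbeddings above. That is consistent with get_embeddings appending every float of every chunk's vector to one flat list, so the total length would depend on how many chunks the splitter produced. A minimal sketch of that arithmetic (chunk_count is a hypothetical helper, not part of the original code):

```python
DIMENSIONS = 3072  # the dimensions value passed to OpenAIEmbeddings in the question


def chunk_count(flat_length, dimensions=DIMENSIONS):
    """Return (number of whole vectors, leftover floats) for a flattened list."""
    return divmod(flat_length, dimensions)


# The two lengths reported in the ValueError.
for flat_length in (30720, 21504):
    chunks, remainder = chunk_count(flat_length)
    print(f"{flat_length} floats = {chunks} vectors of {DIMENSIONS}, remainder {remainder}")
```

Under that reading, one object was split into 10 chunks and the other into 7, which would explain why np.dot sees two 1-D arrays of different lengths.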

My JSON objects are long and complex so I tried just creating my own JSON that was simpler but followed the pattern list[dict[str, dict]]. That did not work.

I have tried using vector stores such as ChromaDB and Weaviate but the problem persists.

I am fairly sure I am screwing up the embedding somehow which is resulting in the dimension variance but I have no clue how to fix it.

Does anybody have any ideas?

Thank you!

Here is a link to a list of Topics: https://www.dropbox.com/scl/fi/6bcsu1t10o8zj1f8mz4y5/Topics.txt?rlkey=xfznwo7pwtrwixcs2cnwcqx1b&st=hwvhptnq&dl=0

I run each of those first through get_embeddings and then through get_similarity_score.

I tried np.reshape but got an error that the array cannot be resized. This article - Cannot reshape array of size into shape - explains that error and why reshape is not an option. I think that my array of vectors in get_embeddings is causing the issue which means I need to somehow force that into a uniform array.
Any ideas? and THANK YOU!
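
One possible way to get a uniform shape, offered as a sketch and not as the approach the accepted answer takes, is to keep one vector per chunk (for example by appending vector_results itself in get_embeddings and dropping the inner loop) and then average the chunk vectors into a single fixed-length vector. The helper name pool_chunk_vectors and the random stand-in data are assumptions for illustration; no API call is made here.

```python
import numpy as np

DIMENSIONS = 3072  # matches the dimensions value used in get_embeddings


def pool_chunk_vectors(chunk_vectors):
    """Average per-chunk vectors (hypothetical helper) into one fixed-length vector."""
    stacked = np.asarray(chunk_vectors, dtype=float)  # expected shape: (n_chunks, DIMENSIONS)
    if stacked.ndim != 2:
        raise ValueError("expected one vector per chunk, all of the same length")
    return stacked.mean(axis=0)  # shape: (DIMENSIONS,)


# Stand-in data: random vectors in place of real embedding results.
rng = np.random.default_rng(0)
doc1_chunks = rng.normal(size=(10, DIMENSIONS))  # an object split into 10 chunks
doc2_chunks = rng.normal(size=(7, DIMENSIONS))   # an object split into 7 chunks

vec1 = pool_chunk_vectors(doc1_chunks)
vec2 = pool_chunk_vectors(doc2_chunks)

# Both vectors now have the same length, so the dot product is defined.
cosine = float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))
print(vec1.shape, vec2.shape)
```

With this layout the number of chunks no longer affects the length of the stored embedding, so get_similarity_score would receive two arrays of shape (3072,). Whether averaging preserves enough meaning for a given dataset is something to verify on real data.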

Where's the error? (full traceback?). Sounds like a np.dot with two 1d arrays. Review the size (length) of those arrays - before you get to the error line. (before calling get_similarity_score)
– hpaulj, Commented Jul 3, 2024 at 19:06
Without the json inputs (samples?) we can't do much to debug the get_embeddings code.
– hpaulj, Commented Jul 3, 2024 at 19:11
Gotcha - I can never upload files here so I am not sure how to get the JSON to you. The last time I added in a full stack trace I got yelled at by some commenters so I was afraid to do so again. I will work to get both added to the question. I am currently running a test using the numpy reshape option to see if that helps.
– Ken Tola, Commented Jul 3, 2024 at 19:28
I just updated with the file for testing. Thank you all for your help! I have been stuck on this for weeks now.
– Ken Tola, Commented Jul 3, 2024 at 19:49
While occasionally a full traceback is excessively long, I don't expect that to be the case here. If my guess is right it will just show which line in get_similarity_score is raising the error. We need that at least.
– hpaulj, Commented Jul 3, 2024 at 19:59

Ken Tola · Accepted Answer · 2024-07-10 00:18:47Z


As it turns out, the answer was very easy: I just used NumPy's resize. First I created all of the embeddings, then I went through them, found the average size, and used resize to make every embedding that size.

Problem solved!
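
A minimal sketch of the steps described above, assuming "average size" means the mean of the flattened lengths rounded to an integer (the answer does not give its code, so resize_to_average and the toy data are hypothetical):

```python
import numpy as np


def resize_to_average(embeddings):
    """Resize every flattened embedding to the mean length (assumed interpretation)."""
    target = int(round(np.mean([len(e) for e in embeddings])))
    # np.resize truncates longer inputs and repeats shorter ones to fill the target.
    return [np.resize(np.asarray(e, dtype=float), target) for e in embeddings]


shorter = [1.0, 2.0, 3.0]
longer = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

resized = resize_to_average([shorter, longer])
print(resized[0])  # [1. 2. 3. 1. 2.]
print(resized[1])  # [1. 2. 3. 4. 5.]
```

Note that np.resize fills a longer target by repeating the input from the start and shortens by dropping trailing values, so the resulting arrays line up in length but no longer contain exactly the original embedding values.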

