I have a single Python function that I am using to embed JSON objects of different lengths. The issue I am having is that, somehow, the dimensions are different when comparing the vectors, and I have no idea why. First, here is my embedding function:

def get_embeddings(json_object: json) -> list:
    json_splitter = RecursiveJsonSplitter(max_chunk_size=2000)
    json_docs = json_splitter.split_json(json_object, True)
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=3072)
    total_embeddings = []
    for json_doc in json_docs:
        vector_results = embeddings.embed_query(json.dumps(json_doc))
        if vector_results is not None:
            for vector in vector_results:
                total_embeddings.append(vector)
    return total_embeddings

I then save those embeddings in a JSON object with a call such as

json_object["embeddings"] = get_embeddings(input_json)

I wrote a similarity method using numpy that is as follows:

import numpy as np


def get_similarity_score(vector_set1, vector_set2) -> float:
    # Convert the vector sets to numpy arrays
    vector_set1 = np.array(vector_set1)
    vector_set2 = np.array(vector_set2)

    # Calculate the cosine similarity between the two vector sets
    dot_product = np.dot(vector_set1, vector_set2)
    norm1 = np.linalg.norm(vector_set1)
    norm2 = np.linalg.norm(vector_set2)
    similarity_score = dot_product / (norm1 * norm2)

    # Map the similarity score to a range of 0 to 10
    similarity_score = (similarity_score + 1) * 5

    # Round the similarity score to two decimal places
    similarity_score = round(similarity_score, 2)

    return similarity_score

I call that method through a call such as

this_score = get_similarity_score(json_object1["embeddings"], json_object2["embeddings"])

This is giving me the error:

ValueError: shapes (30720,) and (21504,) not aligned: 30720 (dim 0) != 21504 (dim 0)
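
Both lengths in this error are exact multiples of 3072, the `dimensions` value passed to `OpenAIEmbeddings`. That is consistent with the inner loop in `get_embeddings`, which appends every float of every chunk vector to one flat list, so the list length is roughly (number of chunks) x 3072. A minimal check of the arithmetic, where the helper name `chunk_count` is hypothetical and used only for illustration:

```python
EMBEDDING_DIM = 3072  # the dimensions value used in get_embeddings


def chunk_count(total_length, dim=EMBEDDING_DIM):
    """Return (number of whole chunk vectors, leftover floats)."""
    return divmod(total_length, dim)


# The two lengths reported in the ValueError.
for total_length in (30720, 21504):
    chunks, remainder = chunk_count(total_length)
    print(f"{total_length} = {chunks} x {EMBEDDING_DIM}, remainder {remainder}")
```

If the remainders are zero, the two JSON objects were most likely split into a different number of chunks (10 and 7 here), which is why `np.dot` sees two 1-D arrays of different length.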

My JSON objects are long and complex so I tried just creating my own JSON that was simpler but followed the pattern list[dict[str, dict]]. That did not work.

I have tried using vector stores such as ChromaDB and Weaviate but the problem persists.

I am fairly sure I am screwing up the embedding somehow which is resulting in the dimension variance but I have no clue how to fix it.

Does anybody have any ideas?

Thank you!

Here is a link to a list of Topics: https://www.dropbox.com/scl/fi/6bcsu1t10o8zj1f8mz4y5/Topics.txt?rlkey=xfznwo7pwtrwixcs2cnwcqx1b&st=hwvhptnq&dl=0

I run each of those first through get_embeddings and then through get_similarity_score.

I tried np.reshape but got an error that the array cannot be resized. This article - Cannot reshape array of size into shape - explains that error and why reshape is not an option. I think that my array of vectors in get_embeddings is causing the issue which means I need to somehow force that into a uniform array.
Any ideas? and THANK YOU!
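
One possible way to force the stored embeddings into a uniform size is mean pooling: reshape each flat list into one row per chunk and average the rows, which always yields a single 3072-length vector. This is a sketch of a swapped-in technique, not something from the original code. It assumes each stored list is a whole number of 3072-float chunk vectors laid end to end; the names `pool_flat_embeddings` and `cosine_similarity` are hypothetical, and random numbers stand in for real embeddings so it runs without an API key.

```python
import numpy as np

EMBEDDING_DIM = 3072  # matches dimensions=3072 in get_embeddings


def pool_flat_embeddings(flat_embeddings, dim=EMBEDDING_DIM):
    """Average concatenated chunk vectors into one vector of length dim."""
    flat = np.asarray(flat_embeddings, dtype=float)
    if flat.size == 0 or flat.size % dim != 0:
        raise ValueError(f"length {flat.size} is not a positive multiple of {dim}")
    # One row per chunk, then average the rows.
    return flat.reshape(-1, dim).mean(axis=0)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Random stand-ins for two stored embeddings: 10 chunks and 7 chunks.
rng = np.random.default_rng(0)
doc1 = rng.normal(size=10 * EMBEDDING_DIM).tolist()
doc2 = rng.normal(size=7 * EMBEDDING_DIM).tolist()

pooled1 = pool_flat_embeddings(doc1)
pooled2 = pool_flat_embeddings(doc2)
score = cosine_similarity(pooled1, pooled2)
```

Because both pooled vectors have the same shape, they could be passed to `get_similarity_score` unchanged. Averaging does lose per-chunk detail, so whether it is good enough depends on the data.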

  • Where's the error? (Full traceback?) Sounds like a np.dot with two 1d arrays. Review the size (length) of those arrays before you get to the error line (before calling get_similarity_score). Commented Jul 3, 2024 at 19:06
  • Without the json inputs (samples?) we can't do much to debug the get_embeddings code. Commented Jul 3, 2024 at 19:11
  • Gotcha - I can never upload files here so I am not sure how to get the JSON to you. The last time I added in a full stack trace I got yelled at by some commenters so I was afraid to do so again. I will work to get both added to the question. I am currently running a test using the numpy reshape option to see if that helps. Commented Jul 3, 2024 at 19:28
  • I just updated with the file for testing. Thank you all for your help! I have been stuck on this for weeks now. Commented Jul 3, 2024 at 19:49
  • While occasionally a full traceback is excessively long, I don't expect that to be the case here. If my guess is right it will just show which line in get_similarity is raising the error. We need that at least. Commented Jul 3, 2024 at 19:59

1 Answer

As it turns out, the answer was very easy: I just used NumPy's resize. First I created all of the embeddings, then I went through and found the average size and used resize to make all of the embeddings that size.

Problem solved!
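
A minimal sketch of what that approach might look like, assuming the function meant is `np.resize` and using placeholder arrays with the two lengths from the error message instead of real embeddings:

```python
import numpy as np

# Placeholder data with the lengths from the ValueError (not real embeddings).
embeddings = [np.arange(30720, dtype=float), np.arange(21504, dtype=float)]

# Average length across all embeddings, rounded to a whole number.
target_len = int(round(np.mean([len(e) for e in embeddings])))

# np.resize truncates longer arrays and fills shorter ones with
# repeated copies of their own data, starting again from the first element.
resized = [np.resize(e, target_len) for e in embeddings]
```

After this, every array has the same length, so `np.dot` no longer raises. Note that the longer embedding loses its trailing values and the shorter one is padded by repeating its own leading values, so the resulting score compares modified data.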
