I have a single Python function that I am using to embed JSON objects of different lengths. The issue I am having is that, somehow, the dimensions are different when comparing the vectors, and I have no idea why. First, here is my embedding function:
def get_embeddings(json_object: json) -> list:
    json_splitter = RecursiveJsonSplitter(max_chunk_size=2000)
    json_docs = json_splitter.split_json(json_object, True)
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=3072)
    total_embeddings = []
    for json_doc in json_docs:
        vector_results = embeddings.embed_query(json.dumps(json_doc))
        if vector_results is not None:
            for vector in vector_results:
                total_embeddings.append(vector)
    return total_embeddings
I then save those embeddings in a JSON object with a call such as
json_object["embeddings"] = get_embeddings(input_json)
I wrote a similarity method using numpy that is as follows:
def get_similarity_score(vector_set1, vector_set2) -> float:
    # Convert the vector sets to numpy arrays
    vector_set1 = np.array(vector_set1)
    vector_set2 = np.array(vector_set2)
    # Calculate the cosine similarity between the two vector sets
    dot_product = np.dot(vector_set1, vector_set2)
    norm1 = np.linalg.norm(vector_set1)
    norm2 = np.linalg.norm(vector_set2)
    similarity_score = dot_product / (norm1 * norm2)
    # Map the similarity score to a range of 0 to 10
    similarity_score = (similarity_score + 1) * 5
    # Round the similarity score to two decimal places
    similarity_score = round(similarity_score, 2)
    return similarity_score
I call that method through a call such as
this_score = get_similarity_score(json_object1["embeddings"], json_object2["embeddings"])
This is giving me the error:
ValueError: shapes (30720,) and (21504,) not aligned: 30720 (dim 0) != 21504 (dim 0)
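One way to read the two lengths in that error, assuming every chunk produces a vector of exactly 3072 floats (the dimensions value passed to OpenAIEmbeddings above): each length should then be a whole multiple of 3072, and the multiplier is the number of chunks the splitter produced. The helper name describe_flat_embedding and the zero-filled stand-in arrays below are made up for illustration; they are not real embeddings.

```python
import numpy as np

DIMENSIONS = 3072  # assumed to match dimensions=3072 in get_embeddings


def describe_flat_embedding(flat_vector, dimensions=DIMENSIONS):
    """Return (chunk_count, leftover) for a flattened list of chunk embeddings."""
    return divmod(len(flat_vector), dimensions)


# Stand-in arrays with the two lengths from the error message (zeros, not real data)
flat1 = np.zeros(30720)
flat2 = np.zeros(21504)

print(describe_flat_embedding(flat1))  # (10, 0): ten chunks, nothing left over
print(describe_flat_embedding(flat2))  # (7, 0): seven chunks, nothing left over
```

If that assumption holds, the first object was split into 10 chunks and the second into 7, so the flattened lists can never line up for np.dot.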
My JSON objects are long and complex so I tried just creating my own JSON that was simpler but followed the pattern list[dict[str, dict]]. That did not work.
I have tried using vector stores such as ChromaDB and Weaviate but the problem persists.
I am fairly sure I am screwing up the embedding somehow which is resulting in the dimension variance but I have no clue how to fix it.
Does anybody have any ideas?
Thank you!
Here is a link to a list of Topics: https://www.dropbox.com/scl/fi/6bcsu1t10o8zj1f8mz4y5/Topics.txt?rlkey=xfznwo7pwtrwixcs2cnwcqx1b&st=hwvhptnq&dl=0
I run each of those first through the embedding function and then through the get_similarity_score function.
I tried np.reshape but got an error that the array cannot be resized. This article - Cannot reshape array of size into shape - explains that error and why reshape is not an option.
I think that my array of vectors in get_embeddings is causing the issue which means I need to somehow force that into a uniform array.
Any ideas? and THANK YOU!
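A minimal sketch of one possible way to get a uniform array: keep one vector per chunk (a list of lists rather than one flat list of floats) and average them into a single fixed-length vector per JSON object before comparing. Mean pooling is an assumption here, not something the code above already does, and it discards per-chunk detail. The names pool_chunk_vectors and cosine_similarity are hypothetical, and the random arrays only stand in for real embeddings.

```python
import numpy as np


def pool_chunk_vectors(chunk_vectors):
    """Average a list of equal-length chunk embeddings into one vector."""
    matrix = np.asarray(chunk_vectors, dtype=float)  # shape (n_chunks, dim)
    if matrix.ndim != 2:
        raise ValueError("expected a list of equal-length vectors, one per chunk")
    return matrix.mean(axis=0)  # shape (dim,)


def cosine_similarity(v1, v2):
    """Cosine similarity between two 1-D vectors of the same length."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))


# Random stand-ins for two objects split into 10 and 7 chunks (not real embeddings)
rng = np.random.default_rng(0)
doc1_chunks = rng.normal(size=(10, 3072))
doc2_chunks = rng.normal(size=(7, 3072))

doc1_vector = pool_chunk_vectors(doc1_chunks)
doc2_vector = pool_chunk_vectors(doc2_chunks)
print(doc1_vector.shape, doc2_vector.shape)  # both (3072,)
print(cosine_similarity(doc1_vector, doc2_vector))
```

With this shape, both objects end up as 3072-element vectors regardless of how many chunks each one was split into, so np.dot no longer sees mismatched lengths.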
np.dot with two 1d arrays. Review the size (length) of those arrays before you get to the error line (before calling get_similarity_score).
get_embeddings code. get_similarity is raising the error. We need that at least.